{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7",
   "metadata": {},
   "source": [
    "## Create _C.virginia_ gene BED file which _includes_ pseudogenes.\n",
    "\n",
    "### Resulting gene BED file will be used for [_C.virginica_ RNAseq/DML sex/OA project](https://github.com/epigeneticstoocean/2018_L18-adult-methylation) (GitHub repo)\n",
    "\n",
    "This notebook relies on [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to be installed and available in your `$PATH`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6",
   "metadata": {},
   "source": [
    "### List computer specs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TODAY'S DATE:\n",
      "Mon Sep 26 01:12:01 PM PDT 2022\n",
      "------------\n",
      "\n",
      "Distributor ID:\tUbuntu\n",
      "Description:\tUbuntu 22.04.1 LTS\n",
      "Release:\t22.04\n",
      "Codename:\tjammy\n",
      "\n",
      "------------\n",
      "HOSTNAME: \n",
      "computer\n",
      "\n",
      "------------\n",
      "Computer Specs:\n",
      "\n",
      "Architecture:                    x86_64\n",
      "CPU op-mode(s):                  32-bit, 64-bit\n",
      "Address sizes:                   45 bits physical, 48 bits virtual\n",
      "Byte Order:                      Little Endian\n",
      "CPU(s):                          8\n",
      "On-line CPU(s) list:             0-7\n",
      "Vendor ID:                       GenuineIntel\n",
      "Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n",
      "CPU family:                      6\n",
      "Model:                           165\n",
      "Thread(s) per core:              1\n",
      "Core(s) per socket:              1\n",
      "Socket(s):                       8\n",
      "Stepping:                        2\n",
      "BogoMIPS:                        4800.01\n",
      "Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n",
      "Hypervisor vendor:               VMware\n",
      "Virtualization type:             full\n",
      "L1d cache:                       256 KiB (8 instances)\n",
      "L1i cache:                       256 KiB (8 instances)\n",
      "L2 cache:                        2 MiB (8 instances)\n",
      "L3 cache:                        128 MiB (8 instances)\n",
      "NUMA node(s):                    1\n",
      "NUMA node0 CPU(s):               0-7\n",
      "Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported\n",
      "Vulnerability L1tf:              Mitigation; PTE Inversion\n",
      "Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n",
      "Vulnerability Meltdown:          Mitigation; PTI\n",
      "Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n",
      "Vulnerability Retbleed:          Mitigation; IBRS\n",
      "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n",
      "Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n",
      "Vulnerability Spectre v2:        Mitigation; IBRS, IBPB conditional, RSB filling\n",
      "Vulnerability Srbds:             Unknown: Dependent on hypervisor status\n",
      "Vulnerability Tsx async abort:   Not affected\n",
      "\n",
      "------------\n",
      "\n",
      "Memory Specs\n",
      "\n",
      "               total        used        free      shared  buff/cache   available\n",
      "Mem:            54Gi       6.4Gi        43Gi       201Mi       5.2Gi        47Gi\n",
      "Swap:          2.0Gi          0B       2.0Gi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No LSB modules are available.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo \"TODAY'S DATE:\"\n",
    "date\n",
    "echo \"------------\"\n",
    "echo \"\"\n",
    "#Display operating system info\n",
    "lsb_release -a\n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"HOSTNAME: \"; hostname \n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"Computer Specs:\"\n",
    "echo \"\"\n",
    "lscpu\n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"\"\n",
    "echo \"Memory Specs\"\n",
    "echo \"\"\n",
    "free -mh"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19866f68-bfad-4a83-adb5-24e271e29d06",
   "metadata": {},
   "source": [
    "### Set variables\n",
    "- `%env` indicates a bash variable\n",
    "\n",
    "- without `%env` is Python variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: data_dir=/home/sam/data/C_virginica/igv_tracks\n",
      "env: analysis_dir=/home/sam/analyses/20220926-cvir-gff-to-bed-genes_and_pseudogenes\n",
      "env: orig_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n",
      "env: orig_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n",
      "env: orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n",
      "env: gtf_extractor_output_genes=20220926-cvir-GCF_002022765.2-genes.bed\n",
      "env: gtf_extractor_output_pseudogenes=20220926-cvir-GCF_002022765.2-pseudogenes.bed\n",
      "env: genes_and_psuedogenes=20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n"
     ]
    }
   ],
   "source": [
    "# Set directories, input/output files\n",
    "%env data_dir=/home/sam/data/C_virginica/igv_tracks\n",
    "%env analysis_dir=/home/sam/analyses/20220926-cvir-gff-to-bed-genes_and_pseudogenes\n",
    "analysis_dir=\"20220926-cvir-gff-to-bed-genes_and_pseudogenes\"\n",
    "\n",
    "# Input GFF (from NCBI)\n",
    "%env orig_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n",
    "%env orig_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n",
    "%env orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n",
    "\n",
    "# GTF extractor output\n",
    "%env gtf_extractor_output_genes=20220926-cvir-GCF_002022765.2-genes.bed\n",
    "%env gtf_extractor_output_pseudogenes=20220926-cvir-GCF_002022765.2-pseudogenes.bed\n",
    "%env genes_and_psuedogenes=20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56052f6d-441a-4048-8a6f-39d58552283d",
   "metadata": {},
   "source": [
    "### Download GFF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "951fc8e9-b821-4f54-848f-f9573daadc83",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 sam sam 412M Dec 10  2019 GCF_002022765.2_C_virginica-3.0_genomic.gff\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "gzip: GCF_002022765.2_C_virginica-3.0_genomic.gff already exists;\tnot overwritten\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "# Download with wget.\n",
    "# Use --quiet option to prevent wget output from printing too many lines to notebook\n",
    "# Use --continue to prevent re-downloading fie if it's already been downloaded.\n",
    "wget --quiet \\\n",
    "--continue \\\n",
    "${orig_gff_url}/${orig_gff_gz}\n",
    "\n",
    "# Unzip downloaded GFF, if it exists\n",
    "if [[ -f \"${orig_gff_gz}\" ]]; then\n",
    "  gunzip \"${orig_gff_gz}\"\n",
    "fi\n",
    "\n",
    "ls -ltrh \"${orig_gff}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb",
   "metadata": {},
   "source": [
    "### Examine GFF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "18862291-d1ec-4b62-8b22-959404538a7f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##gff-version 3\n",
      "#!gff-spec-version 1.21\n",
      "#!processor NCBI annotwriter\n",
      "#!genome-build C_virginica-3.0\n",
      "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\n",
      "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\n",
      "##sequence-region NC_035780.1 1 65668440\n",
      "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565\n",
      "NC_035780.1\tRefSeq\tregion\t1\t65668440\t.\t+\t.\tID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample\n",
      "NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA\n",
      "NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\texon\t13578\t13603\t.\t+\t.\tID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\texon\t14237\t14290\t.\t+\t.\tID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\texon\t14557\t14594\t.\t+\t.\tID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\tgene\t28961\t33324\t.\t+\t.\tID=gene-LOC111126949;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding\n",
      "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna-XM_022471938.1;Parent=gene-LOC111126949;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t28961\t29073\t.\t+\t.\tID=exon-XM_022471938.1-1;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t30524\t31557\t.\t+\t.\tID=exon-XM_022471938.1-2;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t31736\t31887\t.\t+\t.\tID=exon-XM_022471938.1-3;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t31977\t32565\t.\t+\t.\tID=exon-XM_022471938.1-4;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "head -n 20 \"${data_dir}\"/\"${orig_gff}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f389b71-582f-41e5-b9a0-bd2491d968b0",
   "metadata": {},
   "source": [
    "### Use [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to extract gene and pseudo gene features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3dc274bf-228a-4d08-bd30-de6872a9ecc0",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m59.836s\n",
      "user\t0m59.627s\n",
      "sys\t0m0.208s\n",
      "\n",
      "real\t0m59.366s\n",
      "user\t0m59.243s\n",
      "sys\t0m0.112s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "# Make analysis directory, if it doesn't exist\n",
    "mkdir --parents \"${analysis_dir}\"\n",
    "\n",
    "# Extract gene or pseudogene features\n",
    "# Extract chrom,start,end,gene=,and strand fields\n",
    "# \"gene=\" is the NCBI gene name, in this particular instance\n",
    "# Specify input as GFF\n",
    "# Use awk to to insert a \"score\" column before the strand column ($5)\n",
    "# and fill new \"score\" column with arbitrary value of \"0\"\n",
    "time \\\n",
    "gtf_extract \\\n",
    "--feature gene \\\n",
    "--fields=chrom,start,end,ID,strand \\\n",
    "--gff ${data_dir}/${orig_gff} \\\n",
    "| awk 'BEGIN{FS=OFS=\"\\t\"}{$5 = 0 OFS $5}1' \\\n",
    "> ${analysis_dir}/${gtf_extractor_output_genes}\n",
    "\n",
    "time \\\n",
    "gtf_extract \\\n",
    "--feature pseudogene \\\n",
    "--fields=chrom,start,end,ID,strand \\\n",
    "--gff ${data_dir}/${orig_gff} \\\n",
    "| awk 'BEGIN{FS=OFS=\"\\t\"}{$5 = 0 OFS $5}1' \\\n",
    "> ${analysis_dir}/${gtf_extractor_output_pseudogenes}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a718c81a-4fb3-4bab-8dc5-1466ba2452af",
   "metadata": {},
   "source": [
    "#### Check results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "5e368eb2-8d11-414f-be44-f369dc3c3c6d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 sam sam 2.0M Sep 26 13:13 20220926-cvir-GCF_002022765.2-genes.bed\n",
      "-rw-rw-r-- 1 sam sam  34K Sep 26 13:14 20220926-cvir-GCF_002022765.2-pseudogenes.bed\n",
      "\n",
      "==> 20220926-cvir-GCF_002022765.2-genes.bed <==\n",
      "NC_035780.1\t13578\t14594\tgene-LOC111116054\t0\t+\n",
      "NC_035780.1\t28961\t33324\tgene-LOC111126949\t0\t+\n",
      "NC_035780.1\t43111\t66897\tgene-LOC111110729\t0\t-\n",
      "NC_035780.1\t85606\t95254\tgene-LOC111112434\t0\t-\n",
      "NC_035780.1\t99840\t106460\tgene-LOC111120752\t0\t+\n",
      "NC_035780.1\t108305\t110077\tgene-LOC111128944\t0\t-\n",
      "NC_035780.1\t151859\t157536\tgene-LOC111128953\t0\t+\n",
      "NC_035780.1\t163809\t183798\tgene-LOC111105691\t0\t-\n",
      "NC_035780.1\t164820\t166793\tgene-LOC111105685\t0\t+\n",
      "NC_035780.1\t169468\t170178\tgene-LOC111105702\t0\t-\n",
      "\n",
      "==> 20220926-cvir-GCF_002022765.2-pseudogenes.bed <==\n",
      "NC_035780.1\t2215606\t2223571\tgene-LOC111129349\t0\t-\n",
      "NC_035780.1\t2997762\t3000443\tgene-LOC111101388\t0\t+\n",
      "NC_035780.1\t3070311\t3078520\tgene-LOC111129757\t0\t+\n",
      "NC_035780.1\t3376106\t3379058\tgene-LOC111129887\t0\t-\n",
      "NC_035780.1\t3379153\t3381662\tgene-LOC111129920\t0\t-\n",
      "NC_035780.1\t3850672\t3852462\tgene-LOC111125334\t0\t+\n",
      "NC_035780.1\t4397181\t4401310\tgene-LOC111130254\t0\t-\n",
      "NC_035780.1\t6721106\t6745518\tgene-LOC111130754\t0\t+\n",
      "NC_035780.1\t7313912\t7327414\tgene-LOC111128791\t0\t+\n",
      "NC_035780.1\t8003747\t8006376\tgene-LOC111120605\t0\t-\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "ls -ltrh *.bed\n",
    "\n",
    "echo \"\"\n",
    "\n",
    "head *.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d40dadc7-7dbd-40ca-9e2c-80f8d93568e4",
   "metadata": {},
   "source": [
    "#### Check that [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) output seem okay"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c24d1cd0-ac07-42c4-810f-008cf42f7fd5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GFFutils number of extracted genes:\n",
      "38838\n",
      "\n",
      "awk number of extracted genes:\n",
      "38838\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "# Count gene features via GFFutils\n",
    "echo \"GFFutils number of extracted genes:\"\n",
    "gtf_extract --feature gene --fields=ID --gff ${data_dir}/${orig_gff} | wc -l\n",
    "\n",
    "echo \"\"\n",
    "\n",
    "# Count gene features via awk\n",
    "echo \"awk number of extracted genes:\"\n",
    "awk '$3 == \"gene\" { print $0 }' ${data_dir}/${orig_gff} | wc -l\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7431052-7168-416d-b773-20f2d0cf2c08",
   "metadata": {},
   "source": [
    "### Combine and sort BED files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "81259255-8a8e-43d1-bbff-ca233a994be8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "39505 /home/sam/analyses/20220926-cvir-gff-to-bed-genes_and_pseudogenes/20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n",
      "NC_007175.2\t1\t1623\tgene-COX1\t0\t+\n",
      "NC_007175.2\t2558\t3429\tgene-COX3\t0\t+\n",
      "NC_007175.2\t3647\t4859\tgene-CYTB\t0\t+\n",
      "NC_007175.2\t4897\t5589\tgene-COX2\t0\t+\n",
      "NC_007175.2\t9518\t10192\tgene-ATP6\t0\t+\n",
      "NC_007175.2\t10295\t11290\tgene-ND2\t0\t+\n",
      "NC_007175.2\t11515\t12864\tgene-ND4\t0\t+\n",
      "NC_007175.2\t13126\t14793\tgene-ND5\t0\t+\n",
      "NC_007175.2\t14807\t15268\tgene-ND6\t0\t+\n",
      "NC_007175.2\t15352\t15705\tgene-ND3\t0\t+\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "# Write to temp file \n",
    "cat ${analysis_dir}/${gtf_extractor_output_genes} \\\n",
    "${analysis_dir}/${gtf_extractor_output_pseudogenes} \\\n",
    "> ${analysis_dir}/tmp.txt\n",
    "\n",
    "# Sort the file by chromosome, then\n",
    "sort --version-sort -k1,1 -k2,2 ${analysis_dir}/tmp.txt \\\n",
    "> ${analysis_dir}/${genes_and_psuedogenes}\n",
    "\n",
    "# \n",
    "wc -l ${analysis_dir}/${genes_and_psuedogenes}\n",
    "\n",
    "# Check out combined file\n",
    "head ${analysis_dir}/*genes-and-pseudogenes.bed"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24e1a558-e35d-4afc-ae7f-f1a8282b4c06",
   "metadata": {},
   "source": [
    "#### Clean up files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "71a6471d-a602-47ed-a616-53d5482fd95b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 2.0M\n",
      "-rw-rw-r-- 1 sam sam 2.0M Sep 26 13:15 20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "rm ${analysis_dir}/tmp.txt ${analysis_dir}/${gtf_extractor_output_genes} ${analysis_dir}/${gtf_extractor_output_pseudogenes}\n",
    "\n",
    "ls -ltrh ${analysis_dir}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0b01c2f-2c68-4821-ae0e-f19c43231119",
   "metadata": {},
   "source": [
    "### Generate checksums"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "101b49158ac0d7efe103670094118ecc  20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "for file in *\n",
    "do\n",
    "  md5sum \"${file}\" | tee --append checksums.md5\n",
    "done"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}