{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7",
   "metadata": {},
   "source": [
    "## Create _C.virginica_ long non-coding RNA (lncRNA) files.\n",
    "\n",
    "### Downloads files from NCBI.\n",
    "\n",
    "### Notebook relies on:\n",
    "\n",
    "- [GffRead](https://github.com/gpertea/gffread)\n",
    "\n",
    "- [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) available in your `$PATH`.\n",
    "\n",
    "  - I accomplished this by creating/activating a conda environment for [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) and running this notebook from within that environment.\n",
    "\n",
    "- [samtools](http://www.htslib.org/).\n",
    "\n",
    "### Resulting files will be used for [_C.virginica_ RNAseq/DML sex/OA project](https://github.com/epigeneticstoocean/2018_L18-adult-methylation) (GitHub repo)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6",
   "metadata": {},
   "source": [
    "### List computer specs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TODAY'S DATE:\n",
      "Fri 18 Feb 2022 07:10:15 AM PST\n",
      "------------\n",
      "\n",
      "Distributor ID:\tUbuntu\n",
      "Description:\tUbuntu 20.04.3 LTS\n",
      "Release:\t20.04\n",
      "Codename:\tfocal\n",
      "\n",
      "------------\n",
      "HOSTNAME: \n",
      "computer\n",
      "\n",
      "------------\n",
      "Computer Specs:\n",
      "\n",
      "Architecture:                    x86_64\n",
      "CPU op-mode(s):                  32-bit, 64-bit\n",
      "Byte Order:                      Little Endian\n",
      "Address sizes:                   45 bits physical, 48 bits virtual\n",
      "CPU(s):                          2\n",
      "On-line CPU(s) list:             0,1\n",
      "Thread(s) per core:              1\n",
      "Core(s) per socket:              1\n",
      "Socket(s):                       2\n",
      "NUMA node(s):                    1\n",
      "Vendor ID:                       GenuineIntel\n",
      "CPU family:                      6\n",
      "Model:                           165\n",
      "Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n",
      "Stepping:                        2\n",
      "CPU MHz:                         2400.008\n",
      "BogoMIPS:                        4800.01\n",
      "Hypervisor vendor:               VMware\n",
      "Virtualization type:             full\n",
      "L1d cache:                       64 KiB\n",
      "L1i cache:                       64 KiB\n",
      "L2 cache:                        512 KiB\n",
      "L3 cache:                        32 MiB\n",
      "NUMA node0 CPU(s):               0,1\n",
      "Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported\n",
      "Vulnerability L1tf:              Mitigation; PTE Inversion\n",
      "Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n",
      "Vulnerability Meltdown:          Mitigation; PTI\n",
      "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n",
      "Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n",
      "Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling\n",
      "Vulnerability Srbds:             Not affected\n",
      "Vulnerability Tsx async abort:   Not affected\n",
      "Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n",
      "\n",
      "------------\n",
      "\n",
      "Memory Specs\n",
      "\n",
      "              total        used        free      shared  buff/cache   available\n",
      "Mem:           54Gi       3.2Gi        46Gi       138Mi       5.1Gi        50Gi\n",
      "Swap:         2.0Gi          0B       2.0Gi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No LSB modules are available.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo \"TODAY'S DATE:\"\n",
    "date\n",
    "echo \"------------\"\n",
    "echo \"\"\n",
    "#Display operating system info\n",
    "lsb_release -a\n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"HOSTNAME: \"; hostname \n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"Computer Specs:\"\n",
    "echo \"\"\n",
    "lscpu\n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"\"\n",
    "echo \"Memory Specs\"\n",
    "echo \"\"\n",
    "free -mh"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19866f68-bfad-4a83-adb5-24e271e29d06",
   "metadata": {},
   "source": [
    "### Set variables\n",
    "- `%env` sets an environment variable, which `%%bash` cells can read\n",
    "\n",
    "- an assignment without `%env` creates a Python variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: data_dir=/home/sam/data/C_virginica/genomes\n",
      "env: analysis_dir=/home/sam/analyses/20220217-cvir-lncRNA_subsetting\n",
      "env: ncbi_fasta=GCF_002022765.2_C_virginica-3.0_genomic.fna\n",
      "env: ncbi_fasta_index=GCF_002022765.2_C_virginica-3.0_genomic.fna.fai\n",
      "env: ncbi_fasta_gz=GCF_002022765.2_C_virginica-3.0_genomic.fna.gz\n",
      "env: ncbi_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n",
      "env: ncbi_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n",
      "env: ncbi_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n",
      "env: lncRNA_bed=GCF_002022765.2_C_virginica-3.0_lncRNA.bed\n",
      "env: lncRNA_gff=GCF_002022765.2_C_virginica-3.0_lncRNA.gff\n",
      "env: lncRNA_gtf=GCF_002022765.2_C_virginica-3.0_lncRNA.gtf\n",
      "env: lncRNA_fasta=GCF_002022765.2_C_virginica-3.0_lncRNA.fa\n",
      "env: lncRNA_fasta_index=GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n",
      "env: gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread\n",
      "env: samtools=/home/sam/programs/samtools-1.12/samtools\n"
     ]
    }
   ],
   "source": [
    "# Set directories, input/output files\n",
    "%env data_dir=/home/sam/data/C_virginica/genomes\n",
    "%env analysis_dir=/home/sam/analyses/20220217-cvir-lncRNA_subsetting\n",
    "analysis_dir=\"20220217-cvir-lncRNA_subsetting\"\n",
    "\n",
    "# Input files (from NCBI)\n",
    "%env ncbi_fasta=GCF_002022765.2_C_virginica-3.0_genomic.fna\n",
    "%env ncbi_fasta_index=GCF_002022765.2_C_virginica-3.0_genomic.fna.fai\n",
    "%env ncbi_fasta_gz=GCF_002022765.2_C_virginica-3.0_genomic.fna.gz\n",
    "%env ncbi_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n",
    "%env ncbi_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n",
    "\n",
    "# URL to download files from NCBI\n",
    "%env ncbi_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n",
    "\n",
    "# Output files\n",
    "%env lncRNA_bed=GCF_002022765.2_C_virginica-3.0_lncRNA.bed\n",
    "%env lncRNA_gff=GCF_002022765.2_C_virginica-3.0_lncRNA.gff\n",
    "%env lncRNA_gtf=GCF_002022765.2_C_virginica-3.0_lncRNA.gtf\n",
    "%env lncRNA_fasta=GCF_002022765.2_C_virginica-3.0_lncRNA.fa\n",
    "%env lncRNA_fasta_index=GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n",
    "\n",
    "# Set program locations\n",
    "%env gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread\n",
    "%env samtools=/home/sam/programs/samtools-1.12/samtools"
   ]
  },
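  {
   "cell_type": "markdown",
   "id": "b7d1f0a2-3c64-4e58-9a17-5f2e8c0d4b31",
   "metadata": {},
   "source": [
    "A minimal check of the distinction noted under \"Set variables\": a `%%bash` cell runs in a subprocess that inherits the values set with `%env`, but not plain Python assignments. The cell below is only a sketch; it assumes the variable-setting cell has already been run in this kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2e9a6f4-8b15-4d70-a3c8-1e6f7b9d0a52",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Values set with %env are part of the kernel's environment,\n",
    "# so this bash subprocess can expand them directly.\n",
    "echo \"analysis_dir: ${analysis_dir}\"\n",
    "echo \"gffread: ${gffread}\"\n",
    "\n",
    "# The Python variable analysis_dir (assigned without %env) is not passed to bash;\n",
    "# the value printed above comes from the %env line."
   ]
  },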
  {
   "cell_type": "markdown",
   "id": "7f204c16-2d1f-4837-93b0-1fb0e3d00d64",
   "metadata": {},
   "source": [
    "### Create analysis directory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8f275e34-c56e-4754-abf7-3279667434bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Make analysis directory, if it doesn't exist\n",
    "mkdir --parents \"${analysis_dir}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56052f6d-441a-4048-8a6f-39d58552283d",
   "metadata": {},
   "source": [
    "### Download GFF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "951fc8e9-b821-4f54-848f-f9573daadc83",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 sam sam 412M Dec 10  2019 GCF_002022765.2_C_virginica-3.0_genomic.gff\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "gzip: GCF_002022765.2_C_virginica-3.0_genomic.gff already exists;\tnot overwritten\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "# Download with wget.\n",
    "# Use --quiet option to prevent wget output from printing too many lines to notebook\n",
    "# Use --continue to avoid re-downloading the file if it's already been downloaded.\n",
    "wget --quiet \\\n",
    "--continue \\\n",
    "\"${ncbi_url}/${ncbi_gff_gz}\"\n",
    "\n",
    "# Unzip downloaded GFF\n",
    "gunzip \"${ncbi_gff_gz}\"\n",
    "\n",
    "ls -ltrh \"${ncbi_gff}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb",
   "metadata": {},
   "source": [
    "### Examine GFF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "18862291-d1ec-4b62-8b22-959404538a7f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##gff-version 3\n",
      "#!gff-spec-version 1.21\n",
      "#!processor NCBI annotwriter\n",
      "#!genome-build C_virginica-3.0\n",
      "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\n",
      "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\n",
      "##sequence-region NC_035780.1 1 65668440\n",
      "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565\n",
      "NC_035780.1\tRefSeq\tregion\t1\t65668440\t.\t+\t.\tID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample\n",
      "NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA\n",
      "NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\texon\t13578\t13603\t.\t+\t.\tID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\texon\t14237\t14290\t.\t+\t.\tID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\texon\t14557\t14594\t.\t+\t.\tID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\tgene\t28961\t33324\t.\t+\t.\tID=gene-LOC111126949;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding\n",
      "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna-XM_022471938.1;Parent=gene-LOC111126949;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t28961\t29073\t.\t+\t.\tID=exon-XM_022471938.1-1;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t30524\t31557\t.\t+\t.\tID=exon-XM_022471938.1-2;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t31736\t31887\t.\t+\t.\tID=exon-XM_022471938.1-3;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n",
      "NC_035780.1\tGnomon\texon\t31977\t32565\t.\t+\t.\tID=exon-XM_022471938.1-4;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "head -n 20 \"${data_dir}\"/\"${ncbi_gff}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27fb9e06-b925-4bf9-a3d7-154ffed294a2",
   "metadata": {},
   "source": [
    "### Download NCBI genomic FastA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "cf72f0a7-3740-45c8-bc74-12433779df5f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 sam sam 662M Dec 10  2019 GCF_002022765.2_C_virginica-3.0_genomic.fna\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "gzip: GCF_002022765.2_C_virginica-3.0_genomic.fna already exists;\tnot overwritten\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "# Download with wget.\n",
    "# Use --quiet option to prevent wget output from printing too many lines to notebook\n",
    "# Use --continue to avoid re-downloading the file if it's already been downloaded.\n",
    "wget --quiet \\\n",
    "--continue \\\n",
    "\"${ncbi_url}/${ncbi_fasta_gz}\"\n",
    "\n",
    "# Unzip downloaded FastA\n",
    "gunzip \"${ncbi_fasta_gz}\"\n",
    "\n",
    "ls -ltrh \"${ncbi_fasta}\"\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eedff2e-dcaf-4324-8722-45aab3f0a616",
   "metadata": {},
   "source": [
    "### Create FastA index with [Samtools](http://www.htslib.org/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7fb54c0d-d2a6-4578-8d08-0e5ad17a179a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 sam sam 398 Feb 18 07:10 GCF_002022765.2_C_virginica-3.0_genomic.fna.fai\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "${samtools} faidx \"${ncbi_fasta}\"\n",
    "\n",
    "ls -ltrh \"${ncbi_fasta_index}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5406fed7-865e-445c-9c75-55b9e8d56a40",
   "metadata": {},
   "source": [
    "### Inspect NCBI genomic FastA index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "07cc3f6b-a512-426f-98c3-c0ac2d92e0b9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NC_035780.1\t65668440\t117\t80\t81\n",
      "NC_035781.1\t61752955\t66489530\t80\t81\n",
      "NC_035782.1\t77061148\t129014514\t80\t81\n",
      "NC_035783.1\t59691872\t207039044\t80\t81\n",
      "NC_035784.1\t98698416\t267477182\t80\t81\n",
      "NC_035785.1\t51258098\t367409446\t80\t81\n",
      "NC_035786.1\t57830854\t419308388\t80\t81\n",
      "NC_035787.1\t75944018\t477862245\t80\t81\n",
      "NC_035788.1\t104168038\t554755681\t80\t81\n",
      "NC_035789.1\t32650045\t660225938\t80\t81\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "head \"${ncbi_fasta_index}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9827c0d1-ba5c-4ef8-b703-895433a8f3bd",
   "metadata": {},
   "source": [
    "### Extract lncRNAs from genomic GFF using `gtf_extract` from [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3dc274bf-228a-4d08-bd30-de6872a9ecc0",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##gff-version 3\n",
      "#!gff-spec-version 1.21\n",
      "#!processor NCBI annotwriter\n",
      "#!genome-build C_virginica-3.0\n",
      "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\n",
      "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\n",
      "##sequence-region NC_035780.1 1 65668440\n",
      "#!lncRNA only - created by Sam White Fri 18 Feb 2022 07:10:32 AM PST\n",
      "NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n",
      "NC_035780.1\tGnomon\tlnc_RNA\t169468\t170178\t.\t-\t.\tID=rna-XR_002635081.1;Parent=gene-LOC111105702;Dbxref=GeneID:111105702,Genbank:XR_002635081.1;Name=XR_002635081.1;gbkey=ncRNA;gene=LOC111105702;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=uncharacterized LOC111105702;transcript_id=XR_002635081.1\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "# Capture GFF header from NCBI gff\n",
    "head -n 7 \"${ncbi_gff}\" > ${analysis_dir}/\"${lncRNA_gff}\"\n",
    "\n",
    "# Add note about modification\n",
    "printf \"#%s%s\\n\" \"!\" \"lncRNA only - created by Sam White $(date)\" >> ${analysis_dir}/\"${lncRNA_gff}\"\n",
    "\n",
    "\n",
    "# Finds lncRNAs in NCBI GFF\n",
    "gtf_extract \\\n",
    "--feature lnc_RNA \\\n",
    "--gff \"${ncbi_gff}\" \\\n",
    ">> ${analysis_dir}/\"${lncRNA_gff}\"\n",
    "\n",
    "\n",
    "head ${analysis_dir}/\"${lncRNA_gff}\""
   ]
  },
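  {
   "cell_type": "markdown",
   "id": "d4a8c1e6-7f29-4b03-8e5d-2a9b6c3f1d70",
   "metadata": {},
   "source": [
    "A quick sanity check on the extraction above, sketched under the assumption that the only non-comment lines in the subset file are the `lnc_RNA` records appended by `gtf_extract`. If that holds, the two counts printed below should match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5b9d2f7-1a3c-4c86-b0f4-6d8e7a2c9b13",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "# Count lnc_RNA records (feature type is column 3) in the original NCBI GFF\n",
    "awk -F'\\t' '$3 == \"lnc_RNA\"' \"${ncbi_gff}\" | wc -l\n",
    "\n",
    "# Count non-comment lines in the lncRNA-only GFF\n",
    "grep --count --invert-match '^#' \"${analysis_dir}/${lncRNA_gff}\""
   ]
  },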
  {
   "cell_type": "markdown",
   "id": "7b583b9c-ff1d-430f-95b4-c2171c3cd7dd",
   "metadata": {},
   "source": [
    "### Extract lncRNAs to BED using [GffRead](https://github.com/gpertea/gffread)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "c48e30c2-8241-488e-9db5-bf6ed317be0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "${gffread} --bed \\\n",
    "${analysis_dir}/\"${lncRNA_gff}\" \\\n",
    "> ${analysis_dir}/\"${lncRNA_bed}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd342a69-02c3-4822-83a8-b11276185850",
   "metadata": {},
   "source": [
    "### Inspect lncRNA BED"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "3fef57b2-3b95-4871-a447-44212a885146",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NC_035780.1\t13577\t14594\trna-XR_002636969.1\t100\t+\t13577\t14594\t0,0,0\t1\t1017,\t0,\tgeneID=gene-LOC111116054;gene_name=LOC111116054\n",
      "NC_035780.1\t169467\t170178\trna-XR_002635081.1\t100\t-\t169467\t170178\t0,0,0\t1\t711,\t0,\tgeneID=gene-LOC111105702;gene_name=LOC111105702\n",
      "NC_035780.1\t900325\t903430\trna-XR_002636046.1\t100\t+\t900325\t903430\t0,0,0\t1\t3105,\t0,\tgeneID=gene-LOC111111519;gene_name=LOC111111519\n",
      "NC_035780.1\t1280830\t1282416\trna-XR_002638148.1\t100\t-\t1280830\t1282416\t0,0,0\t1\t1586,\t0,\tgeneID=gene-LOC111124195;gene_name=LOC111124195\n",
      "NC_035780.1\t1432943\t1458091\trna-XR_002639675.1\t100\t+\t1432943\t1458091\t0,0,0\t1\t25148,\t0,\tgeneID=gene-LOC111135942;gene_name=LOC111135942\n",
      "NC_035780.1\t1503801\t1513830\trna-XR_002636574.1\t100\t-\t1503801\t1513830\t0,0,0\t1\t10029,\t0,\tgeneID=gene-LOC111114441;gene_name=LOC111114441\n",
      "NC_035780.1\t1856840\t1863683\trna-XR_002636864.1\t100\t-\t1856840\t1863683\t0,0,0\t1\t6843,\t0,\tgeneID=gene-LOC111115591;gene_name=LOC111115591\n",
      "NC_035780.1\t1856840\t1863697\trna-XR_002636863.1\t100\t-\t1856840\t1863697\t0,0,0\t1\t6857,\t0,\tgeneID=gene-LOC111115591;gene_name=LOC111115591\n",
      "NC_035780.1\t2161222\t2166803\trna-XR_002635698.1\t100\t+\t2161222\t2166803\t0,0,0\t1\t5581,\t0,\tgeneID=gene-LOC111109763;gene_name=LOC111109763\n",
      "NC_035780.1\t2928483\t2930094\trna-XR_002637875.1\t100\t-\t2928483\t2930094\t0,0,0\t1\t1611,\t0,\tgeneID=gene-LOC111122009;gene_name=LOC111122009\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "head ${analysis_dir}/\"${lncRNA_bed}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1598092-a1d5-4c8d-ab1c-3bd9d8fcc73a",
   "metadata": {},
   "source": [
    "### Convert lncRNA GFF to GTF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9cf5e7d6-c5eb-4bc5-af4f-b2955c4de478",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "${gffread} -E \\\n",
    "${analysis_dir}/\"${lncRNA_gff}\" -T \\\n",
    "1> ${analysis_dir}/\"${lncRNA_gtf}\" \\\n",
    "2> ${analysis_dir}/gffread-lncRNA_gff-to-lncRNA_gtf.stderr"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e50df042-59c0-4327-b3a4-f14399eba05f",
   "metadata": {},
   "source": [
    "### Inspect lncRNA GTF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "98a870cf-518f-44e5-8c7a-eb32c219ea68",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NC_035780.1\tGnomon\ttranscript\t13578\t14594\t.\t+\t.\ttranscript_id \"rna-XR_002636969.1\"; gene_id \"gene-LOC111116054\"; gene_name \"LOC111116054\"\n",
      "NC_035780.1\tGnomon\texon\t13578\t14594\t.\t+\t.\ttranscript_id \"rna-XR_002636969.1\"; gene_id \"gene-LOC111116054\"; gene_name \"LOC111116054\";\n",
      "NC_035780.1\tGnomon\ttranscript\t169468\t170178\t.\t-\t.\ttranscript_id \"rna-XR_002635081.1\"; gene_id \"gene-LOC111105702\"; gene_name \"LOC111105702\"\n",
      "NC_035780.1\tGnomon\texon\t169468\t170178\t.\t-\t.\ttranscript_id \"rna-XR_002635081.1\"; gene_id \"gene-LOC111105702\"; gene_name \"LOC111105702\";\n",
      "NC_035780.1\tGnomon\ttranscript\t900326\t903430\t.\t+\t.\ttranscript_id \"rna-XR_002636046.1\"; gene_id \"gene-LOC111111519\"; gene_name \"LOC111111519\"\n",
      "NC_035780.1\tGnomon\texon\t900326\t903430\t.\t+\t.\ttranscript_id \"rna-XR_002636046.1\"; gene_id \"gene-LOC111111519\"; gene_name \"LOC111111519\";\n",
      "NC_035780.1\tGnomon\ttranscript\t1280831\t1282416\t.\t-\t.\ttranscript_id \"rna-XR_002638148.1\"; gene_id \"gene-LOC111124195\"; gene_name \"LOC111124195\"\n",
      "NC_035780.1\tGnomon\texon\t1280831\t1282416\t.\t-\t.\ttranscript_id \"rna-XR_002638148.1\"; gene_id \"gene-LOC111124195\"; gene_name \"LOC111124195\";\n",
      "NC_035780.1\tGnomon\ttranscript\t1432944\t1458091\t.\t+\t.\ttranscript_id \"rna-XR_002639675.1\"; gene_id \"gene-LOC111135942\"; gene_name \"LOC111135942\"\n",
      "NC_035780.1\tGnomon\texon\t1432944\t1458091\t.\t+\t.\ttranscript_id \"rna-XR_002639675.1\"; gene_id \"gene-LOC111135942\"; gene_name \"LOC111135942\";\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "head ${analysis_dir}/\"${lncRNA_gtf}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8818969-1157-4bdd-9ef4-b41cb9950314",
   "metadata": {},
   "source": [
    "### Extract lncRNAs to FastA\n",
    "\n",
    "Explanation of GffRead options used below:\n",
    "\n",
    "- `-E`: warns about duplicate transcript IDs and other potential problems in the input records\n",
    "\n",
    "- `-w`: writes the spliced exon sequence of each transcript to the specified output FastA file\n",
    "\n",
    "- `-W`: for `-w`, writes the exon coordinates, projected onto the spliced sequence, in the FastA deflines\n",
    "\n",
    "- `-g`: specifies the input genomic FastA (needs to have a corresponding FastA index file in the same directory)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "1b77eb6a-5c08-4fca-ad7e-54722159b284",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "${gffread} -E \\\n",
    "-w ${analysis_dir}/\"${lncRNA_fasta}\" -W \\\n",
    "-g \"${ncbi_fasta}\" \\\n",
    "${analysis_dir}/\"${lncRNA_gtf}\" \\\n",
    "2> ${analysis_dir}/gffread_lncRNA-fasta-extraction.stderr"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f107e7ea-2066-479c-9710-bd5be61688ca",
   "metadata": {},
   "source": [
    "### Inspect lncRNA FastA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "c414b8ea-ff50-467a-a11c-2243d3de82fc",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">rna-XR_002636969.1 loc:NC_035780.1|13578-14594|+ exons:13578-14594 segs:1-1017\n",
      "tgatattgttgtgtGCAGAACGTggggtaagaaaacatgcaacactcataatattttacaatctgtctaG\n",
      "TTTTCGTTGGACACATCCCACATACTAGAGGAAGGTCAGAAGCATGGGGGTGGTGGCATgctttttacac\n",
      "tgaatgatcggcagtttgcagtgttcaactccaaatctcttctatgcacaaatcaaataacaaactttac\n",
      "aCAGCTGTTACATGGAAAGTacctacatattttcataatggaaagaaataattatgaccatcacactgta\n",
      "ttgaatttactagagaatatattgacttagaaggtttttttttaactttgtactggctgccaggcatgat\n",
      "aacatgctacatcatacatgttgacttttaatcatcttaatagaagtaaaaacaataaaggtaatctctc\n",
      "tgaaataaacttttattgatgaatgcattgatatgtatacatgtatgtcatcacagttttctcactatca\n",
      "ttcctgaaatgtacagtgtcagctgatgtcatgatgatctacattttacataaaaattttcctCCTGAGA\n",
      "TAAAAAGCGCAGATTAATATTTCACTCAATCccattttaactgttttattatacatattaactcttaaac\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "head ${analysis_dir}/\"${lncRNA_fasta}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6217966c-f3d3-4071-b0ad-c1dbac0c0573",
   "metadata": {},
   "source": [
    "### Create lncRNA FastA index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "56615e1b-cb08-49e3-bf4a-d5c858e621ad",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-rw-r-- 1 sam sam 179K Feb 18 07:11 GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "${samtools} faidx \"${lncRNA_fasta}\"\n",
    "\n",
    "ls -ltrh \"${lncRNA_fasta_index}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ed920e4-7306-4841-819f-2e7f2b1ed19c",
   "metadata": {},
   "source": [
    "### Inspect lncRNA FastA index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "01e161e4-2e24-492c-8112-f9a7641b54d6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "rna-XR_002636969.1\t1017\t80\t70\t71\n",
      "rna-XR_002635081.1\t711\t1195\t70\t71\n",
      "rna-XR_002636046.1\t3105\t2001\t70\t71\n",
      "rna-XR_002638148.1\t1586\t5239\t70\t71\n",
      "rna-XR_002639675.1\t25148\t6937\t70\t71\n",
      "rna-XR_002636574.1\t10029\t32534\t70\t71\n",
      "rna-XR_002636864.1\t6843\t42795\t70\t71\n",
      "rna-XR_002636863.1\t6857\t49824\t70\t71\n",
      "rna-XR_002635698.1\t5581\t56867\t70\t71\n",
      "rna-XR_002637875.1\t1611\t62616\t70\t71\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "head \"${lncRNA_fasta_index}\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0b01c2f-2c68-4821-ae0e-f19c43231119",
   "metadata": {},
   "source": [
    "### Generate checksums"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "28de37c9ee1308ac1175397d16b3aafe  GCF_002022765.2_C_virginica-3.0_lncRNA.bed\n",
      "7fac9e7191915f763cc7f5d22838ac25  GCF_002022765.2_C_virginica-3.0_lncRNA.fa\n",
      "1b43db284950abc07afb5f50164fb264  GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n",
      "00755b8c80166cdec94b09f231ef440a  GCF_002022765.2_C_virginica-3.0_lncRNA.gff\n",
      "dedab056acd679cf4eab83629882ee10  GCF_002022765.2_C_virginica-3.0_lncRNA.gtf\n",
      "7ec412a022f43cfeb7729e55aac78ef6  gffread_lncRNA-fasta-extraction.stderr\n",
      "cba3ae8e2474861cd60aa304269b66a8  gffread-lncRNA_gff-to-lncRNA_gtf.stderr\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "for file in *\n",
    "do\n",
    "  md5sum \"${file}\" | tee --append checksums.md5\n",
    "done"
   ]
  },
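  {
   "cell_type": "markdown",
   "id": "f6c0e3a8-2b4d-4d97-a1e5-7e9f8b3d0c24",
   "metadata": {},
   "source": [
    "The recorded checksums can later be used to confirm that the output files are unchanged. This sketch assumes GNU `md5sum` and a `checksums.md5` written by a single run of the loop above; if the loop is re-run in the same directory, `checksums.md5` will also contain an entry for itself (plus appended duplicates), and that entry is expected to fail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7d1f4b9-3c5e-4ea8-b2f6-8f0a9c4e1d35",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "# Re-compute the MD5 of each listed file and compare it with the recorded value\n",
    "md5sum --check checksums.md5"
   ]
  },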
  {
   "cell_type": "markdown",
   "id": "3eed5b68-470e-4b9a-af97-9ba2f27a51b5",
   "metadata": {},
   "source": [
    "### Document GffRead program options"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "61be6360-3e83-4cd9-a1db-4554995b8771",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "gffread v0.12.7. Usage:\n",
      "gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>] \n",
      " [-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]\n",
      " [--jmatch <chr>:<start>-<end>] [--no-pseudo] \n",
      " [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]\n",
      " [-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]\n",
      " [--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]\n",
      " [<input_gff>] \n",
      "\n",
      " Filter, convert or cluster GFF/GTF/BED records, extract the sequence of\n",
      " transcripts (exon or CDS) and more.\n",
      " By default (i.e. without -O) only transcripts are processed, discarding any\n",
      " other non-transcript features. Default output is a simplified GFF3 with only\n",
      " the basic attributes.\n",
      " \n",
      "Options:\n",
      " --ids discard records/transcripts if their IDs are not listed in <IDs.lst>\n",
      " --nids discard records/transcripts if their IDs are listed in <IDs.lst>\n",
      " -i   discard transcripts having an intron larger than <maxintron>\n",
      " -l   discard transcripts shorter than <minlen> bases\n",
      " -r   only show transcripts overlapping coordinate range <start>..<end>\n",
      "      (on chromosome/contig <chr>, strand <strand> if provided)\n",
      " -R   for -r option, discard all transcripts that are not fully \n",
      "      contained within the given range\n",
      " --jmatch only output transcripts matching the given junction\n",
      " -U   discard single-exon transcripts\n",
      " -C   coding only: discard mRNAs that have no CDS features\n",
      " --nc non-coding only: discard mRNAs that have CDS features\n",
      " --ignore-locus : discard locus features and attributes found in the input\n",
      " -A   use the description field from <seq_info.fsize> and add it\n",
      "      as the value for a 'descr' attribute to the GFF record\n",
      " -s   <seq_info.fsize> is a tab-delimited file providing this info\n",
      "      for each of the mapped sequences:\n",
      "      <seq-name> <seq-length> <seq-description>\n",
      "      (useful for -A option with mRNA/EST/protein mappings)\n",
      "Sorting: (by default, chromosomes are kept in the order they were found)\n",
      " --sort-alpha : chromosomes (reference sequences) are sorted alphabetically\n",
      " --sort-by : sort the reference sequences by the order in which their\n",
      "      names are given in the <refseq.lst> file\n",
      "Misc options: \n",
      " -F   keep all GFF attributes (for non-exon features)\n",
      " --keep-exon-attrs : for -F option, do not attempt to reduce redundant\n",
      "      exon/CDS attributes\n",
      " -G   do not keep exon attributes, move them to the transcript feature\n",
      "      (for GFF3 output)\n",
      " --attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>\n",
      "    which is a comma delimited list of attribute names to\n",
      " --keep-genes : in transcript-only mode (default), also preserve gene records\n",
      " --keep-comments: for GFF3 input/output, try to preserve comments\n",
      " -O   process other non-transcript GFF records (by default non-transcript\n",
      "      records are ignored)\n",
      " -V   discard any mRNAs with CDS having in-frame stop codons (requires -g)\n",
      " -H   for -V option, check and adjust the starting CDS phase\n",
      "      if the original phase leads to a translation with an \n",
      "      in-frame stop codon\n",
      " -B   for -V option, single-exon transcripts are also checked on the\n",
      "      opposite strand (requires -g)\n",
      " -P   add transcript level GFF attributes about the coding status of each\n",
      "      transcript, including partialness or in-frame stop codons (requires -g)\n",
      " --add-hasCDS : add a \"hasCDS\" attribute with value \"true\" for transcripts\n",
      "      that have CDS features\n",
      " --adj-stop stop codon adjustment: enables -P and performs automatic\n",
      "      adjustment of the CDS stop coordinate if premature or downstream\n",
      " -N   discard multi-exon mRNAs that have any intron with a non-canonical\n",
      "      splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)\n",
      " -J   discard any mRNAs that either lack initial START codon\n",
      "      or the terminal STOP codon, or have an in-frame stop codon\n",
      "      (i.e. only print mRNAs with a complete CDS)\n",
      " --no-pseudo: filter out records matching the 'pseudo' keyword\n",
      " --in-bed: input should be parsed as BED format (automatic if the input\n",
      "           filename ends with .bed*)\n",
      " --in-tlf: input GFF-like one-line-per-transcript format without exon/CDS\n",
      "           features (see --tlf option below); automatic if the input\n",
      "           filename ends with .tlf)\n",
      " --stream: fast processing of input GFF/BED transcripts as they are received\n",
      "           ((no sorting, exons must be grouped by transcript in the input data)\n",
      "Clustering:\n",
      " -M/--merge : cluster the input transcripts into loci, discarding\n",
      "      \"redundant\" transcripts (those with the same exact introns\n",
      "      and fully contained or equal boundaries)\n",
      " -d <dupinfo> : for -M option, write duplication info to file <dupinfo>\n",
      " --cluster-only: same as -M/--merge but without discarding any of the\n",
      "      \"duplicate\" transcripts, only create \"locus\" features\n",
      " -K   for -M option: also discard as redundant the shorter, fully contained\n",
      "       transcripts (intron chains matching a part of the container)\n",
      " -Q   for -M option, no longer require boundary containment when assessing\n",
      "      redundancy (can be combined with -K); only introns have to match for\n",
      "      multi-exon transcripts, and >=80% overlap for single-exon transcripts\n",
      " -Y   for -M option, enforce -Q but also discard overlapping single-exon \n",
      "      transcripts, even on the opposite strand (can be combined with -K)\n",
      "Output options:\n",
      " --force-exons: make sure that the lowest level GFF features are considered\n",
      "       \"exon\" features\n",
      " --gene2exon: for single-line genes not parenting any transcripts, add an\n",
      "       exon feature spanning the entire gene (treat it as a transcript)\n",
      " --t-adopt:  try to find a parent gene overlapping/containing a transcript\n",
      "       that does not have any explicit gene Parent\n",
      " -D    decode url encoded characters within attributes\n",
      " -Z    merge very close exons into a single exon (when intron size<4)\n",
      " -g   full path to a multi-fasta file with the genomic sequences\n",
      "      for all input mappings, OR a directory with single-fasta files\n",
      "      (one per genomic sequence, with file names matching sequence names)\n",
      " -j    output the junctions and the corresponding transcripts\n",
      " -w    write a fasta file with spliced exons for each transcript\n",
      " --w-add <N> for the -w option, extract additional <N> bases\n",
      "       both upstream and downstream of the transcript boundaries\n",
      " --w-nocds for -w, disable the output of CDS info in the FASTA file\n",
      " -x    write a fasta file with spliced CDS for each GFF transcript\n",
      " -y    write a protein fasta file with the translation of CDS for each record\n",
      " -W    for -w, -x and -y options, write in the FASTA defline all the exon\n",
      "       coordinates projected onto the spliced sequence;\n",
      " -S    for -y option, use '*' instead of '.' as stop codon translation\n",
      " -L    Ensembl GTF to GFF3 conversion, adds version to IDs\n",
      " -m    <chr_replace> is a name mapping table for converting reference \n",
      "       sequence names, having this 2-column format:\n",
      "       <original_ref_ID> <new_ref_ID>\n",
      " -t    use <trackname> in the 2nd column of each GFF/GTF output line\n",
      " -o    write the output records into <outfile> instead of stdout\n",
      " -T    main output will be GTF instead of GFF3\n",
      " --bed output records in BED format instead of default GFF3\n",
      " --tlf output \"transcript line format\" which is like GFF\n",
      "       but with exons and CDS related features stored as GFF \n",
      "       attributes in the transcript feature line, like this:\n",
      "         exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords> \n",
      "       <exons> is a comma-delimited list of exon_start-exon_end coordinates;\n",
      "       <CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>\n",
      " --table output a simple tab delimited format instead of GFF, with columns\n",
      "       having the values of GFF attributes given in <attrlist>; special\n",
      "       pseudo-attributes (prefixed by @) are recognized:\n",
      "       @id, @geneid, @chr, @start, @end, @strand, @numexons, @exons, \n",
      "       @cds, @covlen, @cdslen\n",
      "       If any of -w/-y/-x FASTA output files are enabled, the same fields\n",
      "       (excluding @id) are appended to the definition line of corresponding\n",
      "       FASTA records\n",
      " -v,-E expose (warn about) duplicate transcript IDs and other potential\n",
      "       problems with the given GFF/GTF records\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "# gffread prints its usage to stderr and exits with status 1 on -h,\n",
    "# so tolerate the non-zero exit to keep the cell from raising CalledProcessError.\n",
    "${gffread} -h || true"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c92e8380-9c6d-47c5-9865-56ad65a09bc6",
   "metadata": {},
   "source": [
    "### Document `gtf_extract` options"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "10d9df64-01a7-4af1-b4f5-c20671021689",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: gtf_extract [-h] [-v] [-f FEATURE_TYPE] [--fields FIELD_LIST]\n",
      "                   [-o OUTFILE] [--gff] [-k]\n",
      "                   GTF_FILE\n",
      "\n",
      "Extract selected data items from a GTF file and output in tab-delimited\n",
      "format. The program can also operate on GFF files provided the --gff option is\n",
      "specified.\n",
      "\n",
      "positional arguments:\n",
      "  GTF_FILE              input GTF file to extract data items from\n",
      "\n",
      "optional arguments:\n",
      "  -h, --help            show this help message and exit\n",
      "  -v, --version         show program's version number and exit\n",
      "  -f FEATURE_TYPE, --feature FEATURE_TYPE\n",
      "                        only extract data for lines where feature is\n",
      "                        FEATURE_TYPE\n",
      "  --fields FIELD_LIST   comma-separated list of fields to output in tab-\n",
      "                        delimited format for each line in the GTF, e.g.\n",
      "                        'chrom,start,end'. Fields can either be a GTF field\n",
      "                        name (i.e. 'chrom', 'source', 'feature', 'start',\n",
      "                        'end', 'score', 'strand' and 'frame') or the name of\n",
      "                        an attribute (e.g. 'gene_name', 'gene_id' etc). Data\n",
      "                        items are output in the order they appear in\n",
      "                        FIELD_LIST. If a field doesn't exist for a line then\n",
      "                        '.' will be output as the value.\n",
      "  -o OUTFILE            write output to OUTFILE (default is to write to\n",
      "                        stdout)\n",
      "  --gff                 specify that the input file is GFF rather than GTF\n",
      "                        format\n",
      "  -k, --keep-headers    copy headers from input file to output\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gtf_extract -h"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d2ec5e6-6221-4bc9-af10-00ab6cc67183",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}