{ "cells": [ { "cell_type": "markdown", "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7", "metadata": {}, "source": [ "## Create _C.virginica_ long non-coding RNA (lncRNA) files.\n", "\n", "### Downloads files from NCBI.\n", "\n", "### Notebook relies on:\n", "\n", "- [GffRead](https://github.com/gpertea/gffread)\n", "\n", "- [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) available in your `$PATH`.\n", "\n", " - I accomplished this by creating/activating a conda environment for [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) and running this notebook from within that environment.\n", "\n", "- [samtools](http://www.htslib.org/).\n", "\n", "### Resulting files will be used for [_C.virginica_ RNAseq/DML sex/OA project](https://github.com/epigeneticstoocean/2018_L18-adult-methylation) (GitHub repo)" ] }, { "cell_type": "markdown", "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6", "metadata": {}, "source": [ "### List computer specs" ] }, { "cell_type": "code", "execution_count": 1, "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Fri 18 Feb 2022 07:10:15 AM PST\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 20.04.3 LTS\n", "Release:\t20.04\n", "Codename:\tfocal\n", "\n", "------------\n", "HOSTNAME: \n", "computer\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Byte Order: Little Endian\n", "Address sizes: 45 bits physical, 48 bits virtual\n", "CPU(s): 2\n", "On-line CPU(s) list: 0,1\n", "Thread(s) per core: 1\n", "Core(s) per socket: 1\n", "Socket(s): 2\n", "NUMA node(s): 1\n", "Vendor ID: GenuineIntel\n", "CPU family: 6\n", "Model: 165\n", "Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n", "Stepping: 2\n", "CPU MHz: 2400.008\n", "BogoMIPS: 4800.01\n", "Hypervisor vendor: VMware\n", "Virtualization type: full\n", "L1d cache: 64 KiB\n", "L1i 
cache: 64 KiB\n", "L2 cache: 512 KiB\n", "L3 cache: 32 MiB\n", "NUMA node0 CPU(s): 0,1\n", "Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported\n", "Vulnerability L1tf: Mitigation; PTE Inversion\n", "Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Meltdown: Mitigation; PTI\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling\n", "Vulnerability Srbds: Not affected\n", "Vulnerability Tsx async abort: Not affected\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 54Gi 3.2Gi 46Gi 138Mi 5.1Gi 50Gi\n", "Swap: 2.0Gi 0B 2.0Gi\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", 
"echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "id": "19866f68-bfad-4a83-adb5-24e271e29d06", "metadata": {}, "source": [ "### Set variables\n", "- `%env` sets an environment variable, which is available in `%%bash` cells\n", "\n", "- An assignment without `%env` sets a Python variable" ] }, { "cell_type": "code", "execution_count": 2, "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: data_dir=/home/sam/data/C_virginica/genomes\n", "env: analysis_dir=/home/sam/analyses/20220217-cvir-lncRNA_subsetting\n", "env: ncbi_fasta=GCF_002022765.2_C_virginica-3.0_genomic.fna\n", "env: ncbi_fasta_index=GCF_002022765.2_C_virginica-3.0_genomic.fna.fai\n", "env: ncbi_fasta_gz=GCF_002022765.2_C_virginica-3.0_genomic.fna.gz\n", "env: ncbi_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n", "env: ncbi_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n", "env: ncbi_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n", "env: lncRNA_bed=GCF_002022765.2_C_virginica-3.0_lncRNA.bed\n", "env: lncRNA_gff=GCF_002022765.2_C_virginica-3.0_lncRNA.gff\n", "env: lncRNA_gtf=GCF_002022765.2_C_virginica-3.0_lncRNA.gtf\n", "env: lncRNA_fasta=GCF_002022765.2_C_virginica-3.0_lncRNA.fa\n", "env: lncRNA_fasta_index=GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n", "env: gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread\n", "env: samtools=/home/sam/programs/samtools-1.12/samtools\n" ] } ], "source": [ "# Set directories, input/output files\n", "%env data_dir=/home/sam/data/C_virginica/genomes\n", "%env analysis_dir=/home/sam/analyses/20220217-cvir-lncRNA_subsetting\n", "analysis_dir=\"20220217-cvir-lncRNA_subsetting\"\n", "\n", "# Input files (from NCBI)\n", "%env ncbi_fasta=GCF_002022765.2_C_virginica-3.0_genomic.fna\n", "%env ncbi_fasta_index=GCF_002022765.2_C_virginica-3.0_genomic.fna.fai\n", "%env ncbi_fasta_gz=GCF_002022765.2_C_virginica-3.0_genomic.fna.gz\n", "%env 
ncbi_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n", "%env ncbi_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n", "\n", "# URL to download files from NCBI\n", "%env ncbi_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n", "\n", "# Output files\n", "%env lncRNA_bed=GCF_002022765.2_C_virginica-3.0_lncRNA.bed\n", "%env lncRNA_gff=GCF_002022765.2_C_virginica-3.0_lncRNA.gff\n", "%env lncRNA_gtf=GCF_002022765.2_C_virginica-3.0_lncRNA.gtf\n", "%env lncRNA_fasta=GCF_002022765.2_C_virginica-3.0_lncRNA.fa\n", "%env lncRNA_fasta_index=GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n", "\n", "# Set program locations\n", "%env gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread\n", "%env samtools=/home/sam/programs/samtools-1.12/samtools" ] }, { "cell_type": "markdown", "id": "7f204c16-2d1f-4837-93b0-1fb0e3d00d64", "metadata": {}, "source": [ "### Create analysis directory" ] }, { "cell_type": "code", "execution_count": 3, "id": "8f275e34-c56e-4754-abf7-3279667434bb", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# Make analysis directory, if it doesn't exist\n", "mkdir --parents \"${analysis_dir}\"" ] }, { "cell_type": "markdown", "id": "56052f6d-441a-4048-8a6f-39d58552283d", "metadata": {}, "source": [ "### Download GFF" ] }, { "cell_type": "code", "execution_count": 4, "id": "951fc8e9-b821-4f54-848f-f9573daadc83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 412M Dec 10 2019 GCF_002022765.2_C_virginica-3.0_genomic.gff\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "gzip: GCF_002022765.2_C_virginica-3.0_genomic.gff already exists;\tnot overwritten\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Download with wget.\n", "# Use --quiet option to prevent wget output from printing too many lines to notebook\n", "# Use --continue to prevent re-downloading file if it's already been downloaded.\n", "wget --quiet 
\\\n", "--continue \\\n", "${ncbi_url}/${ncbi_gff_gz}\n", "\n", "# Unzip download GFF\n", "gunzip \"${ncbi_gff_gz}\"\n", "\n", "ls -ltrh \"${ncbi_gff}\"" ] }, { "cell_type": "markdown", "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb", "metadata": {}, "source": [ "### Examine GFF" ] }, { "cell_type": "code", "execution_count": 5, "id": "18862291-d1ec-4b62-8b22-959404538a7f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\n", "#!gff-spec-version 1.21\n", "#!processor NCBI annotwriter\n", "#!genome-build C_virginica-3.0\n", "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\n", "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\n", "##sequence-region NC_035780.1 1 65668440\n", "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565\n", "NC_035780.1\tRefSeq\tregion\t1\t65668440\t.\t+\t.\tID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample\n", "NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA\n", "NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", 
"NC_035780.1\tGnomon\texon\t13578\t13603\t.\t+\t.\tID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t14237\t14290\t.\t+\t.\tID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t14557\t14594\t.\t+\t.\tID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\tgene\t28961\t33324\t.\t+\t.\tID=gene-LOC111126949;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding\n", "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna-XM_022471938.1;Parent=gene-LOC111126949;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t28961\t29073\t.\t+\t.\tID=exon-XM_022471938.1-1;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t30524\t31557\t.\t+\t.\tID=exon-XM_022471938.1-2;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", 
"NC_035780.1\tGnomon\texon\t31736\t31887\t.\t+\t.\tID=exon-XM_022471938.1-3;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t31977\t32565\t.\t+\t.\tID=exon-XM_022471938.1-4;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n" ] } ], "source": [ "%%bash\n", "head -n 20 \"${data_dir}\"/\"${ncbi_gff}\"" ] }, { "cell_type": "markdown", "id": "27fb9e06-b925-4bf9-a3d7-154ffed294a2", "metadata": {}, "source": [ "### Download NCBI genomic FastA" ] }, { "cell_type": "code", "execution_count": 6, "id": "cf72f0a7-3740-45c8-bc74-12433779df5f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 662M Dec 10 2019 GCF_002022765.2_C_virginica-3.0_genomic.fna\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "gzip: GCF_002022765.2_C_virginica-3.0_genomic.fna already exists;\tnot overwritten\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Download with wget.\n", "# Use --quiet option to prevent wget output from printing too many lines to notebook\n", "# Use --continue to prevent re-downloading file if it's already been downloaded.\n", "wget --quiet \\\n", "--continue \\\n", "${ncbi_url}/${ncbi_fasta_gz}\n", "\n", "# Unzip downloaded FastA\n", "gunzip \"${ncbi_fasta_gz}\"\n", "\n", "ls -ltrh \"${ncbi_fasta}\"\n" ] }, { "cell_type": "markdown", "id": "9eedff2e-dcaf-4324-8722-45aab3f0a616", "metadata": {}, "source": [ "### Create FastA index with [Samtools](http://www.htslib.org/)" ] }, { "cell_type": "code", "execution_count": 7, "id": "7fb54c0d-d2a6-4578-8d08-0e5ad17a179a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 398 Feb 18 07:10 GCF_002022765.2_C_virginica-3.0_genomic.fna.fai\n" ] } ], 
"source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "${samtools} faidx \"${ncbi_fasta}\"\n", "\n", "ls -ltrh \"${ncbi_fasta_index}\"" ] }, { "cell_type": "markdown", "id": "5406fed7-865e-445c-9c75-55b9e8d56a40", "metadata": {}, "source": [ "### Inspect NCBI genomic FastA index" ] }, { "cell_type": "code", "execution_count": 8, "id": "07cc3f6b-a512-426f-98c3-c0ac2d92e0b9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t65668440\t117\t80\t81\n", "NC_035781.1\t61752955\t66489530\t80\t81\n", "NC_035782.1\t77061148\t129014514\t80\t81\n", "NC_035783.1\t59691872\t207039044\t80\t81\n", "NC_035784.1\t98698416\t267477182\t80\t81\n", "NC_035785.1\t51258098\t367409446\t80\t81\n", "NC_035786.1\t57830854\t419308388\t80\t81\n", "NC_035787.1\t75944018\t477862245\t80\t81\n", "NC_035788.1\t104168038\t554755681\t80\t81\n", "NC_035789.1\t32650045\t660225938\t80\t81\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "head \"${ncbi_fasta_index}\"" ] }, { "cell_type": "markdown", "id": "9827c0d1-ba5c-4ef8-b703-895433a8f3bd", "metadata": {}, "source": [ "### Extracts lncRNAs from genomic GFF using `gtf_extract` from [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html)" ] }, { "cell_type": "code", "execution_count": 9, "id": "3dc274bf-228a-4d08-bd30-de6872a9ecc0", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\n", "#!gff-spec-version 1.21\n", "#!processor NCBI annotwriter\n", "#!genome-build C_virginica-3.0\n", "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\n", "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\n", "##sequence-region NC_035780.1 1 65668440\n", "#!lncRNA only - created by Sam White Fri 18 Feb 2022 07:10:32 AM PST\n", 
"NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\tlnc_RNA\t169468\t170178\t.\t-\t.\tID=rna-XR_002635081.1;Parent=gene-LOC111105702;Dbxref=GeneID:111105702,Genbank:XR_002635081.1;Name=XR_002635081.1;gbkey=ncRNA;gene=LOC111105702;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 3 samples with support for all annotated introns;product=uncharacterized LOC111105702;transcript_id=XR_002635081.1\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Capture GFF header from NCBI gff\n", "head -n 7 \"${ncbi_gff}\" > ${analysis_dir}/\"${lncRNA_gff}\"\n", "\n", "# Add note about modification\n", "printf \"#%s%s\\n\" \"!\" \"lncRNA only - created by Sam White $(date)\" >> ${analysis_dir}/\"${lncRNA_gff}\"\n", "\n", "\n", "# Finds lncRNAs in NCBI GFF\n", "gtf_extract \\\n", "--feature lnc_RNA \\\n", "--gff \"${ncbi_gff}\" \\\n", ">> ${analysis_dir}/\"${lncRNA_gff}\"\n", "\n", "\n", "head ${analysis_dir}/\"${lncRNA_gff}\"" ] }, { "cell_type": "markdown", "id": "7b583b9c-ff1d-430f-95b4-c2171c3cd7dd", "metadata": {}, "source": [ "### Extract lncRNAs to BED using [GffRead](https://github.com/gpertea/gffread)" ] }, { "cell_type": "code", "execution_count": 10, "id": "c48e30c2-8241-488e-9db5-bf6ed317be0c", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "${gffread} --bed \\\n", "${analysis_dir}/\"${lncRNA_gff}\" \\\n", "> ${analysis_dir}/\"${lncRNA_bed}\"" ] }, { "cell_type": "markdown", "id": 
"cd342a69-02c3-4822-83a8-b11276185850", "metadata": {}, "source": [ "### Inspect lncRNA BED" ] }, { "cell_type": "code", "execution_count": 11, "id": "3fef57b2-3b95-4871-a447-44212a885146", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\t13577\t14594\trna-XR_002636969.1\t100\t+\t13577\t14594\t0,0,0\t1\t1017,\t0,\tgeneID=gene-LOC111116054;gene_name=LOC111116054\n", "NC_035780.1\t169467\t170178\trna-XR_002635081.1\t100\t-\t169467\t170178\t0,0,0\t1\t711,\t0,\tgeneID=gene-LOC111105702;gene_name=LOC111105702\n", "NC_035780.1\t900325\t903430\trna-XR_002636046.1\t100\t+\t900325\t903430\t0,0,0\t1\t3105,\t0,\tgeneID=gene-LOC111111519;gene_name=LOC111111519\n", "NC_035780.1\t1280830\t1282416\trna-XR_002638148.1\t100\t-\t1280830\t1282416\t0,0,0\t1\t1586,\t0,\tgeneID=gene-LOC111124195;gene_name=LOC111124195\n", "NC_035780.1\t1432943\t1458091\trna-XR_002639675.1\t100\t+\t1432943\t1458091\t0,0,0\t1\t25148,\t0,\tgeneID=gene-LOC111135942;gene_name=LOC111135942\n", "NC_035780.1\t1503801\t1513830\trna-XR_002636574.1\t100\t-\t1503801\t1513830\t0,0,0\t1\t10029,\t0,\tgeneID=gene-LOC111114441;gene_name=LOC111114441\n", "NC_035780.1\t1856840\t1863683\trna-XR_002636864.1\t100\t-\t1856840\t1863683\t0,0,0\t1\t6843,\t0,\tgeneID=gene-LOC111115591;gene_name=LOC111115591\n", "NC_035780.1\t1856840\t1863697\trna-XR_002636863.1\t100\t-\t1856840\t1863697\t0,0,0\t1\t6857,\t0,\tgeneID=gene-LOC111115591;gene_name=LOC111115591\n", "NC_035780.1\t2161222\t2166803\trna-XR_002635698.1\t100\t+\t2161222\t2166803\t0,0,0\t1\t5581,\t0,\tgeneID=gene-LOC111109763;gene_name=LOC111109763\n", "NC_035780.1\t2928483\t2930094\trna-XR_002637875.1\t100\t-\t2928483\t2930094\t0,0,0\t1\t1611,\t0,\tgeneID=gene-LOC111122009;gene_name=LOC111122009\n" ] } ], "source": [ "%%bash\n", "head ${analysis_dir}/\"${lncRNA_bed}\"" ] }, { "cell_type": "markdown", "id": "b1598092-a1d5-4c8d-ab1c-3bd9d8fcc73a", "metadata": {}, "source": [ "### Convert lncRNA GFF to GTF" ] }, { "cell_type": 
"code", "execution_count": 12, "id": "9cf5e7d6-c5eb-4bc5-af4f-b2955c4de478", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "${gffread} -E \\\n", "${analysis_dir}/\"${lncRNA_gff}\" -T \\\n", "1> ${analysis_dir}/\"${lncRNA_gtf}\" \\\n", "2> ${analysis_dir}/gffread-lncRNA_gff-to-lncRNA_gtf.stderr" ] }, { "cell_type": "markdown", "id": "e50df042-59c0-4327-b3a4-f14399eba05f", "metadata": {}, "source": [ "### Inspect lncRNA GTF" ] }, { "cell_type": "code", "execution_count": 13, "id": "98a870cf-518f-44e5-8c7a-eb32c219ea68", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_035780.1\tGnomon\ttranscript\t13578\t14594\t.\t+\t.\ttranscript_id \"rna-XR_002636969.1\"; gene_id \"gene-LOC111116054\"; gene_name \"LOC111116054\"\n", "NC_035780.1\tGnomon\texon\t13578\t14594\t.\t+\t.\ttranscript_id \"rna-XR_002636969.1\"; gene_id \"gene-LOC111116054\"; gene_name \"LOC111116054\";\n", "NC_035780.1\tGnomon\ttranscript\t169468\t170178\t.\t-\t.\ttranscript_id \"rna-XR_002635081.1\"; gene_id \"gene-LOC111105702\"; gene_name \"LOC111105702\"\n", "NC_035780.1\tGnomon\texon\t169468\t170178\t.\t-\t.\ttranscript_id \"rna-XR_002635081.1\"; gene_id \"gene-LOC111105702\"; gene_name \"LOC111105702\";\n", "NC_035780.1\tGnomon\ttranscript\t900326\t903430\t.\t+\t.\ttranscript_id \"rna-XR_002636046.1\"; gene_id \"gene-LOC111111519\"; gene_name \"LOC111111519\"\n", "NC_035780.1\tGnomon\texon\t900326\t903430\t.\t+\t.\ttranscript_id \"rna-XR_002636046.1\"; gene_id \"gene-LOC111111519\"; gene_name \"LOC111111519\";\n", "NC_035780.1\tGnomon\ttranscript\t1280831\t1282416\t.\t-\t.\ttranscript_id \"rna-XR_002638148.1\"; gene_id \"gene-LOC111124195\"; gene_name \"LOC111124195\"\n", "NC_035780.1\tGnomon\texon\t1280831\t1282416\t.\t-\t.\ttranscript_id \"rna-XR_002638148.1\"; gene_id \"gene-LOC111124195\"; gene_name \"LOC111124195\";\n", "NC_035780.1\tGnomon\ttranscript\t1432944\t1458091\t.\t+\t.\ttranscript_id \"rna-XR_002639675.1\"; 
gene_id \"gene-LOC111135942\"; gene_name \"LOC111135942\"\n", "NC_035780.1\tGnomon\texon\t1432944\t1458091\t.\t+\t.\ttranscript_id \"rna-XR_002639675.1\"; gene_id \"gene-LOC111135942\"; gene_name \"LOC111135942\";\n" ] } ], "source": [ "%%bash\n", "head ${analysis_dir}/\"${lncRNA_gtf}\"" ] }, { "cell_type": "markdown", "id": "d8818969-1157-4bdd-9ef4-b41cb9950314", "metadata": {}, "source": [ "### Extract lncRNAs to FastA\n", "\n", "Explanation of GffRead options used below:\n", "\n", "- `-w`: specifies the output FastA file\n", "\n", "- `-W`: writes the coordinates of all spliced exons in the FastA deflines\n", "\n", "- `-g`: specifies the input FastA (needs to have a corresponding FastA index file in the same directory)" ] }, { "cell_type": "code", "execution_count": 14, "id": "1b77eb6a-5c08-4fca-ad7e-54722159b284", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "${gffread} -E \\\n", "-w ${analysis_dir}/\"${lncRNA_fasta}\" -W \\\n", "-g \"${ncbi_fasta}\" \\\n", "${analysis_dir}/\"${lncRNA_gtf}\" \\\n", "2> ${analysis_dir}/gffread_lncRNA-fasta-extraction.stderr" ] }, { "cell_type": "markdown", "id": "f107e7ea-2066-479c-9710-bd5be61688ca", "metadata": {}, "source": [ "### Inspect lncRNA FastA" ] }, { "cell_type": "code", "execution_count": 15, "id": "c414b8ea-ff50-467a-a11c-2243d3de82fc", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">rna-XR_002636969.1 loc:NC_035780.1|13578-14594|+ exons:13578-14594 segs:1-1017\n", "tgatattgttgtgtGCAGAACGTggggtaagaaaacatgcaacactcataatattttacaatctgtctaG\n", "TTTTCGTTGGACACATCCCACATACTAGAGGAAGGTCAGAAGCATGGGGGTGGTGGCATgctttttacac\n", "tgaatgatcggcagtttgcagtgttcaactccaaatctcttctatgcacaaatcaaataacaaactttac\n", "aCAGCTGTTACATGGAAAGTacctacatattttcataatggaaagaaataattatgaccatcacactgta\n", "ttgaatttactagagaatatattgacttagaaggtttttttttaactttgtactggctgccaggcatgat\n", "aacatgctacatcatacatgttgacttttaatcatcttaatagaagtaaaaacaataaaggtaatctctc\n", 
"tgaaataaacttttattgatgaatgcattgatatgtatacatgtatgtcatcacagttttctcactatca\n", "ttcctgaaatgtacagtgtcagctgatgtcatgatgatctacattttacataaaaattttcctCCTGAGA\n", "TAAAAAGCGCAGATTAATATTTCACTCAATCccattttaactgttttattatacatattaactcttaaac\n" ] } ], "source": [ "%%bash\n", "head ${analysis_dir}/\"${lncRNA_fasta}\"" ] }, { "cell_type": "markdown", "id": "6217966c-f3d3-4071-b0ad-c1dbac0c0573", "metadata": {}, "source": [ "### Create lncRNA FastA index" ] }, { "cell_type": "code", "execution_count": 16, "id": "56615e1b-cb08-49e3-bf4a-d5c858e621ad", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 179K Feb 18 07:11 GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "${samtools} faidx \"${lncRNA_fasta}\"\n", "\n", "ls -ltrh \"${lncRNA_fasta_index}\"" ] }, { "cell_type": "markdown", "id": "3ed920e4-7306-4841-819f-2e7f2b1ed19c", "metadata": {}, "source": [ "### Inspect lncRNA FastA index" ] }, { "cell_type": "code", "execution_count": 17, "id": "01e161e4-2e24-492c-8112-f9a7641b54d6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "rna-XR_002636969.1\t1017\t80\t70\t71\n", "rna-XR_002635081.1\t711\t1195\t70\t71\n", "rna-XR_002636046.1\t3105\t2001\t70\t71\n", "rna-XR_002638148.1\t1586\t5239\t70\t71\n", "rna-XR_002639675.1\t25148\t6937\t70\t71\n", "rna-XR_002636574.1\t10029\t32534\t70\t71\n", "rna-XR_002636864.1\t6843\t42795\t70\t71\n", "rna-XR_002636863.1\t6857\t49824\t70\t71\n", "rna-XR_002635698.1\t5581\t56867\t70\t71\n", "rna-XR_002637875.1\t1611\t62616\t70\t71\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "head \"${lncRNA_fasta_index}\"" ] }, { "cell_type": "markdown", "id": "c0b01c2f-2c68-4821-ae0e-f19c43231119", "metadata": {}, "source": [ "### Generate checksums" ] }, { "cell_type": "code", "execution_count": 18, "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596", "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "28de37c9ee1308ac1175397d16b3aafe GCF_002022765.2_C_virginica-3.0_lncRNA.bed\n", "7fac9e7191915f763cc7f5d22838ac25 GCF_002022765.2_C_virginica-3.0_lncRNA.fa\n", "1b43db284950abc07afb5f50164fb264 GCF_002022765.2_C_virginica-3.0_lncRNA.fa.fai\n", "00755b8c80166cdec94b09f231ef440a GCF_002022765.2_C_virginica-3.0_lncRNA.gff\n", "dedab056acd679cf4eab83629882ee10 GCF_002022765.2_C_virginica-3.0_lncRNA.gtf\n", "7ec412a022f43cfeb7729e55aac78ef6 gffread_lncRNA-fasta-extraction.stderr\n", "cba3ae8e2474861cd60aa304269b66a8 gffread-lncRNA_gff-to-lncRNA_gtf.stderr\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "for file in *\n", "do\n", " md5sum \"${file}\" | tee --append checksums.md5\n", "done" ] }, { "cell_type": "markdown", "id": "3eed5b68-470e-4b9a-af97-9ba2f27a51b5", "metadata": {}, "source": [ "### Document GffRead program options" ] }, { "cell_type": "code", "execution_count": 19, "id": "61be6360-3e83-4cd9-a1db-4554995b8771", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "gffread v0.12.7. Usage:\n", "gffread [-g | ] [-s ] \n", " [-o ] [-t ] [-r []:- [-R]]\n", " [--jmatch :-] [--no-pseudo] \n", " [-CTVNJMKQAFPGUBHZWTOLE] [-w ] [-x ] [-y ]\n", " [-j ][--ids | --nids ] [--attrs ] [-i ]\n", " [--stream] [--bed | --gtf | --tlf] [--table ] [--sort-by ]\n", " [] \n", "\n", " Filter, convert or cluster GFF/GTF/BED records, extract the sequence of\n", " transcripts (exon or CDS) and more.\n", " By default (i.e. without -O) only transcripts are processed, discarding any\n", " other non-transcript features. 
Default output is a simplified GFF3 with only\n", " the basic attributes.\n", " \n", "Options:\n", " --ids discard records/transcripts if their IDs are not listed in \n", " --nids discard records/transcripts if their IDs are listed in \n", " -i discard transcripts having an intron larger than \n", " -l discard transcripts shorter than bases\n", " -r only show transcripts overlapping coordinate range ..\n", " (on chromosome/contig , strand if provided)\n", " -R for -r option, discard all transcripts that are not fully \n", " contained within the given range\n", " --jmatch only output transcripts matching the given junction\n", " -U discard single-exon transcripts\n", " -C coding only: discard mRNAs that have no CDS features\n", " --nc non-coding only: discard mRNAs that have CDS features\n", " --ignore-locus : discard locus features and attributes found in the input\n", " -A use the description field from and add it\n", " as the value for a 'descr' attribute to the GFF record\n", " -s is a tab-delimited file providing this info\n", " for each of the mapped sequences:\n", " \n", " (useful for -A option with mRNA/EST/protein mappings)\n", "Sorting: (by default, chromosomes are kept in the order they were found)\n", " --sort-alpha : chromosomes (reference sequences) are sorted alphabetically\n", " --sort-by : sort the reference sequences by the order in which their\n", " names are given in the file\n", "Misc options: \n", " -F keep all GFF attributes (for non-exon features)\n", " --keep-exon-attrs : for -F option, do not attempt to reduce redundant\n", " exon/CDS attributes\n", " -G do not keep exon attributes, move them to the transcript feature\n", " (for GFF3 output)\n", " --attrs only output the GTF/GFF attributes listed in \n", " which is a comma delimited list of attribute names to\n", " --keep-genes : in transcript-only mode (default), also preserve gene records\n", " --keep-comments: for GFF3 input/output, try to preserve comments\n", " -O process other 
non-transcript GFF records (by default non-transcript\n", " records are ignored)\n", " -V discard any mRNAs with CDS having in-frame stop codons (requires -g)\n", " -H for -V option, check and adjust the starting CDS phase\n", " if the original phase leads to a translation with an \n", " in-frame stop codon\n", " -B for -V option, single-exon transcripts are also checked on the\n", " opposite strand (requires -g)\n", " -P add transcript level GFF attributes about the coding status of each\n", " transcript, including partialness or in-frame stop codons (requires -g)\n", " --add-hasCDS : add a \"hasCDS\" attribute with value \"true\" for transcripts\n", " that have CDS features\n", " --adj-stop stop codon adjustment: enables -P and performs automatic\n", " adjustment of the CDS stop coordinate if premature or downstream\n", " -N discard multi-exon mRNAs that have any intron with a non-canonical\n", " splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)\n", " -J discard any mRNAs that either lack initial START codon\n", " or the terminal STOP codon, or have an in-frame stop codon\n", " (i.e. 
only print mRNAs with a complete CDS)\n", " --no-pseudo: filter out records matching the 'pseudo' keyword\n", " --in-bed: input should be parsed as BED format (automatic if the input\n", " filename ends with .bed*)\n", " --in-tlf: input GFF-like one-line-per-transcript format without exon/CDS\n", " features (see --tlf option below); automatic if the input\n", " filename ends with .tlf)\n", " --stream: fast processing of input GFF/BED transcripts as they are received\n", " ((no sorting, exons must be grouped by transcript in the input data)\n", "Clustering:\n", " -M/--merge : cluster the input transcripts into loci, discarding\n", " \"redundant\" transcripts (those with the same exact introns\n", " and fully contained or equal boundaries)\n", " -d : for -M option, write duplication info to file \n", " --cluster-only: same as -M/--merge but without discarding any of the\n", " \"duplicate\" transcripts, only create \"locus\" features\n", " -K for -M option: also discard as redundant the shorter, fully contained\n", " transcripts (intron chains matching a part of the container)\n", " -Q for -M option, no longer require boundary containment when assessing\n", " redundancy (can be combined with -K); only introns have to match for\n", " multi-exon transcripts, and >=80% overlap for single-exon transcripts\n", " -Y for -M option, enforce -Q but also discard overlapping single-exon \n", " transcripts, even on the opposite strand (can be combined with -K)\n", "Output options:\n", " --force-exons: make sure that the lowest level GFF features are considered\n", " \"exon\" features\n", " --gene2exon: for single-line genes not parenting any transcripts, add an\n", " exon feature spanning the entire gene (treat it as a transcript)\n", " --t-adopt: try to find a parent gene overlapping/containing a transcript\n", " that does not have any explicit gene Parent\n", " -D decode url encoded characters within attributes\n", " -Z merge very close exons into a single exon (when intron 
size<4)\n", " -g full path to a multi-fasta file with the genomic sequences\n", " for all input mappings, OR a directory with single-fasta files\n", " (one per genomic sequence, with file names matching sequence names)\n", " -j output the junctions and the corresponding transcripts\n", " -w write a fasta file with spliced exons for each transcript\n", " --w-add for the -w option, extract additional bases\n", " both upstream and downstream of the transcript boundaries\n", " --w-nocds for -w, disable the output of CDS info in the FASTA file\n", " -x write a fasta file with spliced CDS for each GFF transcript\n", " -y write a protein fasta file with the translation of CDS for each record\n", " -W for -w, -x and -y options, write in the FASTA defline all the exon\n", " coordinates projected onto the spliced sequence;\n", " -S for -y option, use '*' instead of '.' as stop codon translation\n", " -L Ensembl GTF to GFF3 conversion, adds version to IDs\n", " -m is a name mapping table for converting reference \n", " sequence names, having this 2-column format:\n", " \n", " -t use in the 2nd column of each GFF/GTF output line\n", " -o write the output records into instead of stdout\n", " -T main output will be GTF instead of GFF3\n", " --bed output records in BED format instead of default GFF3\n", " --tlf output \"transcript line format\" which is like GFF\n", " but with exons and CDS related features stored as GFF \n", " attributes in the transcript feature line, like this:\n", " exoncount=N;exons=;CDSphase=;CDS= \n", " is a comma-delimited list of exon_start-exon_end coordinates;\n", " is CDS_start:CDS_end coordinates or a list like \n", " --table output a simple tab delimited format instead of GFF, with columns\n", " having the values of GFF attributes given in ; special\n", " pseudo-attributes (prefixed by @) are recognized:\n", " @id, @geneid, @chr, @start, @end, @strand, @numexons, @exons, \n", " @cds, @covlen, @cdslen\n", " If any of -w/-y/-x FASTA output files are 
enabled, the same fields\n", " (excluding @id) are appended to the definition line of corresponding\n", " FASTA records\n", " -v,-E expose (warn about) duplicate transcript IDs and other potential\n", " problems with the given GFF/GTF records\n" ] } ], "source": [ "%%bash\n", "# gffread prints its usage and then exits non-zero; '|| true' keeps the cell from raising CalledProcessError\n", "${gffread} -h || true" ] }, { "cell_type": "markdown", "id": "c92e8380-9c6d-47c5-9865-56ad65a09bc6", "metadata": {}, "source": [ "### Document `gtf_extract` options" ] }, { "cell_type": "code", "execution_count": 20, "id": "10d9df64-01a7-4af1-b4f5-c20671021689", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: gtf_extract [-h] [-v] [-f FEATURE_TYPE] [--fields FIELD_LIST]\n", " [-o OUTFILE] [--gff] [-k]\n", " GTF_FILE\n", "\n", "Extract selected data items from a GTF file and output in tab-delimited\n", "format. The program can also operate on GFF files provided the --gff option is\n", "specified.\n", "\n", "positional arguments:\n", " GTF_FILE input GTF file to extract data items from\n", "\n", "optional arguments:\n", " -h, --help show this help message and exit\n", " -v, --version show program's version number and exit\n", " -f FEATURE_TYPE, --feature FEATURE_TYPE\n", " only extract data for lines where feature is\n", " FEATURE_TYPE\n", " --fields FIELD_LIST comma-separated list of fields to output in tab-\n", " delimited format for each line in the GTF, e.g.\n", " 'chrom,start,end'. Fields can either be a GTF field\n", " name (i.e. 'chrom', 'source', 'feature', 'start',\n", " 'end', 'score', 'strand' and 'frame') or the name of\n", " an attribute (e.g. 'gene_name', 'gene_id' etc). Data\n", " items are output in the order they appear in\n", " FIELD_LIST. If a field doesn't exist for a line then\n", " '.' 
will be output as the value.\n", " -o OUTFILE write output to OUTFILE (default is to write to\n", " stdout)\n", " --gff specify that the input file is GFF rather than GTF\n", " format\n", " -k, --keep-headers copy headers from input file to output\n" ] } ], "source": [ "%%bash\n", "gtf_extract -h" ] }, { "cell_type": "code", "execution_count": null, "id": "4d2ec5e6-6221-4bc9-af10-00ab6cc67183", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }