{ "cells": [ { "cell_type": "markdown", "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7", "metadata": {}, "source": [ "## Validate and Convert _P.verrucosa_ GFF to GTF\n", "\n", "### Notebook relies on:\n", "\n", "- [GffRead](https://github.com/gpertea/gffread)" ] }, { "cell_type": "markdown", "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6", "metadata": {}, "source": [ "### List computer specs" ] }, { "cell_type": "code", "execution_count": 1, "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Mon Feb 20 12:04:26 PM PST 2023\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 22.04.1 LTS\n", "Release:\t22.04\n", "Codename:\tjammy\n", "\n", "------------\n", "HOSTNAME: \n", "computer\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Address sizes: 45 bits physical, 48 bits virtual\n", "Byte Order: Little Endian\n", "CPU(s): 4\n", "On-line CPU(s) list: 0-3\n", "Vendor ID: GenuineIntel\n", "Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n", "CPU family: 6\n", "Model: 165\n", "Thread(s) per core: 1\n", "Core(s) per socket: 1\n", "Socket(s): 4\n", "Stepping: 2\n", "BogoMIPS: 4800.01\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n", "Hypervisor vendor: VMware\n", "Virtualization type: full\n", "L1d cache: 128 KiB (4 instances)\n", "L1i cache: 128 KiB (4 instances)\n", "L2 cache: 
1 MiB (4 instances)\n", "L3 cache: 64 MiB (4 instances)\n", "NUMA node(s): 1\n", "NUMA node0 CPU(s): 0-3\n", "Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported\n", "Vulnerability L1tf: Mitigation; PTE Inversion\n", "Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Meltdown: Mitigation; PTI\n", "Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Retbleed: Mitigation; IBRS\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected\n", "Vulnerability Srbds: Unknown: Dependent on hypervisor status\n", "Vulnerability Tsx async abort: Not affected\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 54Gi 4.4Gi 42Gi 198Mi 7.4Gi 49Gi\n", "Swap: 2.0Gi 0B 2.0Gi\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "id": "19866f68-bfad-4a83-adb5-24e271e29d06", "metadata": {}, "source": [ "### Set variables\n", "- `%env` indicates a bash variable\n", "\n", "- without `%env` is Python variable" ] }, { "cell_type": "code", "execution_count": 14, "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf", 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: data_dir=/home/sam/data/M_capitata/genomes\n", "env: analysis_dir=/home/sam/analyses/20230127-pver-gff_to_gtf\n", "env: gff=Pver_genome_assembly_v1.0.gff3\n", "env: url=https://owl.fish.washington.edu/halfshell/genomic-databank\n", "env: valid_gff=Pver_genome_assembly_v1.0-valid.gff3\n", "env: gtf=Pver_genome_assembly_v1.0-valid.gtf\n", "env: gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread\n", "env: break_line=--------------------------------------------------------------------------\n" ] } ], "source": [ "# Set directories, input/output files\n", "%env data_dir=/home/sam/data/M_capitata/genomes\n", "%env analysis_dir=/home/sam/analyses/20230127-pver-gff_to_gtf\n", "analysis_dir=\"20230127-pver-gff_to_gtf\"\n", "\n", "# Input files (from NCBI)\n", "%env gff=Pver_genome_assembly_v1.0.gff3\n", "\n", "# URL of file directory\n", "%env url=https://owl.fish.washington.edu/halfshell/genomic-databank\n", "\n", "# Output file(s)\n", "%env valid_gff=Pver_genome_assembly_v1.0-valid.gff3\n", "%env gtf=Pver_genome_assembly_v1.0-valid.gtf\n", "\n", "\n", "# Set program locations\n", "%env gffread=/home/sam/programs/gffread-0.12.7.Linux_x86_64/gffread\n", "\n", "# Set some formatting stuff\n", "%env break_line=--------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "id": "7f204c16-2d1f-4837-93b0-1fb0e3d00d64", "metadata": {}, "source": [ "### Create analysis directory" ] }, { "cell_type": "code", "execution_count": 3, "id": "8f275e34-c56e-4754-abf7-3279667434bb", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# Make analysis and data directory, if doesn't exist\n", "mkdir --parents \"${analysis_dir}\"\n", "\n", "mkdir --parents \"${data_dir}\"" ] }, { "cell_type": "markdown", "id": "56052f6d-441a-4048-8a6f-39d58552283d", "metadata": {}, "source": [ "### Download GFF" ] }, { "cell_type": "code", "execution_count": 4, 
"id": "951fc8e9-b821-4f54-848f-f9573daadc83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 71M Mar 23 2020 Pver_genome_assembly_v1.0.gff3\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Download with wget.\n", "# Use --quiet option to prevent wget output from printing too many lines to notebook\n", "# Use --continue to prevent re-downloading fie if it's already been downloaded.\n", "# Use --no-check-certificate to avoid download error from gannet\n", "wget --quiet \\\n", "--continue \\\n", "--no-check-certificate \\\n", "${url}/${gff}\n", "\n", "ls -ltrh \"${gff}\"" ] }, { "cell_type": "markdown", "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb", "metadata": {}, "source": [ "### Examine GFF" ] }, { "cell_type": "markdown", "id": "ae4809dc-a4cd-453a-b83e-22e95a5c7922", "metadata": {}, "source": [ "#### Check first 20 lines" ] }, { "cell_type": "code", "execution_count": 5, "id": "18862291-d1ec-4b62-8b22-959404538a7f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# ORIGINAL: Pver_g1.t2 original gene structure, not modified by PASA\n", "Pver_Sc0000000_size2095917\t.\tgene\t13766\t20466\t.\t+\t.\tID=Pver_gene_g1;Name=Pver_g1.t1\n", "Pver_Sc0000000_size2095917\t.\tmRNA\t13766\t20466\t.\t+\t.\tID=Pver_g1.t2;Parent=Pver_gene_g1;Name=Pver_g1.t1\n", "Pver_Sc0000000_size2095917\t.\tfive_prime_UTR\t13766\t14013\t.\t+\t.\tID=Pver_g1.t2.utr5p1;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t13766\t14098\t.\t+\t.\tID=Pver_g1.t2.exon1;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t14014\t14098\t.\t+\t0\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t16629\t16667\t.\t+\t.\tID=Pver_g1.t2.exon2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t16629\t16667\t.\t+\t1\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t17615\t17698\t.\t+\t.\tID=Pver_g1.t2.exon3;Parent=Pver_g1.t2\n", 
"Pver_Sc0000000_size2095917\t.\tCDS\t17615\t17698\t.\t+\t1\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t18109\t18420\t.\t+\t.\tID=Pver_g1.t2.exon4;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t18109\t18420\t.\t+\t1\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t18845\t19071\t.\t+\t.\tID=Pver_g1.t2.exon5;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t18845\t19071\t.\t+\t1\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t19404\t19581\t.\t+\t.\tID=Pver_g1.t2.exon6;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t19404\t19581\t.\t+\t0\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t19848\t20466\t.\t+\t.\tID=Pver_g1.t2.exon7;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t19848\t19873\t.\t+\t1\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tthree_prime_UTR\t19874\t20466\t.\t+\t.\tID=Pver_g1.t2.utr3p1;Parent=Pver_g1.t2\n", "#PROT Pver_g1.t2 Pver_gene_g1\tMLTRYCLGSQKLTPLIGNVTVTFIIPEKELSQPSCILFNFQDFKTHLSKIMMYSPLTFVLFVALTFQSTVAIEYSRIGCYRDTLVKPRPLPELIENFRGGRVDWNNLNNTIAACAEAAKKKGYLYFGLQFYGECWSGPQAQLTYARDGPSKNCSKGVGEERANFVYKIKLLEKENECTTYRVLDSADRSKTNVNTVSQGDKCDHWNSGFVRNAWYRFTGAAGQTMADECVQAGSCQTTMAGWMNGTHPKVFDGIQRRKACFSSESNPYKRQNNNCCERQIYIHVRNCGEFYVYKLPSTPGCFLRYCGSGVSQNKNA*\n" ] } ], "source": [ "%%bash\n", "head -n 20 \"${data_dir}\"/\"${gff}\"" ] }, { "cell_type": "markdown", "id": "e9c2094b-44f9-48b2-8a8a-9c644e11e670", "metadata": {}, "source": [ "#### Count unique number of fields\n", "\n", "This identifies if there are rows with >9 fields (which there shouldn't be in a [GFF3](http://gmod.org/wiki/GFF3))." 
] }, { "cell_type": "code", "execution_count": 15, "id": "dce4a4c5-8efc-40e6-8d98-1a450398fa9d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "List of number of fields in /home/sam/data/M_capitata/genomes/Pver_genome_assembly_v1.0.gff3:\n", "\n", "1\n", "10\n", "11\n", "2\n", "9\n", "\n", "--------------------------------------------------------------------------\n", "\n", "\n", "Preview of lines with 1 field(s):\n", "\n", "# ORIGINAL: Pver_g1.t2 original gene structure, not modified by PASA\n", "# PASA_UPDATE: Pver_g2.t1, single gene model update, valid-1, status:[pasa:asmbl_2,status:12], valid-1\n", "# PASA_UPDATE: Pver_g3.t1, single gene model update, valid-1, status:[pasa:asmbl_3,status:8], valid-1\n", "# PASA_UPDATE: Pver_g4.t1, single gene model update, valid-1, status:[pasa:asmbl_4,status:13], valid-1\n", "# ORIGINAL: Pver_g5.t1 original gene structure, not modified by PASA\n", "# PASA_UPDATE: Pver_g6.t1, single gene model update, valid-1, status:[pasa:asmbl_6,status:8], valid-1\n", "# ORIGINAL: Pver_g7.t1 original gene structure, not modified by PASA\n", "# PASA_UPDATE: Pver_g8.t1, single gene model update, valid-1, status:[pasa:asmbl_8,status:8], valid-1\n", "# PASA_UPDATE: Pver_g9.t1, single gene model update, valid-1, status:[pasa:asmbl_11,status:13], valid-1\n", "# PASA_UPDATE: Pver_g10.t1, single gene model update, valid-1, status:[pasa:asmbl_12,status:13], valid-1\n", "\n", "--------------------------------------------------------------------------\n", "\n", "Preview of lines with 10 field(s):\n", "\n", "Pver_Sc0000004_size1560107\t.\tCDS\t1538\t1660\t.\t-\t2\tID=cds.Pver_g535.t1;Parent=Pver_g535.t1\t3_prime_partial true\n", "Pver_Sc0000005_size1451149\t.\tCDS\t6000\t6369\t.\t+\t0\tID=cds.Pver_g658.t1;Parent=Pver_g658.t1\t5_prime_partial true\n", "Pver_Sc0000015_size1181740\t.\tCDS\t1180243\t1180328\t.\t+\t0\tID=cds.Pver_g1699.t2;Parent=Pver_g1699.t2\t3_prime_partial true\n", 
"Pver_Sc0000020_size1082251\t.\tCDS\t991\t1066\t.\t+\t0\tID=cds.Pver_g1987.t2;Parent=Pver_g1987.t2\t5_prime_partial true\n", "Pver_Sc0000023_size1082203\t.\tCDS\t1\t215\t.\t-\t2\tID=cds.Pver_g2184.t1;Parent=Pver_g2184.t1\t3_prime_partial true\n", "Pver_Sc0000034_size904800\t.\tCDS\t3640\t4620\t.\t-\t0\tID=cds.Pver_g2932.t1;Parent=Pver_g2932.t1\t3_prime_partial true\n", "Pver_Sc0000040_size808600\t.\tCDS\t2\t248\t.\t+\t0\tID=cds.Pver_g3290.t1;Parent=Pver_g3290.t1\t5_prime_partial true\n", "Pver_Sc0000048_size787391\t.\tCDS\t4784\t4886\t.\t+\t0\tID=cds.Pver_g3688.t1;Parent=Pver_g3688.t1\t5_prime_partial true\n", "Pver_Sc0000053_size770660\t.\tCDS\t769075\t769758\t.\t+\t2\tID=cds.Pver_g3960.t1;Parent=Pver_g3960.t1\t3_prime_partial true\n", "Pver_Sc0000055_size736968\t.\tCDS\t350\t452\t.\t+\t0\tID=cds.Pver_g4033.t1;Parent=Pver_g4033.t1\t5_prime_partial true\n", "\n", "--------------------------------------------------------------------------\n", "\n", "Preview of lines with 11 field(s):\n", "\n", "Pver_Sc0002193_size7256\t.\tCDS\t1\t1630\t.\t-\t0\tID=cds.Pver_g20342.t1;Parent=Pver_g20342.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0002913_size4362\t.\tCDS\t1\t1500\t.\t-\t0\tID=cds.Pver_g20701.t1;Parent=Pver_g20701.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0002951_size4270\t.\tCDS\t1321\t4270\t.\t+\t0\tID=cds.Pver_g20712.t1;Parent=Pver_g20712.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0003413_size3396\t.\tCDS\t1\t3394\t.\t-\t0\tID=cds.Pver_g20894.t1;Parent=Pver_g20894.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0003758_size2910\t.\tCDS\t1308\t2908\t.\t-\t0\tID=cds.Pver_g21008.t1;Parent=Pver_g21008.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0005005_size1901\t.\tCDS\t2\t1901\t.\t+\t0\tID=cds.Pver_g21369.t1;Parent=Pver_g21369.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0005269_size1786\t.\tCDS\t1\t1785\t.\t-\t0\tID=cds.Pver_g21442.t1;Parent=Pver_g21442.t1\t5_prime_partial 
true\t3_prime_partial true\n", "Pver_Sc0006090_size1463\t.\tCDS\t2\t1463\t.\t+\t0\tID=cds.Pver_g21627.t1;Parent=Pver_g21627.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0006735_size1286\t.\tCDS\t1\t1285\t.\t-\t0\tID=cds.Pver_g21746.t1;Parent=Pver_g21746.t1\t5_prime_partial true\t3_prime_partial true\n", "Pver_Sc0007030_size1213\t.\tCDS\t1\t1213\t.\t-\t0\tID=cds.Pver_g21783.t1;Parent=Pver_g21783.t1\t5_prime_partial true\t3_prime_partial true\n", "\n", "--------------------------------------------------------------------------\n", "\n", "Preview of lines with 2 field(s):\n", "\n", "#PROT Pver_g1.t2 Pver_gene_g1\tMLTRYCLGSQKLTPLIGNVTVTFIIPEKELSQPSCILFNFQDFKTHLSKIMMYSPLTFVLFVALTFQSTVAIEYSRIGCYRDTLVKPRPLPELIENFRGGRVDWNNLNNTIAACAEAAKKKGYLYFGLQFYGECWSGPQAQLTYARDGPSKNCSKGVGEERANFVYKIKLLEKENECTTYRVLDSADRSKTNVNTVSQGDKCDHWNSGFVRNAWYRFTGAAGQTMADECVQAGSCQTTMAGWMNGTHPKVFDGIQRRKACFSSESNPYKRQNNNCCERQIYIHVRNCGEFYVYKLPSTPGCFLRYCGSGVSQNKNA*\n", "#PROT Pver_g2.t1 Pver_gene_g2\tMITPQFLLLQFFLSCLIAAEENEKINLFRDEEDFHYEPIGCFGDKLDVPRPLPLLIRNYRSRPYRVDWNNINNTIQACAKEVKKAGYVYFGLQFYGECWSGPHAHLTYDEDGKSTRCVNGVGKRMANFVYRLVFKECKEYRILQAPDRSIHHPYTPSAPCDISLKPSWYRFEGRAGTAMANSCPSRFKCGTIVPGWLNGPLPSVQDGIVSREICFNQDRDCCFTSTQAKVRNCAGFYVFYLRSAPYCQLRYCGNGRIG*\n", "#PROT Pver_g3.t1 Pver_gene_g3\tMNLLVLCSVFLLYFGYYTRCADASQYVKVGCFRDKLSPRARALPELLANYRGNINWNNLMEVVEKCAKKAKEKNYMYFAIQFYGECWSGATAPKTYDRYGRSSPCTSNSVVGTALTNVVYRFAGDEQECVKYEELNKVDRSESSRLISGTSPACDKELKPGWYRFQKPAGNLMASKCLPSARCGTVVTGWLSSPHPKMTDGIVNGKVCYSWKSDCCKWDQDIKLRNCGRFYVYKLAGTQACPMGFCGSAASV*\n", "#PROT Pver_g4.t1 Pver_gene_g4\tMKLKTLAAVAFLCLEYSLSLTDASQFVKVGCFKDKRSPRARALPELIANYRGNIDWNNTMEVVDKCAKRAKEKNYMYFGIQFYGECWSGATAPLTYDRYGRSLHCTSEVGGMHANLVYRFVGEESECLQYLQLSTRDRSVESDPPAGLAAICDNSLSPRWYRFLRPAGDKMASKCVPTQKCGTAVTGWLSSPHPKVSDGIVNGKVCFHWLNECCFRDQVIKVRDCGRFFVYKLNKPEGCPMRFCGSNVE*\n", "#PROT Pver_g5.t1 
Pver_gene_g5\tMRRIMEVLYSLLVIFFMADTCCSQTCKNANWWHTFDREGWSYCDSENQYITGLWRNDNKGSNDGIYLIEHAKCCFAPYGVHAQDIPASCKKANWWKVLDGTNKWATCPDGYFMGGLYRTGKHHWLHNIEEARCCKPVGLPEKYADCYDENVWSSFDGKGLSECKRKGYYMAGIFRGECDKLYCLEKFKCCRMIAFMASSQQRYNIGVKSKALITLSIALVSFSHSQEVAVRDSEPQSKQLTPIPEFRPIGCFVDSGAIPRPIPKLVANFRGNIDWHNLNKTVLDCARRVNSKGFRYFGIQFYGECWSGENAELTYNKQGTSKNCFRGLGERKANFVYAFVVKEINARLANGSSSRSGRVEVFHRGSWGAVCGKEWDIRDADVVCRMLGYPGALNAYQDDRYRSGKGRVRLGDVQCNGNEENLAFCPYKDYSVCSQSGEAGVKCRSVSDSIQPRIPYRLSGGNSRSEGRVEVYHDRKWGTICDRGWDLIDAGIFCRMLAYTGARATPKYGQGSGPIWLSDVKCIGTEDSIFRCENAGWKNVDNCDHSNDAGAECYYD*\n", "#PROT Pver_g6.t1 Pver_gene_g6\tMKSAIYSLLIILFLADTCCSQTCSNANWWISFDGEGWSYCDHKNQYMTGLWRNDAKGSDDGIYLIEYAKCCVAPYGMKDVPASCKTANWWGVLDSNNKWATCPDGYFMGGLYRTGDDPWLHNIEEARCCKPEGLPDRYADCYIENVWGSFDGKGLSECKREGYYMAGIFRGDCDKLYCLEEFKCCRMIAIEGPPMLADETKSKE*\n", "#PROT Pver_g7.t1 Pver_gene_g7\tMNIILFKSLKVILERSETILCSPYYPNIHRIEKKRTKDTVNFIMKKFLLFSMATLMMMDQCWSQCRKANWWGSFDKEGWSKCASSVEYLKGFYRNNKNNNDPIYLLEEGRCCKAPPPNQNQASTCKNANWWGVLDKTNRWAFCPTGYFLQGLYRSKNHNIHNIEEGHCCKPNNLPSSYLRCYEHDISSSFDNKGWSECDSDHYLTGVYRGGCDKLQCIEKIKCCMMPDSCKMANWWKAFDKKGWVQCDSTKHYITGLYRNNNWGKNDKIFLLEEAKCCPAPPPYQNTGSTCRDANWWGVLDKTNSWAVCPAGYFVRGFYRNNGAWLHHLEMGKCCKPNGFPDRYEHCYNEDVKSSFDRRGLSKCQREGYYLAGIFRGGCDYLYCIESFKCCKMNVDECRTKNPCQNGAACSDQPGTYKCTCKSGFTGKNCESDINECSKSPCKNGAKCVNLKGSYRCDCKSGYTGKNCESDKNECSANPCKNGATCVNLQGSYRCDCKSGYTGNHCESDKNECSTSPCKNGATCVNLQGSYRCDCKSGYTGKHCDSDVNECSNNPCKHGATCVNLQGGHRCDCKSGYTGSSCESDINECSNSPCKNGATCVNLQGSYRCDCKTGYTGNNCETDVNECSNNPCKNGATCVNIQGGHRCDCKSGYTGISCESDINECSNSPCKNGATCVNLSGSYRCDCKTGYTGNNCETDIDECSNNPCQNGALCANLQGSYRCDCKTGYTGNQCQTDINECAPAPCQNGGTCVDLVGSFRCDCPAEFEGANCENAAENGIEECEM*\n", "#PROT Pver_g8.t1 
Pver_gene_g8\tMKLLLASLVALIISDVFTTLEGKPSTGRSFLKPRILSRGRRAVPDVSSIRIDIRSEGCNDPGKTPNTCGRAYIKVNGNEHSKKKRGHNVVVLDAITGIVEHSVSFDTHGSTAAANQLKDFLNGIAGDKIILIAVQDEGSRYLKPAFDALMKIGAHDLDFLNHRGSYALVGYSREKKPSYVQQVQNKGGKGPSVISTTVPLTKNPFVDIDIRSEGCNDPNKKPDTCGIAYIKVDGKDHSLHRRGHNVVIVERKTGRVLKSEAFDTHGDGSAGTRLKDFLNAQGEDKIVLVAVQDSAASHVGVALDSLRRVGAIDPILVEFRGSYALIGTLDANKPPWVTQDQHPRYKGPSEISIRIPLSDCQKALGMENYGIPNGKVRASSEWDSNHAAIQGRLHYKPPRGKQGAWSARHNNINQWLQVDLGSAFIKVTGVATQGRYNYDQWVTKYKLQYSNDGATFTYYKEPGQTADMVFSGNTDRNTVVYHFLNAPVTARYIRFRPVTWSGHISMRVELYGCSACEKALGMASYAIPNGQVKASSEWDPNHAAIQGRLHYLPPPGKQGGWSAKYNKANQWLQIDLGALFRVTAVATQGRSNYNQWVTKYKLQYSDNGATFTYYMEAGQNVAKEFVANKDRNTVVYHSLNPPKTTHFIRFRPVAWKSHISMRVEVYGCSACGEALGMASYAIPNGQVKASSEWDPNHAAIQGRLHYLPPPGKQGGWSAKHNNANQWLQIDLGALLKVTAVATQGRSNHDQWVTKYKLQYSDDGATFTYYMEAGQSVAKEFVANKDRSSVVYHSLNPPKLTRFIRFRPVSWKSHISMRVEVYGCSARDYCAQNPCKNGATCSNVEEGYQCTCKPGYSGAQCDQDINECTNSPCQNGATCVNLQGSYRCDCKSGYNGNKCENDINECSNNPCKNGATCVNLQGSYSCDCKSGYNGNNCENDINECTNSPCKNGATCVNLQGSYRCDCKSGYDGNNCETDINECSNNPCKNGATCVNLQGSHRCDCKSGYDGNNCENDINECSNSPCKNGATCVNLVGSYRCDCKSGYDGNNCENDVNECTNNPCKNGATCVNLSGGHRCDCAKGYSGSSCETAINYCAPDPCLNGGTCVDLVDGFRCDCAAGWLGIICDEPEVDECEE*\n", "#PROT Pver_g9.t1 Pver_gene_g9\tMEFAFNINDLLPDLITVVDDKLAPFRSRIKNDRYEFANKQSQLKIVIDHMGEASTRAQGLRAPITSALKLQHSDHRLYVMKDSAANNGQGAVVGILKVGHKKLFLMDMQSVQYEVNPLCVLDFYVHESRQRTGCGKRLFEHMLKMEGRMVYQLAIDRPSHKFVSFLKKYHGFKNAIPQANNFVIHEHFFSDLQAVGHVPRRSYRFANASLSGKPPIHTYRKTSHARDTPPLLLRRQSSSSRQSSRPNSGKEFSHDTAHSGESEPSALQRINNSLQDINLPPVVPRPSSRNSAPGSRGSSAGRRSVTRQMELGTPEATGHVSSETYNASRGLAARATSYSRHSNRGNSADFSKTKKDVVVTGTKVSDNFYRNNSGGRARSLHLTSVENPFLGHGVMEKSSLQRKQERTGTQEQMASTMVPPEEGAQDQNFNITDKKETSETSQQSGITPSGSFTVGTMFKAHFGNRQPFIGSSWNIFGVPTFMNKDSNSAHYAYTRRTQSRHHPF*\n", "#PROT Pver_g10.t1 
Pver_gene_g10\tMIAETVEGWLTILAEDGNEWLSRLCVLFSNEKKLRTFVDDVEYLDCARFLAATAKIKAKLCDSPDLILNPTHEGLSRAHRLSRRSRSENALSAALNGNRSKLSDFFSKKIADGKRNRSLTRSLFLPGTNNYDSGFSSLSPPTIPRWRSHETLESKMKTTSPDSVDLSISGHVTVRPIHPSILSRQHCFQVTTPHETKYFSCRSSQELEKWVVSIKKSIQPTRDVQYRTDIGLTVWIVEAKGLTDKPKRRFYCELFFDKVLYAKTSSKTKSDILFWGENFAFNDLSEVKTVTVQIYKETDGKNKKDKRKPVGSVDVPVDSLEMGVEVEKWYPVTMESNKTNGGEASSIRLRFKYQKVSILPVRSYVELSEYLNRNYMTICKALEPVVSVKQKDEISRVLLRILQACGKATEFLSTIVMDEVNNLEDENLVFRGNSFATKAMDSYMKMIGESYLQDTLGDFVQSVYEFEEDCEVDPSKKSSGNLESNKENLKTFVEKAWNLIMCSACLFPSDLKVTFHEFRRWCEGRSEDLSTKLISGSIFLRFLCPAILSPSLFHLVQEYPDERTSRALTLIAKVIQNLANFTRFGGKEEYMDFMNGFVEKEFVNMRKFLNEISSRTDTTENHYEGIIDIGRELALMFNMLKDQTSKMNQDALETLRPLQFILKSLQEIYHDSERKFAVGNESLVAHRKWASSSDLVSDQDKNNITLTPVSQIRIMKSPNKRTQNPPIRTSSESNLVFNGGTAATTPEHLMTSRYDESILSPISDIDPPSLSPRLRANYDDDLQSSFRRRSYRRAIASDEVSKAIPGAIERERSSSERRSLTRNSDSHRRSQSESPHVISSVDSGSPTEARVIYKGELSNSRSSLKSKESANHRHYRAVYDSRQRDGKDGCLKRRSLPRGVPPGSSPIITDRSRTSSPEPPPPPRPTAPKPLLSSASVDPSHSSPRLNNHDSSRFLNPAVGRVARSPSGTSSSSSGSEESWHSVGSSVEGGGGGGGLARRRARKMPRTPSPEDRRRGSGSNPWERMPPRAEDAFDNGYDVPPPKTVAEYEKELQDLREELLQTKRTLASTHEQLILQESSTHKLVASFKERLAESESSLQQLRQEKDKEMKELLDRLVSVETELRQEQREMQEVIQAKQVIIEAQERRIKSTDSSNAKLIATLNQVGGSRASNKVLNYPSSDL*\n", "\n", "--------------------------------------------------------------------------\n", "\n", "Preview of lines with 9 field(s):\n", "\n", "Pver_Sc0000000_size2095917\t.\tgene\t13766\t20466\t.\t+\t.\tID=Pver_gene_g1;Name=Pver_g1.t1\n", "Pver_Sc0000000_size2095917\t.\tmRNA\t13766\t20466\t.\t+\t.\tID=Pver_g1.t2;Parent=Pver_gene_g1;Name=Pver_g1.t1\n", "Pver_Sc0000000_size2095917\t.\tfive_prime_UTR\t13766\t14013\t.\t+\t.\tID=Pver_g1.t2.utr5p1;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t13766\t14098\t.\t+\t.\tID=Pver_g1.t2.exon1;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t14014\t14098\t.\t+\t0\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t16629\t16667\t.\t+\t.\tID=Pver_g1.t2.exon2;Parent=Pver_g1.t2\n", 
"Pver_Sc0000000_size2095917\t.\tCDS\t16629\t16667\t.\t+\t1\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t17615\t17698\t.\t+\t.\tID=Pver_g1.t2.exon3;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\tCDS\t17615\t17698\t.\t+\t1\tID=cds.Pver_g1.t2;Parent=Pver_g1.t2\n", "Pver_Sc0000000_size2095917\t.\texon\t18109\t18420\t.\t+\t.\tID=Pver_g1.t2.exon4;Parent=Pver_g1.t2\n", "\n", "--------------------------------------------------------------------------\n", "\n" ] } ], "source": [ "%%bash\n", "# Capture number of fields (NF) in each row in array.\n", "field_count_array=($(awk -F \"\\t\" '{print NF}' \"${data_dir}/${gff}\" | sort --unique))\n", "\n", "# Check array contents\n", "echo \"List of number of fields in ${data_dir}/${gff}:\"\n", "echo \"\"\n", "for field_count in \"${field_count_array[@]}\"\n", "do\n", " echo \"${field_count}\"\n", "done\n", "\n", "echo \"\"\n", "echo \"${break_line}\"\n", "echo \"\"\n", "\n", "# Preview of each line \"type\" with a given number of fields\n", "# Check array contents\n", "echo \"\"\n", "for field_count in \"${field_count_array[@]}\"\n", "do\n", " echo \"Preview of lines with ${field_count} field(s):\"\n", " echo \"\"\n", " awk \\\n", " -v field_count=\"${field_count}\" \\\n", " -F \"\\t\" \\\n", " 'NF == field_count' \\\n", " \"${data_dir}/${gff}\" \\\n", " | head\n", " echo \"\"\n", " echo \"${break_line}\"\n", " echo \"\"\n", "done\n" ] }, { "cell_type": "markdown", "id": "9b1897f5-dc4d-419d-9709-1f530e9471fe", "metadata": {}, "source": [ "In the above results, we can see that there are rows with 10 and 11 fields, due to incorrect tabs entered in Field 9. This creates an invalid GFF3 file." ] }, { "cell_type": "markdown", "id": "562086c0-af24-4a27-9146-337b4addef00", "metadata": {}, "source": [ "### Fix invalid tabs in Field 9\n", "\n", "This will remove tabs in Field 9 and join those extra fields into Field 9 using a semi-colon separator with the `Note=` attribute." 
] }, { "cell_type": "code", "execution_count": 19, "id": "84df9476-cb19-452f-b11d-786b38a1fdca", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t63m22.307s\n", "user\t62m10.233s\n", "sys\t12m37.124s\n" ] } ], "source": [ "%%bash\n", "time \\\n", "while read -r line\n", "do\n", " # Count number of fields\n", " number_of_fields=$(echo \"${line}\" | awk -F \"\\t\" '{print NF}')\n", "\n", " # If number of fields is 9, print the entire line\n", " if [[ \"${number_of_fields}\" -eq 9 ]]; then\n", " echo \"${line}\"\n", " # If there are 10 fields, cut/capture the first 9 and\n", " # capture the 10th.\n", " # Use printf to join 10th field to 9th via semi-colon\n", " elif [[ \"${number_of_fields}\" -eq 10 ]]; then\n", " first_nine=$(echo \"${line}\" | cut -f 1-9)\n", " tenth=$(echo \"${line}\" | cut -f 10 )\n", " printf \"%s;%s\\n\" \"${first_nine}\" \"Note=${tenth}\"\n", " # If there are 11 fields, cut/capture the first 9, followed\n", " # by capturing the 10th and 11th fields.\n", " # Use printf to join 10th and 11th field to 9th via semi-colon.\n", " elif [[ \"${number_of_fields}\" -eq 11 ]]; then\n", " first_nine=$(echo \"${line}\" | cut -f 1-9)\n", " tenth=$(echo \"${line}\" | cut -f 10)\n", " eleventh=$(echo \"${line}\" | cut -f 11)\n", " printf \"%s;%s;%s\\n\" \"${first_nine}\" \"Note=${tenth}\" \"Note=${eleventh}\"\n", " else\n", " # If the line doesn't match any of the above, just print it.\n", " echo \"${line}\"\n", " fi\n", "\n", "done < \"${data_dir}/${gff}\" > \"${analysis_dir}/${valid_gff}\"" ] }, { "cell_type": "markdown", "id": "d25f220e-4d5f-4625-bb7c-cb3cc47e7bbb", "metadata": {}, "source": [ "#### Compare line numbers between the two GFFs to make sure we didn't lose anything" ] }, { "cell_type": "code", "execution_count": 20, "id": "7846217f-3515-4c51-a45a-e45581736eb2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "575514 
/home/sam/data/M_capitata/genomes/Pver_genome_assembly_v1.0.gff3\n", "575514 /home/sam/analyses/20230127-pver-gff_to_gtf/Pver_genome_assembly_v1.0-valid.gff3\n" ] } ], "source": [ "%%bash\n", "wc -l \"${data_dir}/${gff}\"\n", "wc -l \"${analysis_dir}/${valid_gff}\"" ] }, { "cell_type": "markdown", "id": "498c69c2-16f4-4aa8-9c18-3c29f70cba8e", "metadata": {}, "source": [ "#### Count fields to make sure no more rows with >9 fields" ] }, { "cell_type": "code", "execution_count": 21, "id": "4a657a55-d17f-4d52-bde8-740dcc768c44", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n", "9\n" ] } ], "source": [ "%%bash\n", "awk -F \"\\t\" '{print NF}' \"${analysis_dir}/${valid_gff}\" | sort --unique" ] }, { "cell_type": "markdown", "id": "b1598092-a1d5-4c8d-ab1c-3bd9d8fcc73a", "metadata": {}, "source": [ "### Convert GFF to GTF" ] }, { "cell_type": "code", "execution_count": 22, "id": "9cf5e7d6-c5eb-4bc5-af4f-b2955c4de478", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "${gffread} -E \\\n", "\"${analysis_dir}/${valid_gff}\" -T \\\n", "1> ${analysis_dir}/\"${gtf}\" \\\n", "2> ${analysis_dir}/gffread-gff_to_gtf.stderr" ] }, { "cell_type": "markdown", "id": "e50df042-59c0-4327-b3a4-f14399eba05f", "metadata": {}, "source": [ "### Inspect GTF" ] }, { "cell_type": "code", "execution_count": 23, "id": "98a870cf-518f-44e5-8c7a-eb32c219ea68", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pver_Sc0000000_size2095917\t.\ttranscript\t13766\t20466\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\"\n", "Pver_Sc0000000_size2095917\t.\texon\t13766\t14098\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\texon\t16629\t16667\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\texon\t17615\t17698\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id 
\"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\texon\t18109\t18420\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\texon\t18845\t19071\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\texon\t19404\t19581\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\texon\t19848\t20466\t.\t+\t.\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\tCDS\t14014\t14098\t.\t+\t0\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n", "Pver_Sc0000000_size2095917\t.\tCDS\t16629\t16667\t.\t+\t2\ttranscript_id \"Pver_g1.t2\"; gene_id \"Pver_gene_g1\";\n" ] } ], "source": [ "%%bash\n", "head ${analysis_dir}/\"${gtf}\"" ] }, { "cell_type": "markdown", "id": "26445ae7-6472-4814-bde2-33c646ffcf22", "metadata": {}, "source": [ "### Generate checksum(s)" ] }, { "cell_type": "code", "execution_count": 24, "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9828841097f80b93fc2c599b76432b7c gffread-gff_to_gtf.stderr\n", "5dd8f21a4faea1f46c48a5ab253749d7 Pver_genome_assembly_v1.0-valid.gff3\n", "c3cc8fb576bcf39dd17b6d229100aa56 Pver_genome_assembly_v1.0-valid.gtf\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "for file in *\n", "do\n", " md5sum \"${file}\" | tee --append checksums.md5\n", "done" ] }, { "cell_type": "markdown", "id": "3eed5b68-470e-4b9a-af97-9ba2f27a51b5", "metadata": {}, "source": [ "### Document GffRead program options" ] }, { "cell_type": "code", "execution_count": 25, "id": "61be6360-3e83-4cd9-a1db-4554995b8771", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "gffread v0.12.7. 
Usage:\n", "gffread [-g | ] [-s ] \n", " [-o ] [-t ] [-r []:- [-R]]\n", " [--jmatch :-] [--no-pseudo] \n", " [-CTVNJMKQAFPGUBHZWTOLE] [-w ] [-x ] [-y ]\n", " [-j ][--ids | --nids ] [--attrs ] [-i ]\n", " [--stream] [--bed | --gtf | --tlf] [--table ] [--sort-by ]\n", " [] \n", "\n", " Filter, convert or cluster GFF/GTF/BED records, extract the sequence of\n", " transcripts (exon or CDS) and more.\n", " By default (i.e. without -O) only transcripts are processed, discarding any\n", " other non-transcript features. Default output is a simplified GFF3 with only\n", " the basic attributes.\n", " \n", "Options:\n", " --ids discard records/transcripts if their IDs are not listed in \n", " --nids discard records/transcripts if their IDs are listed in \n", " -i discard transcripts having an intron larger than \n", " -l discard transcripts shorter than bases\n", " -r only show transcripts overlapping coordinate range ..\n", " (on chromosome/contig , strand if provided)\n", " -R for -r option, discard all transcripts that are not fully \n", " contained within the given range\n", " --jmatch only output transcripts matching the given junction\n", " -U discard single-exon transcripts\n", " -C coding only: discard mRNAs that have no CDS features\n", " --nc non-coding only: discard mRNAs that have CDS features\n", " --ignore-locus : discard locus features and attributes found in the input\n", " -A use the description field from and add it\n", " as the value for a 'descr' attribute to the GFF record\n", " -s is a tab-delimited file providing this info\n", " for each of the mapped sequences:\n", " \n", " (useful for -A option with mRNA/EST/protein mappings)\n", "Sorting: (by default, chromosomes are kept in the order they were found)\n", " --sort-alpha : chromosomes (reference sequences) are sorted alphabetically\n", " --sort-by : sort the reference sequences by the order in which their\n", " names are given in the file\n", "Misc options: \n", " -F keep all GFF attributes (for 
non-exon features)\n", " --keep-exon-attrs : for -F option, do not attempt to reduce redundant\n", " exon/CDS attributes\n", " -G do not keep exon attributes, move them to the transcript feature\n", " (for GFF3 output)\n", " --attrs only output the GTF/GFF attributes listed in \n", " which is a comma delimited list of attribute names to\n", " --keep-genes : in transcript-only mode (default), also preserve gene records\n", " --keep-comments: for GFF3 input/output, try to preserve comments\n", " -O process other non-transcript GFF records (by default non-transcript\n", " records are ignored)\n", " -V discard any mRNAs with CDS having in-frame stop codons (requires -g)\n", " -H for -V option, check and adjust the starting CDS phase\n", " if the original phase leads to a translation with an \n", " in-frame stop codon\n", " -B for -V option, single-exon transcripts are also checked on the\n", " opposite strand (requires -g)\n", " -P add transcript level GFF attributes about the coding status of each\n", " transcript, including partialness or in-frame stop codons (requires -g)\n", " --add-hasCDS : add a \"hasCDS\" attribute with value \"true\" for transcripts\n", " that have CDS features\n", " --adj-stop stop codon adjustment: enables -P and performs automatic\n", " adjustment of the CDS stop coordinate if premature or downstream\n", " -N discard multi-exon mRNAs that have any intron with a non-canonical\n", " splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)\n", " -J discard any mRNAs that either lack initial START codon\n", " or the terminal STOP codon, or have an in-frame stop codon\n", " (i.e. 
only print mRNAs with a complete CDS)\n", " --no-pseudo: filter out records matching the 'pseudo' keyword\n", " --in-bed: input should be parsed as BED format (automatic if the input\n", " filename ends with .bed*)\n", " --in-tlf: input GFF-like one-line-per-transcript format without exon/CDS\n", " features (see --tlf option below); automatic if the input\n", " filename ends with .tlf)\n", " --stream: fast processing of input GFF/BED transcripts as they are received\n", " ((no sorting, exons must be grouped by transcript in the input data)\n", "Clustering:\n", " -M/--merge : cluster the input transcripts into loci, discarding\n", " \"redundant\" transcripts (those with the same exact introns\n", " and fully contained or equal boundaries)\n", " -d : for -M option, write duplication info to file \n", " --cluster-only: same as -M/--merge but without discarding any of the\n", " \"duplicate\" transcripts, only create \"locus\" features\n", " -K for -M option: also discard as redundant the shorter, fully contained\n", " transcripts (intron chains matching a part of the container)\n", " -Q for -M option, no longer require boundary containment when assessing\n", " redundancy (can be combined with -K); only introns have to match for\n", " multi-exon transcripts, and >=80% overlap for single-exon transcripts\n", " -Y for -M option, enforce -Q but also discard overlapping single-exon \n", " transcripts, even on the opposite strand (can be combined with -K)\n", "Output options:\n", " --force-exons: make sure that the lowest level GFF features are considered\n", " \"exon\" features\n", " --gene2exon: for single-line genes not parenting any transcripts, add an\n", " exon feature spanning the entire gene (treat it as a transcript)\n", " --t-adopt: try to find a parent gene overlapping/containing a transcript\n", " that does not have any explicit gene Parent\n", " -D decode url encoded characters within attributes\n", " -Z merge very close exons into a single exon (when intron 
size<4)\n", " -g full path to a multi-fasta file with the genomic sequences\n", " for all input mappings, OR a directory with single-fasta files\n", " (one per genomic sequence, with file names matching sequence names)\n", " -j output the junctions and the corresponding transcripts\n", " -w write a fasta file with spliced exons for each transcript\n", " --w-add for the -w option, extract additional bases\n", " both upstream and downstream of the transcript boundaries\n", " --w-nocds for -w, disable the output of CDS info in the FASTA file\n", " -x write a fasta file with spliced CDS for each GFF transcript\n", " -y write a protein fasta file with the translation of CDS for each record\n", " -W for -w, -x and -y options, write in the FASTA defline all the exon\n", " coordinates projected onto the spliced sequence;\n", " -S for -y option, use '*' instead of '.' as stop codon translation\n", " -L Ensembl GTF to GFF3 conversion, adds version to IDs\n", " -m is a name mapping table for converting reference \n", " sequence names, having this 2-column format:\n", " \n", " -t use in the 2nd column of each GFF/GTF output line\n", " -o write the output records into instead of stdout\n", " -T main output will be GTF instead of GFF3\n", " --bed output records in BED format instead of default GFF3\n", " --tlf output \"transcript line format\" which is like GFF\n", " but with exons and CDS related features stored as GFF \n", " attributes in the transcript feature line, like this:\n", " exoncount=N;exons=;CDSphase=;CDS= \n", " is a comma-delimited list of exon_start-exon_end coordinates;\n", " is CDS_start:CDS_end coordinates or a list like \n", " --table output a simple tab delimited format instead of GFF, with columns\n", " having the values of GFF attributes given in ; special\n", " pseudo-attributes (prefixed by @) are recognized:\n", " @id, @geneid, @chr, @start, @end, @strand, @numexons, @exons, \n", " @cds, @covlen, @cdslen\n", " If any of -w/-y/-x FASTA output files are 
enabled, the same fields\n", " (excluding @id) are appended to the definition line of corresponding\n", " FASTA records\n", " -v,-E expose (warn about) duplicate transcript IDs and other potential\n", " problems with the given GFF/GTF records\n" ] } ], "source": [ "%%bash\n", "# gffread -h prints its usage and exits with status 1; mask the exit code so the cell does not raise CalledProcessError\n", "${gffread} -h || true" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }