{ "cells": [ { "cell_type": "markdown", "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7", "metadata": {}, "source": [ "## Identify lncRNA Transcript IDs\n", "\n", "Use [CPC2 Standalone](https://github.com/gao-lab/CPC2_standalone) to idenitfy _P.generosa_ lncRNAs with no coding potential\n", "from the Panopea-generosa-v1.0 annotated GTF generated by [gffcompare](https://ccb.jhu.edu/software/stringtie/gffcompare.shtml).\n", "\n", "Notebook uses an aribtrary minimum length cutoff of 200bp.\n", "\n", "#### Notebook relies on:\n", "\n", "- [CPC2 Standalone](https://github.com/gao-lab/CPC2_standalone)\n", "\n", "- [bedtools getfasta](https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html)\n", "\n" ] }, { "cell_type": "markdown", "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6", "metadata": {}, "source": [ "### List computer specs" ] }, { "cell_type": "code", "execution_count": 1, "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Tue May 2 06:45:03 AM PDT 2023\n", "------------\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 22.04.2 LTS\n", "Release:\t22.04\n", "Codename:\tjammy\n", "\n", "------------\n", "HOSTNAME: \n", "computer\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Address sizes: 45 bits physical, 48 bits virtual\n", "Byte Order: Little Endian\n", "CPU(s): 4\n", "On-line CPU(s) list: 0-3\n", "Vendor ID: GenuineIntel\n", "Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n", "CPU family: 6\n", "Model: 165\n", "Thread(s) per core: 1\n", "Core(s) per socket: 1\n", "Socket(s): 4\n", "Stepping: 2\n", "BogoMIPS: 4800.01\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n", "Hypervisor vendor: VMware\n", "Virtualization type: full\n", "L1d cache: 128 KiB (4 instances)\n", "L1i cache: 128 KiB (4 instances)\n", "L2 cache: 1 MiB (4 instances)\n", "L3 cache: 64 MiB (4 instances)\n", "NUMA node(s): 1\n", "NUMA node0 CPU(s): 0-3\n", "Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported\n", "Vulnerability L1tf: Mitigation; PTE Inversion\n", "Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Meltdown: Mitigation; PTI\n", "Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Retbleed: Mitigation; IBRS\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected\n", "Vulnerability Srbds: Unknown: Dependent on hypervisor status\n", "Vulnerability Tsx async abort: Not affected\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 54Gi 6.0Gi 35Gi 322Mi 12Gi 47Gi\n", "Swap: 2.0Gi 0B 2.0Gi\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "id": "19866f68-bfad-4a83-adb5-24e271e29d06", "metadata": {}, "source": [ "### Set variables\n", "- `%env` indicates a bash variable\n", "\n", "- without `%env` is Python variable" ] }, { "cell_type": "code", "execution_count": 2, "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: transcriptomes_dir=/home/sam/data/P_generosa/transcriptomes\n", "env: genomes_dir=/home/sam/data/P_generosa/genomes\n", "env: analysis_dir=/home/sam/analyses/20230502-pgen-lncRNA-identification\n", "env: min_lncRNA_length=200\n", "env: gffcompare_gtf=Panopea-generosa-v1.0-gffcmp.annotated.gtf\n", "env: genome_fasta=Panopea-generosa-v1.0.fasta\n", "env: genome_url=gannet:/volume2/web/Atumefaciens/20191105_swoose_pgen_v074_renaming/Panopea-generosa-v1.0.fa\n", "env: gff_comp_gtf_url=gannet:/volume2/web/Atumefaciens/20230426-pgen-HISAT2-stringtie-gffcompare-RNAseq/gffcompare\n", "env: lncRNAs_gtf=20230502-pgen-lncRNA-IDs.gtf\n", "env: lncRNA_candidates=lncRNA_candidates.gtf\n", "env: lncRNA_candidates_bed=lncRNA_candidates.bed\n", "env: lncRNA_candidates_fasta=lncRNA_candidates.fasta\n", "env: lncRNA_ids=lncRNA-IDs.txt\n", "env: cpc2_table=cpc2_output_table\n", "env: cpc2=/home/sam/programs/CPC2_standalone-1.0.1/bin/CPC2.py\n", "env: bedtools=/home/sam/programs/bedtools-2.29.1/bin/bedtools\n", "env: line=-------------------------------------------------------------------------------------\n" ] } ], "source": [ "# Set directories\n", "%env transcriptomes_dir=/home/sam/data/P_generosa/transcriptomes\n", "%env genomes_dir=/home/sam/data/P_generosa/genomes\n", "%env analysis_dir=/home/sam/analyses/20230502-pgen-lncRNA-identification\n", "analysis_dir=\"20230502-pgen-lncRNA-identification\"\n", "\n", "# Set lncRNA minimum length\n", "%env min_lncRNA_length=200\n", "\n", "# Input files\n", "%env gffcompare_gtf=Panopea-generosa-v1.0-gffcmp.annotated.gtf\n", "%env genome_fasta=Panopea-generosa-v1.0.fasta\n", "# Genome FastA URL\n", "# https://gannet.fish.washington.edu/Atumefaciens/20191105_swoose_pgen_v074_renaming/Panopea-generosa-v1.0.fa\n", "%env genome_url=gannet:/volume2/web/Atumefaciens/20191105_swoose_pgen_v074_renaming/Panopea-generosa-v1.0.fa\n", "\n", "# URL of file directory\n", "# https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-HISAT2-stringtie-gffcompare-RNAseq/gffcompare\n", "%env gff_comp_gtf_url=gannet:/volume2/web/Atumefaciens/20230426-pgen-HISAT2-stringtie-gffcompare-RNAseq/gffcompare\n", "\n", "# Output file(s)\n", "%env lncRNAs_gtf=20230502-pgen-lncRNA-IDs.gtf\n", "%env lncRNA_candidates=lncRNA_candidates.gtf\n", "%env lncRNA_candidates_bed=lncRNA_candidates.bed\n", "%env lncRNA_candidates_fasta=lncRNA_candidates.fasta\n", "%env lncRNA_ids=lncRNA-IDs.txt\n", "%env cpc2_table=cpc2_output_table\n", "\n", "\n", "# Set program locations\n", "%env cpc2=/home/sam/programs/CPC2_standalone-1.0.1/bin/CPC2.py\n", "%env bedtools=/home/sam/programs/bedtools-2.29.1/bin/bedtools\n", "\n", "# Line for formatting\n", "\n", "%env line=-------------------------------------------------------------------------------------" ] }, { "cell_type": "markdown", "id": "7f204c16-2d1f-4837-93b0-1fb0e3d00d64", "metadata": {}, "source": [ "### Create analysis directory" ] }, { "cell_type": "code", "execution_count": 3, "id": "8f275e34-c56e-4754-abf7-3279667434bb", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "# Make analysis and data directory, if doesn't exist\n", "mkdir --parents \"${analysis_dir}\"\n", "\n", "mkdir --parents \"${transcriptomes_dir}\"" ] }, { "cell_type": "markdown", "id": "56052f6d-441a-4048-8a6f-39d58552283d", "metadata": {}, "source": [ "### Download GTF" ] }, { "cell_type": "code", "execution_count": 4, "id": "951fc8e9-b821-4f54-848f-f9573daadc83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 74M Apr 28 15:44 Panopea-generosa-v1.0-gffcmp.annotated.gtf\n" ] } ], "source": [ "%%bash\n", "cd \"${transcriptomes_dir}\"\n", "\n", "rsync \"${gff_comp_gtf_url}/${gffcompare_gtf}\" .\n", "\n", "\n", "ls -ltrh \"${gffcompare_gtf}\"" ] }, { "cell_type": "markdown", "id": "e50df042-59c0-4327-b3a4-f14399eba05f", "metadata": {}, "source": [ "### Inspect GTF" ] }, { "cell_type": "code", "execution_count": 5, "id": "98a870cf-518f-44e5-8c7a-eb32c219ea68", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scaffold_01\tStringTie\ttranscript\t2\t4719\t.\t+\t.\ttranscript_id \"PGEN_.00g000010.m01\"; gene_id \"MSTRG.4\"; gene_name \"PGEN_.00g000010\"; xloc \"XLOC_000001\"; ref_gene_id \"PGEN_.00g000010\"; cmp_ref \"PGEN_.00g000010.m01\"; class_code \"=\"; tss_id \"TSS1\";\n", "Scaffold_01\tStringTie\texon\t2\t125\t.\t+\t.\ttranscript_id \"PGEN_.00g000010.m01\"; gene_id \"MSTRG.4\"; exon_number \"1\";\n", "Scaffold_01\tStringTie\texon\t1995\t2095\t.\t+\t.\ttranscript_id \"PGEN_.00g000010.m01\"; gene_id \"MSTRG.4\"; exon_number \"2\";\n", "Scaffold_01\tStringTie\texon\t3325\t3495\t.\t+\t.\ttranscript_id \"PGEN_.00g000010.m01\"; gene_id \"MSTRG.4\"; exon_number \"3\";\n", "Scaffold_01\tStringTie\texon\t4651\t4719\t.\t+\t.\ttranscript_id \"PGEN_.00g000010.m01\"; gene_id \"MSTRG.4\"; exon_number \"4\";\n", "Scaffold_01\tStringTie\ttranscript\t55792\t67546\t.\t+\t.\ttranscript_id \"PGEN_.00g000040.m01\"; gene_id \"MSTRG.5\"; gene_name \"PGEN_.00g000040\"; xloc \"XLOC_000002\"; ref_gene_id \"PGEN_.00g000040\"; cmp_ref \"PGEN_.00g000040.m01\"; class_code \"=\"; tss_id \"TSS2\";\n", "Scaffold_01\tStringTie\texon\t55792\t55972\t.\t+\t.\ttranscript_id \"PGEN_.00g000040.m01\"; gene_id \"MSTRG.5\"; exon_number \"1\";\n", "Scaffold_01\tStringTie\texon\t58912\t59276\t.\t+\t.\ttranscript_id \"PGEN_.00g000040.m01\"; gene_id \"MSTRG.5\"; exon_number \"2\";\n", "Scaffold_01\tStringTie\texon\t61676\t62119\t.\t+\t.\ttranscript_id \"PGEN_.00g000040.m01\"; gene_id \"MSTRG.5\"; exon_number \"3\";\n", "Scaffold_01\tStringTie\texon\t65608\t65740\t.\t+\t.\ttranscript_id \"PGEN_.00g000040.m01\"; gene_id \"MSTRG.5\"; exon_number \"4\";\n", "\n", "-------------------------------------------------------------------------------------\n", "\n", "Number of lines:\n", "539407 /home/sam/data/P_generosa/transcriptomes/Panopea-generosa-v1.0-gffcmp.annotated.gtf\n", "\n", "-------------------------------------------------------------------------------------\n", "\n", "Number of transcripts in Panopea-generosa-v1.0-gffcmp.annotated.gtf:\n", "79269\n" ] } ], "source": [ "%%bash\n", "head ${transcriptomes_dir}/\"${gffcompare_gtf}\"\n", "\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "\n", "echo \"Number of lines:\"\n", "wc -l ${transcriptomes_dir}/*.gtf\n", "\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "\n", "echo \"Number of transcripts in ${gffcompare_gtf}:\"\n", "awk '$3 == \"transcript\" {print}' ${transcriptomes_dir}/\"${gffcompare_gtf}\" | wc -l" ] }, { "cell_type": "markdown", "id": "ec9a917a-f92b-4b47-9132-e74d3e796e23", "metadata": {}, "source": [ "## Download genome FastA" ] }, { "cell_type": "code", "execution_count": 6, "id": "2ed44d78-7fa4-48ed-a180-1c7df083f4ca", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rwxr-xr-x 1 sam sam 914M Nov 5 2019 Panopea-generosa-v1.0.fasta\n" ] } ], "source": [ "%%bash\n", "\n", "cd \"${genomes_dir}\"\n", "\n", "rsync \"${genome_url}\" .\n", "\n", "\n", "ls -ltrh \"${genome_fasta}\"" ] }, { "cell_type": "markdown", "id": "637b3c3c-023c-4a90-8e93-6a0ed2f39e60", "metadata": {}, "source": [ "### Inspect genome FastA" ] }, { "cell_type": "code", "execution_count": 7, "id": "f48f3d35-497d-40c8-94ac-f83e4f45ea92", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of sequences:\n", "18\n" ] } ], "source": [ "%%bash\n", "cd \"${genomes_dir}\"\n", "\n", "echo \"Number of sequences:\"\n", "grep \"^>\" -c \"${genome_fasta}\"" ] }, { "cell_type": "markdown", "id": "ab669ae2-08c9-43cc-8c9a-55d066a23dbb", "metadata": {}, "source": [ "## Parse out transcripts >= 200bp _and_ no overlap with reference transcripts\n", "\n", "string class_code ā€œuā€ indicates no overlap" ] }, { "cell_type": "code", "execution_count": 8, "id": "1c6441da-fc23-4939-b6d2-210ddb851335", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "14076 /home/sam/analyses/20230502-pgen-lncRNA-identification/lncRNA_candidates.gtf\n", "\n", "-------------------------------------------------------------------------------------\n", "\n", "Scaffold_01\tStringTie\ttranscript\t656906\t657583\t.\t+\t.\ttranscript_id \"MSTRG.38.5\"; gene_id \"MSTRG.38\"; xloc \"XLOC_000013\"; class_code \"u\"; tss_id \"TSS19\";\n", "Scaffold_01\tStringTie\ttranscript\t648204\t649326\t.\t+\t.\ttranscript_id \"MSTRG.39.1\"; gene_id \"MSTRG.39\"; xloc \"XLOC_000014\"; class_code \"u\"; tss_id \"TSS20\";\n", "Scaffold_01\tStringTie\ttranscript\t849165\t854552\t.\t+\t.\ttranscript_id \"MSTRG.63.1\"; gene_id \"MSTRG.63\"; xloc \"XLOC_000019\"; class_code \"u\"; tss_id \"TSS25\";\n", "Scaffold_01\tStringTie\ttranscript\t852049\t854552\t.\t+\t.\ttranscript_id \"MSTRG.63.2\"; gene_id \"MSTRG.63\"; xloc \"XLOC_000019\"; class_code \"u\"; tss_id \"TSS26\";\n", "Scaffold_01\tStringTie\ttranscript\t862415\t867481\t.\t+\t.\ttranscript_id \"MSTRG.66.1\"; gene_id \"MSTRG.66\"; xloc \"XLOC_000020\"; class_code \"u\"; tss_id \"TSS27\";\n", "Scaffold_01\tStringTie\ttranscript\t1824775\t1828291\t.\t+\t.\ttranscript_id \"MSTRG.109.1\"; gene_id \"MSTRG.109\"; xloc \"XLOC_000040\"; class_code \"u\"; tss_id \"TSS56\";\n", "Scaffold_01\tStringTie\ttranscript\t1966694\t1970033\t.\t+\t.\ttranscript_id \"MSTRG.121.2\"; gene_id \"MSTRG.121\"; xloc \"XLOC_000044\"; class_code \"u\"; tss_id \"TSS62\";\n", "Scaffold_01\tStringTie\ttranscript\t1966694\t1970033\t.\t+\t.\ttranscript_id \"MSTRG.121.1\"; gene_id \"MSTRG.121\"; xloc \"XLOC_000044\"; class_code \"u\"; tss_id \"TSS62\";\n", "Scaffold_01\tStringTie\ttranscript\t2318317\t2328097\t.\t+\t.\ttranscript_id \"MSTRG.137.3\"; gene_id \"MSTRG.137\"; xloc \"XLOC_000053\"; class_code \"u\"; tss_id \"TSS72\";\n", "Scaffold_01\tStringTie\ttranscript\t2843989\t2844204\t.\t+\t.\ttranscript_id \"MSTRG.169.1\"; gene_id \"MSTRG.169\"; xloc \"XLOC_000070\"; class_code \"u\"; tss_id \"TSS92\";\n" ] } ], "source": [ "%%bash\n", "cd \"${transcriptomes_dir}\"\n", "\n", "awk '$3 == \"transcript\" {print}' \"${gffcompare_gtf}\" | grep 'class_code \"u\"' \\\n", "| awk -v min_lncRNA_length=\"${min_lncRNA_length}\" '$5 - $4 >= min_lncRNA_length {print}' \\\n", "> \"${analysis_dir}/${lncRNA_candidates}\"\n", "\n", "wc -l \"${analysis_dir}/${lncRNA_candidates}\"\n", "\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "\n", "head \"${analysis_dir}/${lncRNA_candidates}\"" ] }, { "cell_type": "markdown", "id": "2d4d6317-352d-449f-ac29-77ffe9566112", "metadata": {}, "source": [ "## Convert GTF to BED\n", "\n", "This is needed so the subsequent `bedtools getfasta` step will have a unique name for each transcript.\n", "\n", "Otherwise, the generated names generate duplicates because they're based off of the scaffold name and the start/stop sites. As such, when there are multiple transcripts identified for the same locations, they end up with the same names." ] }, { "cell_type": "code", "execution_count": 9, "id": "5d1930f1-4491-4094-974a-9547103dea3d", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%bash\n", "\n", "while read -r line\n", "do\n", " stringtie_transcript=$(echo \"${line}\" | awk -F \"\\\"\" '{print $2}')\n", "\n", " chr=$(echo \"${line}\" | awk '{print $1}')\n", "\n", " start=$(echo \"${line}\" | awk '{print $4}')\n", " \n", " end=$(echo \"${line}\" | awk '{print $5}')\n", "\n", " score=\"0\"\n", " \n", " strand=$(echo \"${line}\" | awk '{print $7}')\n", "\n", " printf \"%s\\t%s\\t%s\\t%s\\t%s\\t%s\\n\" \"${chr}\" \"${start}\" \"${end}\" \"${stringtie_transcript}\" \"${score}\" \"${strand}\"\n", "\n", "done < \"${analysis_dir}/${lncRNA_candidates}\" > \"${analysis_dir}/${lncRNA_candidates_bed}\"" ] }, { "cell_type": "markdown", "id": "ceff8a56-6d78-4a75-ab3b-a1acaffbe370", "metadata": {}, "source": [ "### Inspect BED" ] }, { "cell_type": "code", "execution_count": 10, "id": "e4f3efff-b21a-4493-87a8-3aa553ffb5a7", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scaffold_01\t656906\t657583\tMSTRG.38.5\t0\t+\n", "Scaffold_01\t648204\t649326\tMSTRG.39.1\t0\t+\n", "Scaffold_01\t849165\t854552\tMSTRG.63.1\t0\t+\n", "Scaffold_01\t852049\t854552\tMSTRG.63.2\t0\t+\n", "Scaffold_01\t862415\t867481\tMSTRG.66.1\t0\t+\n", "Scaffold_01\t1824775\t1828291\tMSTRG.109.1\t0\t+\n", "Scaffold_01\t1966694\t1970033\tMSTRG.121.2\t0\t+\n", "Scaffold_01\t1966694\t1970033\tMSTRG.121.1\t0\t+\n", "Scaffold_01\t2318317\t2328097\tMSTRG.137.3\t0\t+\n", "Scaffold_01\t2843989\t2844204\tMSTRG.169.1\t0\t+\n", "\n", "-------------------------------------------------------------------------------------\n", "\n", "14076 /home/sam/analyses/20230502-pgen-lncRNA-identification/lncRNA_candidates.bed\n" ] } ], "source": [ "%%bash\n", "\n", "head \"${analysis_dir}/${lncRNA_candidates_bed}\"\n", "\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "\n", "wc -l \"${analysis_dir}/${lncRNA_candidates_bed}\"" ] }, { "cell_type": "markdown", "id": "89aaed2d-f163-4c36-9c7d-84c629ea8cb2", "metadata": {}, "source": [ "## Extract FastA sequences from candidate lncRNAs\n", "\n", "Use `bedtools` to extract lncRNA sequences as FastA.\n", "\n", "- `-name` option to use FastA sequence ID and coordinates as the names of the output sequences" ] }, { "cell_type": "code", "execution_count": 11, "id": "99e878aa-b927-4a14-a6fb-775c5ce2a4f2", "metadata": { "tags": [] }, "outputs": [], "source": [ "%%bash\n", "\n", "${bedtools} getfasta -fi \"${genomes_dir}/${genome_fasta}\" \\\n", "-bed \"${analysis_dir}/${lncRNA_candidates_bed}\" \\\n", "-fo \"${analysis_dir}\"/\"${lncRNA_candidates_fasta}\" \\\n", "-name" ] }, { "cell_type": "markdown", "id": "c0bedee1-3c08-4a56-a3d3-3351cbc0e6d7", "metadata": {}, "source": [ "### Inspect lncRNA FastA\n", "\n", "Number of sequences should match number of transcripts from above (`14076`)" ] }, { "cell_type": "code", "execution_count": 12, "id": "0b6922f6-4c85-4e1a-a413-589cc333904c", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of sequences:\n", "14076\n", "\n", "-------------------------------------------------------------------------------------\n", "\n", "Definition lines formatting:\n", ">MSTRG.38.5::Scaffold_01:656906-657583\n", ">MSTRG.39.1::Scaffold_01:648204-649326\n", ">MSTRG.63.1::Scaffold_01:849165-854552\n", ">MSTRG.63.2::Scaffold_01:852049-854552\n", ">MSTRG.66.1::Scaffold_01:862415-867481\n", ">MSTRG.109.1::Scaffold_01:1824775-1828291\n", ">MSTRG.121.2::Scaffold_01:1966694-1970033\n", ">MSTRG.121.1::Scaffold_01:1966694-1970033\n", ">MSTRG.137.3::Scaffold_01:2318317-2328097\n", ">MSTRG.169.1::Scaffold_01:2843989-2844204\n" ] } ], "source": [ "%%bash\n", "\n", "echo \"Number of sequences:\"\n", "grep -c \"^>\" \"${analysis_dir}\"/\"${lncRNA_candidates_fasta}\"\n", "\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "echo \"Definition lines formatting:\"\n", "grep \"^>\" \"${analysis_dir}\"/\"${lncRNA_candidates_fasta}\" | head" ] }, { "cell_type": "markdown", "id": "1c34faa9-671d-4440-b18d-12ae7f2f60b9", "metadata": {}, "source": [ "## Run CPC2 to identify non-coding RNAs" ] }, { "cell_type": "code", "execution_count": 13, "id": "d3f491a4-177b-434f-91cd-48575d6d7e74", "metadata": { "tags": [] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[INFO] read file '/home/sam/analyses/20230502-pgen-lncRNA-identification/lncRNA_candidates.fasta'\n", "[INFO] Predicting coding potential, please wait ...\n", "[INFO] Running Done!\n", "[INFO] cost time: 10s\n" ] } ], "source": [ "%%bash\n", "\n", "\"${cpc2}\" \\\n", "-i \"${analysis_dir}\"/\"${lncRNA_candidates_fasta}\" \\\n", "-o \"${analysis_dir}/${cpc2_table}\"\n" ] }, { "cell_type": "markdown", "id": "2dc99fc8-5442-4dcc-a73d-8a003593af40", "metadata": {}, "source": [ "### Inspect CPC2 table" ] }, { "cell_type": "code", "execution_count": 14, "id": "aafc4dc7-56c9-4f44-9d9c-0bf821b0f67a", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#ID\ttranscript_length\tpeptide_length\tFickett_score\tpI\tORF_integrity\tcoding_probability\tlabel\n", "MSTRG.38.5::Scaffold_01:656906-657583\t677\t72\t0.30393\t8.63104343414307\t1\t0.0370878\tnoncoding\n", "MSTRG.39.1::Scaffold_01:648204-649326\t1122\t48\t0.32127\t10.358344841003415\t1\t0.0232355\tnoncoding\n", "MSTRG.63.1::Scaffold_01:849165-854552\t5387\t64\t0.30411\t7.6276449203491214\t1\t0.0265552\tnoncoding\n", "MSTRG.63.2::Scaffold_01:852049-854552\t2503\t54\t0.32878\t6.805923652648925\t1\t0.0188306\tnoncoding\n", "MSTRG.66.1::Scaffold_01:862415-867481\t5066\t98\t0.26407\t9.644486427307129\t1\t0.0769712\tnoncoding\n", "MSTRG.109.1::Scaffold_01:1824775-1828291\t3516\t69\t0.22328\t10.963251686096193\t1\t0.0907481\tnoncoding\n", "MSTRG.121.2::Scaffold_01:1966694-1970033\t3339\t51\t0.23537\t9.184117698669432\t1\t0.0353791\tnoncoding\n", "MSTRG.121.1::Scaffold_01:1966694-1970033\t3339\t51\t0.23537\t9.184117698669432\t1\t0.0353791\tnoncoding\n", "MSTRG.137.3::Scaffold_01:2318317-2328097\t9780\t85\t0.23588\t8.743476295471194\t1\t0.0716077\tnoncoding\n", "\n", "-------------------------------------------------------------------------------------\n", "\n", "Number of entries:\n", "14076\n", "\n", "-------------------------------------------------------------------------------------\n", "\n", "Number of columns in cpc2_output_table.txt:\n", "8\n" ] } ], "source": [ "%%bash\n", "\n", "head \"${analysis_dir}/${cpc2_table}.txt\"\n", "\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "\n", "echo \"Number of entries:\"\n", "\n", "# Skips header line\n", "tail -n +2 \"${analysis_dir}/${cpc2_table}.txt\" | wc -l\n", "\n", "# Count number of columns\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "echo \"Number of columns in ${cpc2_table}.txt:\"\n", "awk '{print NF}' \"${analysis_dir}/${cpc2_table}.txt\" | sort --unique\n" ] }, { "cell_type": "markdown", "id": "9af6b2fb-09c5-493a-8762-6a110c5acde6", "metadata": {}, "source": [ "## Capture noncoding IDs\n", "\n", "The `label` column, which is column 8 (`$8`), will be used to pull out all lncRNA IDs." ] }, { "cell_type": "code", "execution_count": 15, "id": "092394e7-eb42-45a9-bfce-938e92a59cdb", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13606 /home/sam/analyses/20230502-pgen-lncRNA-identification/lncRNA-IDs.txt\n", "\n", "MSTRG.38.5\n", "MSTRG.39.1\n", "MSTRG.63.1\n", "MSTRG.63.2\n", "MSTRG.66.1\n", "MSTRG.109.1\n", "MSTRG.121.2\n", "MSTRG.121.1\n", "MSTRG.137.3\n", "MSTRG.169.1\n" ] } ], "source": [ "%%bash\n", "\n", "awk '$8 == \"noncoding\" {print $1}' \"${analysis_dir}/${cpc2_table}.txt\" |\n", "awk -F\":\" '{print $1}' \\\n", "> \"${analysis_dir}/${lncRNA_ids}\"\n", "\n", "wc -l \"${analysis_dir}/${lncRNA_ids}\"\n", "\n", "echo \"\"\n", "head \"${analysis_dir}/${lncRNA_ids}\"" ] }, { "cell_type": "markdown", "id": "8f63acaf-b014-48a8-8d3e-510434521657", "metadata": {}, "source": [ "## Extract lncRNAs from GTF" ] }, { "cell_type": "code", "execution_count": 16, "id": "7830d28b-b094-4b59-845c-98d18681711b", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13606 /home/sam/analyses/20230502-pgen-lncRNA-identification/20230502-pgen-lncRNA-IDs.gtf\n" ] } ], "source": [ "%%bash\n", "grep --fixed-strings --file=\"${analysis_dir}/${lncRNA_ids}\" \"${analysis_dir}/${lncRNA_candidates}\" \\\n", "> \"${analysis_dir}/${lncRNAs_gtf}\"\n", "\n", "wc -l \"${analysis_dir}/${lncRNAs_gtf}\"" ] }, { "cell_type": "markdown", "id": "ebe8b657-fdd1-4514-941a-d0550ad3c7fd", "metadata": {}, "source": [ "### Inspect lncRNA GTF" ] }, { "cell_type": "code", "execution_count": 17, "id": "d61e2288-ba34-4df5-8101-e1ace6e3f230", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scaffold_01\tStringTie\ttranscript\t656906\t657583\t.\t+\t.\ttranscript_id \"MSTRG.38.5\"; gene_id \"MSTRG.38\"; xloc \"XLOC_000013\"; class_code \"u\"; tss_id \"TSS19\";\n", "Scaffold_01\tStringTie\ttranscript\t648204\t649326\t.\t+\t.\ttranscript_id \"MSTRG.39.1\"; gene_id \"MSTRG.39\"; xloc \"XLOC_000014\"; class_code \"u\"; tss_id \"TSS20\";\n", "Scaffold_01\tStringTie\ttranscript\t849165\t854552\t.\t+\t.\ttranscript_id \"MSTRG.63.1\"; gene_id \"MSTRG.63\"; xloc \"XLOC_000019\"; class_code \"u\"; tss_id \"TSS25\";\n", "Scaffold_01\tStringTie\ttranscript\t852049\t854552\t.\t+\t.\ttranscript_id \"MSTRG.63.2\"; gene_id \"MSTRG.63\"; xloc \"XLOC_000019\"; class_code \"u\"; tss_id \"TSS26\";\n", "Scaffold_01\tStringTie\ttranscript\t862415\t867481\t.\t+\t.\ttranscript_id \"MSTRG.66.1\"; gene_id \"MSTRG.66\"; xloc \"XLOC_000020\"; class_code \"u\"; tss_id \"TSS27\";\n", "Scaffold_01\tStringTie\ttranscript\t1824775\t1828291\t.\t+\t.\ttranscript_id \"MSTRG.109.1\"; gene_id \"MSTRG.109\"; xloc \"XLOC_000040\"; class_code \"u\"; tss_id \"TSS56\";\n", "Scaffold_01\tStringTie\ttranscript\t1966694\t1970033\t.\t+\t.\ttranscript_id \"MSTRG.121.2\"; gene_id \"MSTRG.121\"; xloc \"XLOC_000044\"; class_code \"u\"; tss_id \"TSS62\";\n", "Scaffold_01\tStringTie\ttranscript\t1966694\t1970033\t.\t+\t.\ttranscript_id \"MSTRG.121.1\"; gene_id \"MSTRG.121\"; xloc \"XLOC_000044\"; class_code \"u\"; tss_id \"TSS62\";\n", "Scaffold_01\tStringTie\ttranscript\t2318317\t2328097\t.\t+\t.\ttranscript_id \"MSTRG.137.3\"; gene_id \"MSTRG.137\"; xloc \"XLOC_000053\"; class_code \"u\"; tss_id \"TSS72\";\n", "Scaffold_01\tStringTie\ttranscript\t2843989\t2844204\t.\t+\t.\ttranscript_id \"MSTRG.169.1\"; gene_id \"MSTRG.169\"; xloc \"XLOC_000070\"; class_code \"u\"; tss_id \"TSS92\";\n" ] } ], "source": [ "%%bash\n", "\n", "head \"${analysis_dir}/${lncRNAs_gtf}\"" ] }, { "cell_type": "markdown", "id": "26445ae7-6472-4814-bde2-33c646ffcf22", "metadata": { "tags": [] }, "source": [ "### Generate checksum(s)" ] }, { "cell_type": "code", "execution_count": 18, "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9adb7efc18fe1bfedcad24c86da1161f 20230502-pgen-lncRNA-IDs.gtf\n", "4978cedf268a87715a9b9b41d94461e3 cpc2_output_table.txt\n", "59c3bb6819262856eb1a154de115f172 lncRNA_candidates.bed\n", "712c72cb22be748babc44bff9fc5704a lncRNA_candidates.fasta\n", "a01efb7f2322802da94d4c038712230a lncRNA_candidates.gtf\n", "6bd15d3625538c2c1d47f7219deddeed lncRNA-IDs.txt\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "for file in *\n", "do\n", " md5sum \"${file}\" | tee --append checksums.md5\n", "done" ] }, { "cell_type": "markdown", "id": "3eed5b68-470e-4b9a-af97-9ba2f27a51b5", "metadata": {}, "source": [ "### Document GffRead program options" ] }, { "cell_type": "code", "execution_count": 19, "id": "61be6360-3e83-4cd9-a1db-4554995b8771", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: CPC2.py [options] -i input.fasta -o output_file\n", "\n", "Contact: Kang Yujian \n", "\n", "Options:\n", " --version show program's version number and exit\n", " -h, --help show this help message and exit\n", "\n", " Common Options:\n", " -i FILE input sequence in fasta format [Required]\n", " -o FILE output file [Default: cpc2output.txt]\n", " -r also check the reverse strand [Default: FALSE]\n", " --ORF output the start position of longest ORF [Default: FALSE]\n" ] } ], "source": [ "%%bash\n", "${cpc2} -h" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.3" } }, "nbformat": 4, "nbformat_minor": 5 }