{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Use taxonomic read classifications from MEGAN6 to extract day and treatment taxa-specific NanoPore FastQ reads" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Tue 13 Oct 2020 09:49:02 AM PDT\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 20.04.1 LTS\n", "Release:\t20.04\n", "Codename:\tfocal\n", "\n", "------------\n", "HOSTNAME: \n", "mephisto\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Byte Order: Little Endian\n", "Address sizes: 36 bits physical, 48 bits virtual\n", "CPU(s): 4\n", "On-line CPU(s) list: 0-3\n", "Thread(s) per core: 2\n", "Core(s) per socket: 2\n", "Socket(s): 1\n", "NUMA node(s): 1\n", "Vendor ID: GenuineIntel\n", "CPU family: 6\n", "Model: 58\n", "Model name: Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz\n", "Stepping: 9\n", "CPU MHz: 2917.625\n", "CPU max MHz: 3000.0000\n", "CPU min MHz: 800.0000\n", "BogoMIPS: 4789.55\n", "Virtualization: VT-x\n", "L1d cache: 64 KiB\n", "L1i cache: 64 KiB\n", "L2 cache: 512 KiB\n", "L3 cache: 4 MiB\n", "NUMA node0 CPU(s): 0-3\n", "Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages\n", "Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable\n", "Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable\n", "Vulnerability Meltdown: Mitigation; PTI\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling\n", "Vulnerability Srbds: Vulnerable: No microcode\n", "Vulnerability Tsx async abort: Not affected\n", 
"Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 7.5Gi 4.6Gi 665Mi 1.1Gi 2.3Gi 1.6Gi\n", "Swap: 21Gi 1.8Gi 20Gi\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set variables" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: muscle_fasta_dir=/home/samb/analyses/20201007_cbai_megan-read-extractions_201002558-2729-Q7\n", "env: hemo_fasta_dir=/home/samb/analyses/20201007_cbai_megan-read-extractions_6129-403-26-Q7\n", "env: fastq_dir=/home/samb/data/C_bairdi/DNAseq\n", "env: wd=/home/samb/analyses\n", "env: seqtk=/home/samb/programs/seqtk_1.3-r115/seqtk\n", "env: suffix=megan.fq\n" ] } ], "source": [ "# Set data directories\n", "%env muscle_fasta_dir=/home/samb/analyses/20201007_cbai_megan-read-extractions_201002558-2729-Q7\n", 
"%env hemo_fasta_dir=/home/samb/analyses/20201007_cbai_megan-read-extractions_6129-403-26-Q7\n", "%env fastq_dir=/home/samb/data/C_bairdi/DNAseq\n", "%env wd=/home/samb/analyses\n", "\n", "# Programs\n", "%env seqtk=/home/samb/programs/seqtk_1.3-r115/seqtk\n", "\n", "# File naming\n", "%env suffix=megan.fq\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Input data are here:\n", "\n", "FastAs: \n", "\n", "- https://gannet.fish.washington.edu/Atumefaciens/20201007_cbai_megan-read-extractions_201002558-2729-Q7/\n", "\n", "- https://gannet.fish.washington.edu/Atumefaciens/20201007_cbai_megan-read-extractions_6129-403-26-Q7/\n", "\n", "FastQs:\n", "\n", "- https://gannet.fish.washington.edu/Atumefaciens/20200928_cbai_nanofilt_Q7_20102558-2729_nanopore-data/20200928_cbai_nanopore_20102558-2729_quality-7.fastq\n", "\n", "- https://gannet.fish.washington.edu/Atumefaciens/20200928_cbai_nanofilt_Q7_6129_403_26_nanopore-data/20200928_cbai_nanopore_6129_403_26_quality-7.fastq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract taxa-specific reads from FastQ files\n", "\n", "Use FastA IDs from MEGAN6 taxonomic read extraction FastAs to pull out appropriate reads from each taxa." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_201002558-2729-Q7/201002558-2729-Q7_summarized-reads-Aquifex_sp..fasta.fai\n", "\n", "Extracting reads from .\n", "\n", "Writing reads to 20201013_201002558-2729-Q7_Aquifex_sp_megan.fq\n", "\n", "\n", "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_201002558-2729-Q7/201002558-2729-Q7_summarized-reads-Arthropoda.fasta.fai\n", "\n", "Extracting reads from /home/samb/data/C_bairdi/DNAseq/20200928_cbai_nanopore_6129_403_26_quality-7.fastq.\n", "\n", "Writing reads to 20201013_201002558-2729-Q7_Arthropoda_megan.fq\n", "\n", "\n", "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_201002558-2729-Q7/201002558-2729-Q7_summarized-reads-Enterospora_canceri.fasta.fai\n", "\n", "Extracting reads from /home/samb/data/C_bairdi/DNAseq/20200928_cbai_nanopore_6129_403_26_quality-7.fastq.\n", "\n", "Writing reads to 20201013_201002558-2729-Q7_Enterospora_canceri_megan.fq\n", "\n", "\n", "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_201002558-2729-Q7/201002558-2729-Q7_summarized-reads-Sar.fasta.fai\n", "\n", "Extracting reads from /home/samb/data/C_bairdi/DNAseq/20200928_cbai_nanopore_6129_403_26_quality-7.fastq.\n", "\n", "Writing reads to 20201013_201002558-2729-Q7_Sar_megan.fq\n", "\n", "\n", "\n", "Done with read extractions\n", "\n", "-------------------------------------\n", "\n", "/home/samb/analyses/20201013_201002558-2729-Q7_megan-reads\n", "total 13M\n", "-rw-rw-r-- 1 samb samb 11K Oct 13 10:23 20201013_201002558-2729-Q7_Aquifex_sp_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb samb 914K Oct 13 10:23 20201013_201002558-2729-Q7_Aquifex_sp_megan.fq\n", "-rw-rw-r-- 1 samb samb 67K Oct 13 10:23 20201013_201002558-2729-Q7_Arthropoda_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb 
samb 6.8M Oct 13 10:23 20201013_201002558-2729-Q7_Arthropoda_megan.fq\n", "-rw-rw-r-- 1 samb samb 57K Oct 13 10:23 20201013_201002558-2729-Q7_Enterospora_canceri_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb samb 4.9M Oct 13 10:23 20201013_201002558-2729-Q7_Enterospora_canceri_megan.fq\n", "-rw-rw-r-- 1 samb samb 222 Oct 13 10:23 20201013_201002558-2729-Q7_Sar_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb samb 30K Oct 13 10:23 20201013_201002558-2729-Q7_Sar_megan.fq\n", "\n", "-------------------------------------\n", "\n", "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_6129-403-26-Q7/6129-403-26-Q7_summarized-reads-Alveolata.fasta.fai\n", "\n", "Extracting reads from /home/samb/data/C_bairdi/DNAseq/20200928_cbai_nanopore_6129_403_26_quality-7.fastq.\n", "\n", "Writing reads to 20201013_6129-403-26-Q7_Alveolata_megan.fq\n", "\n", "\n", "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_6129-403-26-Q7/6129-403-26-Q7_summarized-reads-Aquifex_sp..fasta.fai\n", "\n", "Extracting reads from /home/samb/data/C_bairdi/DNAseq/20200928_cbai_nanopore_6129_403_26_quality-7.fastq.\n", "\n", "Writing reads to 20201013_6129-403-26-Q7_Aquifex_sp_megan.fq\n", "\n", "\n", "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_6129-403-26-Q7/6129-403-26-Q7_summarized-reads-Arthropoda.fasta.fai\n", "\n", "Extracting reads from /home/samb/data/C_bairdi/DNAseq/20200928_cbai_nanopore_6129_403_26_quality-7.fastq.\n", "\n", "Writing reads to 20201013_6129-403-26-Q7_Arthropoda_megan.fq\n", "\n", "\n", "Pulling FastA IDs from /home/samb/analyses/20201007_cbai_megan-read-extractions_6129-403-26-Q7/6129-403-26-Q7_summarized-reads-Enterospora_canceri.fasta.fai\n", "\n", "Extracting reads from /home/samb/data/C_bairdi/DNAseq/20200928_cbai_nanopore_6129_403_26_quality-7.fastq.\n", "\n", "Writing reads to 20201013_6129-403-26-Q7_Enterospora_canceri_megan.fq\n", "\n", "\n", "\n", "Done with read extractions\n", "\n", 
"-------------------------------------\n", "\n", "/home/samb/analyses/20201013_6129-403-26-Q7_megan-reads\n", "total 519M\n", "-rw-rw-r-- 1 samb samb 17K Oct 13 10:23 20201013_6129-403-26-Q7_Alveolata_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb samb 3.5M Oct 13 10:24 20201013_6129-403-26-Q7_Alveolata_megan.fq\n", "-rw-rw-r-- 1 samb samb 152K Oct 13 10:24 20201013_6129-403-26-Q7_Aquifex_sp_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb samb 41M Oct 13 10:24 20201013_6129-403-26-Q7_Aquifex_sp_megan.fq\n", "-rw-rw-r-- 1 samb samb 1.1M Oct 13 10:24 20201013_6129-403-26-Q7_Arthropoda_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb samb 311M Oct 13 10:24 20201013_6129-403-26-Q7_Arthropoda_megan.fq\n", "-rw-rw-r-- 1 samb samb 655K Oct 13 10:24 20201013_6129-403-26-Q7_Enterospora_canceri_seqtk-read-id-list\n", "-rw-rw-r-- 1 samb samb 162M Oct 13 10:24 20201013_6129-403-26-Q7_Enterospora_canceri_megan.fq\n", "\n", "-------------------------------------\n", "\n" ] } ], "source": [ "%%bash\n", "\n", "timestamp=$(date +%Y%m%d)\n", "\n", "\n", "for directory in ${muscle_fasta_dir} ${hemo_fasta_dir}\n", "do\n", "\t# Get sample name\n", "\tsample=$(echo \"${directory}\" | cut -d \"_\" -f 4)\n", " \n", " # Make new directory and change to that directory (\"$_\" means use previous command's argument)\n", " mkdir --parents \"${wd}\"/\"${timestamp}\"_\"${sample}\"_megan-reads \\\n", " && cd \"$_\" || exit\n", "\n", "\n", "\t######################################################\n", "\t# Create FastA IDs list to use for sequence extraction\n", "\t######################################################\n", "\tfor fai in \"${directory}\"/*.fai\n", "\tdo\n", " # Get species\n", " if [[ \"${sample}\" = \"201002558-2729-Q7\" ]]; then\n", " species=$(echo \"${fai##*/}\" | awk -F [.-] '{print $5}')\n", " else\n", " species=$(echo \"${fai##*/}\" | awk -F [.-] '{print $6}')\n", " fi\n", " \n", " # Set output FastQ filenames\n", " prefix=${timestamp}_${sample}_${species}\n", "\n", "\t # Set seqtk list 
filename\n", "\t seqtk_list=${prefix}_seqtk-read-id-list\n", " \n", " echo \"Pulling FastA IDs from ${fai}\"\n", " echo \"\"\n", " \n", " # Parse FastA IDs from FastA index file\n", " awk '{print $1}' \"${fai}\" | sort -u >> \"${seqtk_list}\"\n", " \n", " \n", " # Set output FastQ filename\n", " out=\"${prefix}_${suffix}\"\n", " \n", " for fastq in \"${fastq_dir}\"/*.fastq\n", " do\n", " # Report the FastQ being processed (fastq is only defined inside this loop)\n", " echo \"Extracting reads from ${fastq}.\"\n", " echo \"\"\n", " \n", " # Extract corresponding reads using seqtk FastA ID list\n", " \t \"${seqtk}\" subseq \"${fastq}\" \"${seqtk_list}\" >> \"${out}\"\n", " done\n", " \n", " echo \"Writing reads to ${out}\"\n", " echo \"\"\n", " echo \"\"\n", " \n", "\tdone\n", "\n", " \n", " echo \"\"\n", " echo \"Done with read extractions\"\n", " echo \"\"\n", " echo \"-------------------------------------\"\n", " echo \"\"\n", "\n", " # Print working directory and list files\n", " pwd\n", "\tls -ltrh\n", " echo \"\"\n", " echo \"-------------------------------------\"\n", " echo \"\"\n", "done\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }