{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Use taxonomic read classifications from MEGAN6 to extract phylum-specific (Arthropoda and Alveolata) FastQs for each sampling day and treatment" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Sun Apr 19 13:42:23 PDT 2020\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 16.04.6 LTS\n", "Release:\t16.04\n", "Codename:\txenial\n", "\n", "------------\n", "HOSTNAME: \n", "swoose\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Byte Order: Little Endian\n", "CPU(s): 24\n", "On-line CPU(s) list: 0-23\n", "Thread(s) per core: 2\n", "Core(s) per socket: 6\n", "Socket(s): 2\n", "NUMA node(s): 1\n", "Vendor ID: GenuineIntel\n", "CPU family: 6\n", "Model: 44\n", "Model name: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz\n", "Stepping: 2\n", "CPU MHz: 2925.931\n", "BogoMIPS: 5851.96\n", "Virtualization: VT-x\n", "L1d cache: 32K\n", "L1i cache: 32K\n", "L2 cache: 256K\n", "L3 cache: 12288K\n", "NUMA node0 CPU(s): 0-23\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid dtherm ida arat flush_l1d\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 70G 2.1G 2.3G 37M 66G 67G\n", "Swap: 4.7G 1.7G 2.9G\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo 
\"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set variables" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: crab_data=/home/sam/data/C_bairdi/RNAseq\n", "env: hemat_data=/home/sam/data/Hematodinium/RNAseq\n", "env: wd=/home/sam/analyses\n", "env: line=------------------------------------------------------------------------\n", "env: seqtk=/home/sam/programs/seqtk-1.3/seqtk\n" ] } ], "source": [ "# Set data directories\n", "%env crab_data=/home/sam/data/C_bairdi/RNAseq\n", "%env hemat_data=/home/sam/data/Hematodinium/RNAseq\n", "%env wd=/home/sam/analyses\n", "%env line=------------------------------------------------------------------------\n", "\n", "# Programs\n", "%env seqtk=/home/sam/programs/seqtk-1.3/seqtk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Input data are here:\n", "\n", "FastAs: https://gannet.fish.washington.edu/Atumefaciens/20200419_cbai_MEGAN_read_extractions/\n", "\n", "Trimmed-FastQs: https://gannet.fish.washington.edu/Atumefaciens/20200414_cbai_RNAseq_fastp_trimming/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract phylum-specific reads from trimmed FastQ files\n", "\n", "\n", "Use the read IDs in the FastAs from the MEGAN6 taxonomic read extractions to pull the corresponding reads for each phylum (Arthropoda and Alveolata) from the trimmed FastQs. This is done because MEGAN6 truncates read IDs at the first space, dropping the paired-read designation. As a result, the FastA files extracted with MEGAN6 contain pairs of reads with identical headers. 
It is unclear whether this causes downstream issues (e.g. with Trinity) where paired-end data are used, so to play it safe the truncated IDs are used to pull reads with complete sequence headers from the trimmed FastQs for use in subsequent data wrangling." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Printing array values:\n", "\n", "[380823]=infected,D12,cold\n", "[380822]=uninfected,D12,cold\n", "[380821]=infected,D9,ambient\n", "[380820]=uninfected,D9,ambient\n", "[380825]=infected,D12,warm\n", "[380824]=uninfected,D12,warm\n", "\n", "------------------------------------------------------------------------\n", "\n", "\n", "Finished with FastA ID extraction.\n", "\n", "Moving on to read extractions...\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380820_S1_L001_R1_001.fastp-trim.202004143431.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380820_S1_L002_R1_001.fastp-trim.202004143700.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380821_S2_L001_R1_001.fastp-trim.202004143925.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380821.D9.infected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380821_S2_L002_R1_001.fastp-trim.202004144145.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380821.D9.infected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380822_S3_L001_R1_001.fastp-trim.202004144409.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380822.D12.uninfected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380822_S3_L002_R1_001.fastp-trim.202004144633.fq.gz.\n", "\n", "Writing 
R1 reads to 20200419.C_bairdi.380822.D12.uninfected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380823_S4_L001_R1_001.fastp-trim.202004144852.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380823.D12.infected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380823_S4_L002_R1_001.fastp-trim.202004145106.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380823.D12.infected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380824_S5_L001_R1_001.fastp-trim.202004145320.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380824.D12.uninfected.warm.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380824_S5_L002_R1_001.fastp-trim.202004145558.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380824.D12.uninfected.warm.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380825_S6_L001_R1_001.fastp-trim.202004145835.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380825.D12.infected.warm.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380825_S6_L002_R1_001.fastp-trim.202004140109.fq.gz.\n", "\n", "Writing R1 reads to 20200419.C_bairdi.380825.D12.infected.warm.megan_R1.fq\n", "\n", "\n", "\n", "Done with R1 read extractions\n", "\n", "------------------------------------------------------------------------\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380820_S1_L001_R2_001.fastp-trim.202004143431.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380820_S1_L002_R2_001.fastp-trim.202004143700.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from 
/home/sam/data/C_bairdi/RNAseq/380821_S2_L001_R2_001.fastp-trim.202004143925.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380821.D9.infected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380821_S2_L002_R2_001.fastp-trim.202004144145.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380821.D9.infected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380822_S3_L001_R2_001.fastp-trim.202004144409.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380822.D12.uninfected.cold.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380822_S3_L002_R2_001.fastp-trim.202004144633.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380822.D12.uninfected.cold.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380823_S4_L001_R2_001.fastp-trim.202004144852.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380823.D12.infected.cold.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380823_S4_L002_R2_001.fastp-trim.202004145106.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380823.D12.infected.cold.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380824_S5_L001_R2_001.fastp-trim.202004145320.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380824.D12.uninfected.warm.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380824_S5_L002_R2_001.fastp-trim.202004145558.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380824.D12.uninfected.warm.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/C_bairdi/RNAseq/380825_S6_L001_R2_001.fastp-trim.202004145835.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380825.D12.infected.warm.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from 
/home/sam/data/C_bairdi/RNAseq/380825_S6_L002_R2_001.fastp-trim.202004140109.fq.gz.\n", "\n", "Writing R2 reads to 20200419.C_bairdi.380825.D12.infected.warm.megan_R2.fq\n", "\n", "\n", "------------------------------------------------------------------------\n", "\n", "/home/sam/analyses/20200419_C_bairdi_megan_reads\n", "total 4.9G\n", "-rw-rw-r-- 1 sam sam 395M Jun 15 08:31 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 395M Jun 15 08:37 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 370M Jun 15 08:32 20200419.C_bairdi.380821.D9.infected.ambient.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 370M Jun 15 08:38 20200419.C_bairdi.380821.D9.infected.ambient.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 380M Jun 15 08:33 20200419.C_bairdi.380822.D12.uninfected.cold.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 380M Jun 15 08:39 20200419.C_bairdi.380822.D12.uninfected.cold.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 345M Jun 15 08:34 20200419.C_bairdi.380823.D12.infected.cold.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 345M Jun 15 08:40 20200419.C_bairdi.380823.D12.infected.cold.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 444M Jun 15 08:35 20200419.C_bairdi.380824.D12.uninfected.warm.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 444M Jun 15 08:41 20200419.C_bairdi.380824.D12.uninfected.warm.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 408M Jun 15 08:36 20200419.C_bairdi.380825.D12.infected.warm.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 408M Jun 15 08:42 20200419.C_bairdi.380825.D12.infected.warm.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 339M Jun 15 08:30 20200419.C_bairdi.seqtk.read_id.list\n", "-rw-rw-r-- 1 sam sam 450 Jun 15 08:27 fasta-list.txt\n", "\n", "------------------------------------------------------------------------\n", "\n", "\n", "Finished with FastA ID extraction.\n", "\n", "Moving on to read extractions...\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380820_S1_L001_R1_001.fastp-trim.202004143431.fq.gz.\n", 
"\n", "Writing R1 reads to 20200419.Hematodinium.380820.D9.uninfected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380820_S1_L002_R1_001.fastp-trim.202004143700.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380820.D9.uninfected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380821_S2_L001_R1_001.fastp-trim.202004143925.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380821.D9.infected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380821_S2_L002_R1_001.fastp-trim.202004144145.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380821.D9.infected.ambient.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380822_S3_L001_R1_001.fastp-trim.202004144409.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380822.D12.uninfected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380822_S3_L002_R1_001.fastp-trim.202004144633.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380822.D12.uninfected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380823_S4_L001_R1_001.fastp-trim.202004144852.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380823.D12.infected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380823_S4_L002_R1_001.fastp-trim.202004145106.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380823.D12.infected.cold.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380824_S5_L001_R1_001.fastp-trim.202004145320.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380824.D12.uninfected.warm.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380824_S5_L002_R1_001.fastp-trim.202004145558.fq.gz.\n", 
"\n", "Writing R1 reads to 20200419.Hematodinium.380824.D12.uninfected.warm.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380825_S6_L001_R1_001.fastp-trim.202004145835.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380825.D12.infected.warm.megan_R1.fq\n", "\n", "\n", "Extracting R1 reads from /home/sam/data/Hematodinium/RNAseq/380825_S6_L002_R1_001.fastp-trim.202004140109.fq.gz.\n", "\n", "Writing R1 reads to 20200419.Hematodinium.380825.D12.infected.warm.megan_R1.fq\n", "\n", "\n", "\n", "Done with R1 read extractions\n", "\n", "------------------------------------------------------------------------\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380820_S1_L001_R2_001.fastp-trim.202004143431.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380820.D9.uninfected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380820_S1_L002_R2_001.fastp-trim.202004143700.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380820.D9.uninfected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380821_S2_L001_R2_001.fastp-trim.202004143925.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380821.D9.infected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380821_S2_L002_R2_001.fastp-trim.202004144145.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380821.D9.infected.ambient.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380822_S3_L001_R2_001.fastp-trim.202004144409.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380822.D12.uninfected.cold.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380822_S3_L002_R2_001.fastp-trim.202004144633.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380822.D12.uninfected.cold.megan_R2.fq\n", 
"\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380823_S4_L001_R2_001.fastp-trim.202004144852.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380823.D12.infected.cold.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380823_S4_L002_R2_001.fastp-trim.202004145106.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380823.D12.infected.cold.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380824_S5_L001_R2_001.fastp-trim.202004145320.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380824.D12.uninfected.warm.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380824_S5_L002_R2_001.fastp-trim.202004145558.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380824.D12.uninfected.warm.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380825_S6_L001_R2_001.fastp-trim.202004145835.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380825.D12.infected.warm.megan_R2.fq\n", "\n", "\n", "Extracting R2 reads from /home/sam/data/Hematodinium/RNAseq/380825_S6_L002_R2_001.fastp-trim.202004140109.fq.gz.\n", "\n", "Writing R2 reads to 20200419.Hematodinium.380825.D12.infected.warm.megan_R2.fq\n", "\n", "\n", "------------------------------------------------------------------------\n", "\n", "/home/sam/analyses/20200419_Hematodinium_megan_reads\n", "total 40M\n", "-rw-rw-r-- 1 sam sam 0 Jun 15 08:42 20200419.Hematodinium.380820.D9.uninfected.ambient.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 0 Jun 15 08:45 20200419.Hematodinium.380820.D9.uninfected.ambient.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 2.5M Jun 15 08:43 20200419.Hematodinium.380821.D9.infected.ambient.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 2.5M Jun 15 08:46 20200419.Hematodinium.380821.D9.infected.ambient.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 0 Jun 15 08:43 
20200419.Hematodinium.380822.D12.uninfected.cold.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 0 Jun 15 08:46 20200419.Hematodinium.380822.D12.uninfected.cold.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 4.8M Jun 15 08:44 20200419.Hematodinium.380823.D12.infected.cold.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 4.8M Jun 15 08:47 20200419.Hematodinium.380823.D12.infected.cold.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 0 Jun 15 08:44 20200419.Hematodinium.380824.D12.uninfected.warm.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 0 Jun 15 08:47 20200419.Hematodinium.380824.D12.uninfected.warm.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 12M Jun 15 08:45 20200419.Hematodinium.380825.D12.infected.warm.megan_R1.fq\n", "-rw-rw-r-- 1 sam sam 12M Jun 15 08:48 20200419.Hematodinium.380825.D12.infected.warm.megan_R2.fq\n", "-rw-rw-r-- 1 sam sam 2.7M Jun 15 08:42 20200419.Hematodinium.seqtk.read_id.list\n", "-rw-rw-r-- 1 sam sam 234 Jun 15 08:42 fasta-list.txt\n", "\n", "------------------------------------------------------------------------\n", "\n", "\n", "------------------------------------------------------------------------\n", "\n", "Total runtime was: 1281 seconds\n" ] } ], "source": [ "%%bash\n", "# Capture start \"time\"\n", "# Uses builtin bash variable called ${SECONDS}\n", "start=${SECONDS}\n", "\n", "# Set timestamp\n", "timestamp=20200419\n", "\n", "# Create associative arrays\n", "# NOTE: These require Bash >=4.0\n", "\n", "## Infection status\n", "declare -A inf_status_array\n", "## Sampling day\n", "declare -A sample_day_array\n", "## Sample temperature array\n", "declare -A sample_temp_array\n", "\n", "# Create sample list\n", "{\n", "echo \"380820_D9_uninfected_ambient\"\n", "echo \"380821_D9_infected_ambient\"\n", "echo \"380822_D12_uninfected_cold\"\n", "echo \"380823_D12_infected_cold\"\n", "echo \"380824_D12_uninfected_warm\"\n", "echo \"380825_D12_infected_warm\"\n", "} >> crab-sample-list.txt\n", "\n", "\n", "# Populate arrays\n", "# Uses underscore as internal field separator (IFS)\n", 
"# Reads each field into its own variable (e.g. sample)\n", "while IFS=\"_\" read -r sample day infection temp\n", "do\n", " inf_status_array[$sample]=$infection\n", " sample_day_array[$sample]=$day\n", " sample_temp_array[$sample]=$temp\n", "done < crab-sample-list.txt\n", "\n", "# Remove the sample list file\n", "rm crab-sample-list.txt\n", "\n", "# Check the arrays\n", "\n", "echo \"Printing array values:\"\n", "echo \"\"\n", "\n", "for key in \"${!inf_status_array[@]}\"\n", "do\n", " printf \"[%s]=%s,%s,%s\\n\" \\\n", " \"$key\" \\\n", " \"${inf_status_array[$key]}\" \\\n", " \"${sample_day_array[$key]}\" \\\n", " \"${sample_temp_array[$key]}\"\n", "done\n", "\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "\n", "for directory in ${crab_data} ${hemat_data}\n", "do\n", "\t# Get species name\n", "\tspecies=$(echo \"${directory}\" | awk -F\"/\" '{print $5}')\n", " \n", " # Make new directory and change to that directory (\"$_\" expands to the last argument of the previous command)\n", " mkdir --parents \"${wd}\"/\"${timestamp}\"_\"${species}\"_megan_reads \\\n", " && cd \"$_\" || exit\n", "\n", "\t# Set seqtk list filename\n", "\tseqtk_list=${timestamp}.${species}.seqtk.read_id.list\n", "\n", "\t# Set output FastQ filenames\n", " prefix=${timestamp}.${species}\n", " R1_suffix=megan_R1.fq\n", " R2_suffix=megan_R2.fq\n", "\n", "\t######################################################\n", "\t# Create FastA IDs list to use for sequence extraction\n", "\t######################################################\n", " for fasta in \"${directory}\"/*.fasta\n", "\tdo\n", " echo \"${fasta}\" >> fasta-list.txt\n", " done\n", " \n", " for fasta in \"${directory}\"/*.fasta\n", "\tdo\n", " grep \">\" \"${fasta}\" | awk 'sub(/^>/, \"\")'\n", "\tdone | sort -u >> \"${seqtk_list}\"\n", " \n", " \n", " echo \"\"\n", " echo \"Finished with FastA ID extraction.\"\n", " echo \"\"\n", " echo \"Moving on to read extractions...\" \n", " echo \"\"\n", " echo \"\"\n", " \n", " \n", " 
######################################################\n", "\t# Extract corresponding R1 and R2 reads using seqtk FastA ID list\n", " ######################################################\n", "\tfor fastq in \"${directory}\"/*R1*.gz\n", " do\n", " # Strip path from filename\n", " fastq_nopath=${fastq##*/}\n", " \n", " # Get sample ID from FastQ filename\n", " sample=$(echo \"${fastq_nopath}\" | awk -F \"_\" '{print $1}')\n", "\n", " # Ignore sample 304428 - it's a general pool of various sample types\n", " if [ \"${sample}\" != \"304428\" ]\n", " then\n", "\n", " # Pull infection status, sample day and temp from associative arrays\n", " inf_status=${inf_status_array[$sample]}\n", " sample_day=${sample_day_array[$sample]}\n", " temp=${sample_temp_array[$sample]}\n", " \n", " # Set output filename\n", " ## Does not set temp value in filename when the temp value is empty in array\n", " if [[ ${sample_temp_array[$sample]} ]]; then\n", " R1_out=\"${prefix}.${sample}.${sample_day}.${inf_status}.${temp}.${R1_suffix}\"\n", " else\n", " R1_out=\"${prefix}.${sample}.${sample_day}.${inf_status}.${R1_suffix}\"\n", " fi\n", " \n", " \n", " echo \"Extracting R1 reads from ${fastq}.\"\n", " echo \"\"\n", " echo \"Writing R1 reads to ${R1_out}\"\n", " echo \"\"\n", " echo \"\"\n", " \n", " # Use seqtk to pull out desired FastQ reads\n", " \t ${seqtk} subseq \"${fastq}\" \"${seqtk_list}\" >> \"${R1_out}\"\n", " fi\n", " done\n", " \n", " echo \"\"\n", " echo \"Done with R1 read extractions\"\n", " echo \"\"\n", " echo \"${line}\"\n", " echo \"\"\n", "\n", "\tfor fastq in \"${directory}\"/*R2*.gz\n", "\tdo\n", " # Strip path from filename\n", " fastq_nopath=${fastq##*/}\n", " \n", " # Get sample ID from FastQ filename\n", " sample=$(echo \"${fastq_nopath}\" | awk -F \"_\" '{print $1}')\n", "\n", "\t # Ignore sample 304428 - it's a general pool of various sample types\n", " if [ \"${sample}\" != \"304428\" ]\n", " then\n", "\t\t\t\n", " # Pull infection status and sample day from 
associative array \n", " inf_status=${inf_status_array[$sample]}\n", " sample_day=${sample_day_array[$sample]}\n", " temp=${sample_temp_array[$sample]}\n", " \n", " # Set output filename\n", " ## Does not set temp value in filename when the temp value is empty in array\n", " if [[ ${sample_temp_array[$sample]} ]]; then\n", " R2_out=\"${prefix}.${sample}.${sample_day}.${inf_status}.${temp}.${R2_suffix}\"\n", " else\n", " R2_out=\"${prefix}.${sample}.${sample_day}.${inf_status}.${R2_suffix}\"\n", " fi\n", " \n", " echo \"Extracting R2 reads from ${fastq}.\"\n", " echo \"\"\n", " echo \"Writing R2 reads to ${R2_out}\"\n", " echo \"\"\n", " echo \"\"\n", " \n", " \t ${seqtk} subseq \"${fastq}\" \"${seqtk_list}\" >> \"${R2_out}\"\n", " fi\n", "\tdone\n", " \n", " echo \"${line}\"\n", " echo \"\"\n", " # Print working directory and list files\n", " pwd\n", "\tls -lh\n", " echo \"\"\n", " echo \"${line}\"\n", " echo \"\"\n", "done\n", "\n", "# Capture end \"time\"\n", "end=${SECONDS}\n", "\n", "# Calculate runtime\n", "runtime=$((end-start))\n", "\n", "# Print runtime, in seconds\n", "echo \"\"\n", "echo \"${line}\"\n", "echo \"\"\n", "echo \"Total runtime was: ${runtime} seconds\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }