## Use MEGAN6 taxonomic read classifications to extract day- and treatment-specific FastQs for the phyla Arthropoda and Alveolata

In [1]:
%%bash
echo "TODAY'S DATE:"
date
echo "------------"
echo ""
#Display operating system info
lsb_release -a
echo ""
echo "------------"
echo "HOSTNAME: "; hostname 
echo ""
echo "------------"
echo "Computer Specs:"
echo ""
lscpu
echo ""
echo "------------"
echo ""
echo "Memory Specs"
echo ""
free -mh

TODAY'S DATE:
Sun Apr 19 13:42:23 PDT 2020
------------

Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.6 LTS
Release:	16.04
Codename:	xenial

------------
HOSTNAME: 
swoose

------------
Computer Specs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Model name:            Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz
Stepping:              2
CPU MHz:               2925.931
BogoMIPS:              5851.96
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr

No LSB modules are available.


### Set variables

In [1]:
# Set data directories
%env crab_data=/home/sam/data/C_bairdi/RNAseq
%env hemat_data=/home/sam/data/Hematodinium/RNAseq
%env wd=/home/sam/analyses
%env line=------------------------------------------------------------------------

# Programs
%env seqtk=/home/sam/programs/seqtk-1.3/seqtk

env: crab_data=/home/sam/data/C_bairdi/RNAseq
env: hemat_data=/home/sam/data/Hematodinium/RNAseq
env: wd=/home/sam/analyses
env: line=------------------------------------------------------------------------
env: seqtk=/home/sam/programs/seqtk-1.3/seqtk
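Since every later cell depends on these variables, a quick check that they are set and point at real directories can catch typos early. The following is only a sketch: the helper name `check_dir_var` is hypothetical and not part of the original notebook, and it assumes it runs in a `%%bash` cell that inherits the `%env` variables above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed only when the named variable is set
# and its value is an existing directory.
check_dir_var() {
  local var_name=$1
  # Indirect expansion: look up the value of the variable whose name was passed in
  local value=${!var_name:-}

  if [[ -z ${value} ]]; then
    echo "ERROR: ${var_name} is not set" >&2
    return 1
  fi

  if [[ ! -d ${value} ]]; then
    echo "ERROR: ${var_name}=${value} is not a directory" >&2
    return 1
  fi
}

# Report on each directory variable without aborting the cell
for var_name in crab_data hemat_data wd
do
  if check_dir_var "${var_name}"; then
    echo "OK: ${var_name}"
  fi
done
```

The loop only reports problems. To stop the cell on the first failure instead, call `check_dir_var "${var_name}" || exit 1`.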


#### Input data are here:

FastAs: https://gannet.fish.washington.edu/Atumefaciens/20200419_cbai_MEGAN_read_extractions/

Trimmed-FastQs: https://gannet.fish.washington.edu/Atumefaciens/20200414_cbai_RNAseq_fastp_trimming/

### Extract phylum-specific reads from trimmed FastQ files


Use the read IDs in the MEGAN6 taxonomic read extraction FastAs to pull the corresponding reads for each phylum (Arthropoda and Alveolata) from the trimmed FastQs. This is necessary because MEGAN6 truncates each read ID at the first space, which strips the paired-read information. As a result, each MEGAN6 read extraction FastA contains pairs of reads with identical headers. Not sure if this will cause any downstream issues where paired-end data is used (e.g. with Trinity), so playing it safe and using the truncated IDs to pull FastQ records with complete sequence headers for use in subsequent data wrangling.
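To illustrate the header issue, here is a minimal sketch using a made-up read pair. The read name `READ_0001`, the Illumina-style comment fields, and the temporary file are assumptions for demonstration only, not records from these data sets. Truncating each header at the first space leaves two identical IDs, so a de-duplicated ID list is enough to select both mates from the original R1 and R2 FastQs.

```shell
#!/usr/bin/env bash

tmp_dir=$(mktemp -d)

# Hypothetical paired FastQ headers: the mates differ only after the first space
printf '%s\n' \
  "@READ_0001 1:N:0:ACGT" \
  "@READ_0001 2:N:0:ACGT" \
  > "${tmp_dir}/headers.txt"

# Mimic truncation at the first space, and drop the leading "@"
truncated_ids=$(awk '{sub(/^@/, ""); print $1}' "${tmp_dir}/headers.txt")

# Count IDs that occur more than once after truncation
dup_count=$(( $(printf '%s\n' "${truncated_ids}" | sort | uniq -d | wc -l) ))

# De-duplicated list, one ID per line (the form a seqtk subseq name list uses)
unique_ids=$(printf '%s\n' "${truncated_ids}" | sort -u)

echo "IDs after truncation:"
echo "${truncated_ids}"
echo ""
echo "Number of duplicated IDs: ${dup_count}"
echo "Unique IDs for the extraction list:"
echo "${unique_ids}"

# Clean up
rm -rf "${tmp_dir}"
```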

In [2]:
%%bash
# Capture start "time"
# Uses builtin bash variable called ${SECONDS}
start=${SECONDS}

# Set timestamp
timestamp=20200419

# Create associative arrays
# NOTE: These require Bash >= 4.0

## Infection status
declare -A inf_status_array
## Sampling day
declare -A sample_day_array
## Sample temperature array
declare -A sample_temp_array

# Create sample list
{
echo "380820_D9_uninfected_ambient"
echo "380821_D9_infected_ambient"
echo "380822_D12_uninfected_cold"
echo "380823_D12_infected_cold"
echo "380824_D12_uninfected_warm"
echo "380825_D12_infected_warm"
} >> crab-sample-list.txt


# Populate arrays
# Uses underscore as internal field separator (IFS)
# Reads each field in as a variable name (e.g. sample)
while IFS="_" read -r sample day infection temp
do
  inf_status_array[$sample]=$infection
  sample_day_array[$sample]=$day
  sample_temp_array[$sample]=$temp
done < crab-sample-list.txt

# Remove the sample list file
rm crab-sample-list.txt

# Check the arrays

echo "Printing array values:"
echo ""

for key in "${!inf_status_array[@]}"
do
  printf "[%s]=%s,%s,%s\n" \
  "$key" \
  "${inf_status_array[$key]}" \
  "${sample_day_array[$key]}" \
  "${sample_temp_array[$key]}"
done

echo ""
echo "${line}"
echo ""

for directory in ${crab_data} ${hemat_data}
do
	# Get species name
	species=$(echo "${directory}" | awk -F"/" '{print $5}')
    
    # Make new directory and change to that directory ("$_" expands to the last argument of the previous command)
    mkdir --parents "${wd}"/"${timestamp}"_"${species}"_megan_reads \
    && cd "$_" || exit

	# Set seqtk list filename
	seqtk_list=${timestamp}.${species}.seqtk.read_id.list

	# Set output FastQ filenames
    prefix=${timestamp}.${species}
    R1_suffix=megan_R1.fq
    R2_suffix=megan_R2.fq

	######################################################
	# Create FastA IDs list to use for sequence extraction
	######################################################
    for fasta in "${directory}"/*.fasta
	do
      echo "${fasta}" >> fasta-list.txt
    done
    
    for fasta in "${directory}"/*.fasta
	do
      grep ">" "${fasta}" | awk 'sub(/^>/, "")'
	done | sort -u >> "${seqtk_list}"
    
    
    echo ""
    echo "Finished with FastA ID extraction."
    echo ""
    echo "Moving on to read extractions..." 
    echo ""
    echo ""
    
    
    ######################################################
	# Extract corresponding R1 and R2 reads using seqtk FastA ID list
    ######################################################
	for fastq in "${directory}"/*R1*.gz
    do
      # Strip path from filename
      fastq_nopath=${fastq##*/}
                     
      # Get sample ID from FastQ filename
      sample=$(echo "${fastq_nopath}" | awk -F "_" '{print $1}')

      # Ignore sample 304428 - it's a general pool of various sample types
      if [ "${sample}" != "304428" ]
      then

        # Pull infection status, sample day and temp from associative arrays
        inf_status=${inf_status_array[$sample]}
        sample_day=${sample_day_array[$sample]}
        temp=${sample_temp_array[$sample]}
          
        # Set output filename
        ## Does not set temp value in filename when the temp value is empty in array
        if [[ ${sample_temp_array[$sample]} ]]; then
          R1_out="${prefix}.${sample}.${sample_day}.${inf_status}.${temp}.${R1_suffix}"
        else
          R1_out="${prefix}.${sample}.${sample_day}.${inf_status}.${R1_suffix}"
        fi
        
   
        echo "Extracting R1 reads from ${fastq}."
        echo ""
        echo "Writing R1 reads to ${R1_out}"
        echo ""
        echo ""
      
        # Use seqtk to pull out desired FastQ reads
        ${seqtk} subseq "${fastq}" "${seqtk_list}" >> "${R1_out}"
      fi
    done
    
    echo ""
    echo "Done with R1 read extractions"
    echo ""
    echo "${line}"
    echo ""

	for fastq in "${directory}"/*R2*.gz
	do
       # Strip path from filename
      fastq_nopath=${fastq##*/}
                     
      # Get sample ID from FastQ filename
      sample=$(echo "${fastq_nopath}" | awk -F "_" '{print $1}')

	  # Ignore sample 304428 - it's a general pool of various sample types
      if [ "${sample}" != "304428" ]
      then
			
        # Pull infection status, sample day and temp from associative arrays
        inf_status=${inf_status_array[$sample]}
        sample_day=${sample_day_array[$sample]}
        temp=${sample_temp_array[$sample]}
          
        # Set output filename
        ## Does not set temp value in filename when the temp value is empty in array
        if [[ ${sample_temp_array[$sample]} ]]; then
          R2_out="${prefix}.${sample}.${sample_day}.${inf_status}.${temp}.${R2_suffix}"
        else
          R2_out="${prefix}.${sample}.${sample_day}.${inf_status}.${R2_suffix}"
        fi
   
        echo "Extracting R2 reads from ${fastq}."
        echo ""
        echo "Writing R2 reads to ${R2_out}"
        echo ""
        echo ""
                     
        # Use seqtk to pull out desired FastQ reads
        ${seqtk} subseq "${fastq}" "${seqtk_list}" >> "${R2_out}"
      fi
	done
    
    echo "${line}"
    echo ""
    # Print working directory and list files
    pwd
	ls -lh
    echo ""
    echo "${line}"
    echo ""
done

# Capture end "time"
end=${SECONDS}

# Calculate runtime
runtime=$((end-start))

# Print runtime, in seconds
echo ""
echo "${line}"
echo ""
echo "Total runtime was: ${runtime} seconds"

Printing array values:

[380823]=infected,D12,cold
[380822]=uninfected,D12,cold
[380821]=infected,D9,ambient
[380820]=uninfected,D9,ambient
[380825]=infected,D12,warm
[380824]=uninfected,D12,warm

------------------------------------------------------------------------


Finished with FastA ID extraction.

Moving on to read extractions...


Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380820_S1_L001_R1_001.fastp-trim.202004143431.fq.gz.

Writing R1 reads to 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R1.fq


Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380820_S1_L002_R1_001.fastp-trim.202004143700.fq.gz.

Writing R1 reads to 20200419.C_bairdi.380820.D9.uninfected.ambient.megan_R1.fq


Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380821_S2_L001_R1_001.fastp-trim.202004143925.fq.gz.

Writing R1 reads to 20200419.C_bairdi.380821.D9.infected.ambient.megan_R1.fq


Extracting R1 reads from /home/sam/data/C_bairdi/RNAseq/380821_S2_L002_R1_001.fastp-tr