#!/bin/bash usage(){ echo " Written by Brian Bushnell Last modified September 4, 2018 Description: Error corrects reads and/or filters by depth, storing kmer counts in a count-min sketch (a Bloom filter variant). This uses a fixed amount of memory. The error-correction algorithm is taken from Tadpole; with plenty of memory, the behavior is almost identical to Tadpole. As the number of unique kmers in a dataset increases, the accuracy decreases such that it will make fewer corrections. It is still capable of making useful corrections far past the point where Tadpole would crash by running out of memory, even with the prefilter flag. But if there is sufficient memory to use Tadpole, then Tadpole is more desirable. Because accuracy declines with an increasing number of unique kmers, it can be useful with very large datasets to run this in 2 passes, with the first pass for filtering only using a 2-bit filter with the flags tossjunk=t and ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a 4-bit filter for the actual error correction. Usage: bbcms.sh in= out= outb= Example of use in error correction: bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge Example of use in depth filtering: bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4 Error correction and depth filtering can be done simultaneously. File parameters: in= Primary input, or read 1 input. in2= Read 2 input if reads are in two files. out= Primary read output. out2= Read 2 output if reads are in two files. outb= (outbad/outlow) Output for reads failing mincount. outb2= (outbad2/outlow2) Read 2 output if reads are in two files. extra= Additional comma-delimited files for generating kmer counts. ref= If ref is set, then only files in the ref list will be used for kmer counts, and the input files will NOT be used for counts; they will just be filtered or corrected. overwrite=t (ow) Set to false to force the program to abort rather than overwrite an existing file. Hashing parameters: k=31 Kmer length, currently 1-31. hashes=3 Number of hashes per kmer. Higher generally reduces false positives at the expense of speed; rapidly diminishing returns above 4. ksmall= Optional sub-kmer length; setting to slightly lower than k can improve memory efficiency by reducing the number of hashes needed. e.g. 'k=31 ksmall=29 hashes=2' has better speed and accuracy than 'k=31 hashes=3' when the filter is very full. minprob=0.5 Ignore kmers with probability of being correct below this. memmult=1.0 Fraction of free memory to use for Bloom filter. 1.0 should generally work; if the program crashes with an out of memory error, set this lower. You may be able to increase accuracy by setting it slightly higher. cells= Option to set the number of cells manually. By default this will be autoset to use all available memory. The only reason to set this is to ensure deterministic output. seed=0 This will change the hash function used. Useful if running iteratively with a very full table. -1 uses a random seed. Depth filtering parameters: mincount=0 If positive, reads with kmer counts below mincount will be discarded (sent to outb). hcf=1.0 (highcountfraction) Fraction of kmers that must be at least mincount to pass. requireboth=t Require both reads in a pair to pass in order to go to out. When true, if either read has a count below mincount, both reads in the pair will go to outb. When false, reads will only go to outb if both fail. tossjunk=f Remove reads or pairs with outermost kmer depth below 2. (Suggested params for huge metagenomes: mincount=2 hcf=0.4 tossjunk=t) Error correction parameters: ecc=t Perform error correction. bits= Bits used to store kmer counts; max count is 2^bits-1. Supports 2, 4, 8, 16, or 32. 16 is best for high-depth data; 2 or 4 are for huge, low-depth metagenomes that saturate the bloom filter otherwise. Generally 4 bits is recommended for error-correction and 2 bits is recommended for filtering only. ecco=f Error-correct paired reads by overlap prior to kmer-counting. merge=t Merge paired reads by overlap prior to kmer-counting, and again prior to correction. Output will still be unmerged. smooth=3 Remove spikes from kmer counts due to hash collisions. The number is the max width of peaks to be smoothed; range is 0-3 (3 is most aggressive; 0 disables smoothing). This also affects tossjunk. Java Parameters: -Xmx This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory. -eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+. -da Disable assertions. Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems. " } #This block allows symlinked shellscripts to correctly set classpath. pushd . > /dev/null DIR="${BASH_SOURCE[0]}" while [ -h "$DIR" ]; do cd "$(dirname "$DIR")" DIR="$(readlink "$(basename "$DIR")")" done cd "$(dirname "$DIR")" DIR="$(pwd)/" popd > /dev/null #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/" CP="$DIR""current/" z="-Xmx4g" z2="-Xms4g" set=0 if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then usage exit fi calcXmx () { source "$DIR""/calcmem.sh" setEnvironment parseXmx "$@" if [[ $set == 1 ]]; then return fi freeRam 4000m 84 z="-Xmx${RAM}m" z2="-Xms${RAM}m" } calcXmx "$@" bloomfilter() { local CMD="java $EA $EOOM $z $z2 -cp $CP bloom.BloomFilterCorrectorWrapper $@" echo $CMD >&2 eval $CMD } bloomfilter "$@"