Mayo GenomeGPS on iForge: 
User Documentation and Standard Operating Procedures


Table of Contents

.Pipeline architecture: 3 blocks and 6 run cases
Runfile options
## i/o
## choose the run case
## input data
## tools to be used
## preparatory block
## alignment block parameters
## realign/recalibrate block parameters
## variant calling block parameters
## other parameters - DO NOT EDIT
## paths to input output and tools - DO NOT EDIT
## pbs resources - DO NOT EDIT
Samplenames file format
Single bam input
Single fastq input
Multiple bam inputs
Multiple fastq inputs
Step-by-step instructions for setting up the runfile
Case 1: only alignment
Case 2: only realignment /recalibration
Case 3: only variant calling
Case 4: alignment + realignment /recalibration
Case 5: realignment /recalibration + variant calling
Case 6: entire pipeline = alignment + realignment /recalibration + variant calling.


Changelog 2.32

* samplenames are no longer listed in runfile
* epilogue is gone
* new convention on input fastq names: must be in form of samplename_read?.fq or samplename_read?.fastq
* runfile now must contain INPUTFORMAT flag to dostinguish between fastq and bam

Changelog 2.33

* alignfastq.sh -- changed names of all qsub variables to be more descriptive
* realign.sh -- changed names of all qsub variables to be more descriptive
* realign_new.sh -- changed names of all qsub variables to be more descriptive
* alignfastq.sh -- realign.sh is now submitted with a dependency on merge jobs
* realign.sh -- now runs 
* now in runfile the chromosomes are specified explicitly as given in the full
* genome fastq, not just the numbers; this way any species can be analyzed;
* made appropriate changes in realign_new.sh
* introduce the Anisimov switch: 
** want to run real/recal and variant calling via the Launcher. But,
realign_new still has to be a separate scheduled job, because there is a case
when it depends on sortbammayo and extract_reads_bam: when bams were obtained
from elsewhere and need to be resorted.
** introduced user-level variable RUNMETHOD ($run_method), with values:
*** ANISIMOV - means we use the anisimov launcher
*** QSUB - means we schedule each job as an individual qsub without aprun
*** APRUN - means we schedule each job as an individual qsub with aprun
*** SERVER - means we run the entire workflow on the same machine in series
* renamed user-define variable TYPE ($type) into INPUTTYPE ($input_type), because 
"type" seems to be some sort of a special variable (highlighted in the editor), 
even though it is not a reserved variable in bash...
* introduced variable "profiling" to choose whether to use memprof, cray Profiler or something else
* realign.sh calls realign_new.sh; in 2.32 it calls one realign_new per sample 
in the case of independent samples, and a single realign_new for the multisample case.
In either case, one of the input parameters is the directory where to look for
the aligned bams. I am rewriting the code so that a single realign_new is called in any
case, with the appropriate folder for aligned bams, depending on the case.
* moved creating SAMPLENAMES.list file into main.sh -- should not have to do this in every piece of code!!
* intended to run all realrecall in a single joblist per chromosome. But, multisample case 
makes it more sensible to have one joblist for sortnode, as it is independent of multisample
variable; then if multisample then we only use a single qsub for realrecalold, and a single one for vcall;
but for independent samples, we do one joblist for realrecal per chromosome, and one joblist
for vcall per chromosome
* for each qsub file, we now construct a command file. because we have extra flexibility of 
using qsub, aprun, launcher or serial; then the command gets modified and contents put into qsub or jobfile

#################################################
############ NEED TO RENAME SORTNODE AND REALIGN, REALIGN_NEW AND REALRECAL.OLD ################
also vcallgatk, vcallmain
mainaln
#################################################


Pipeline architecture: 3 blocks and 6 run cases


    |    Alignment block    |    Realignment/recalibration block    |    Variant calling block    |     Perform only alignment    |                        |    
Case 1    |    ANALYSIS=ALIGN    |            ---            |        ---
    |    RESORTBAM=NO    |                        |

    |                |    Perform only realignment/recalibration    |
Case 2    |        ---        |    ANALYSIS=REALIGN_ONLY        |        ---
    |                |    SKIPVCALL=YES            |

    |                |                        |     Perform only variant calling
Case 3    |        ---        |            ---            |     ANALYSIS=VCALL_ONLY
    |                |                        |     SKIPVCALL=NO

    |        Perform alignment + realignment/recalibration            |
    |            ANALYSIS=REALIGN                    |
Case 4    |            SKIPVCALL=YES                    |        ---
    |            RESORTBAM=NO                    |

    |                |            Perform realignment/recalibration + variant calling
Case 5    |        ---        |                ANALYSIS=REALIGN_ONLY
    |                |                SKIPVCALL=NO

    |        Invoke the entire pipeline: alignment + realignment/recalibration + variant calling
Case 6    |                    ANALYSIS=REALIGN
    |                    SKIPVCALL=NO
    |                    RESORTBAM=NO


Runfile options
{OPTION1 | OPTION2} notation means you have a choice between the two options.

## i/o
INPUTDIR=/full/path/to/folder/with/input/files
SAMPLEFILENAMES=/full/path/file.samplenames
OUTPUTDIR=/projects/mayo/GGPSresults/meaningful_output_dir_name
EMAIL=you@email
## choose the run case 
ANALYSIS={ ALIGN | REALIGN | REALIGN_ONLY | VCALL_ONLY }
SKIPVCALL={ YES | NO }
RESORTBAM={ YES | NO }    ## NO is always used if ANALYSIS=REALIGN
## input data
PAIRED={ 1 | 0 }          ## 1 for paired-ended reads, 0 for single-ended reads
READLENGTH=100            ## or whatever the read length is
MULTISAMPLE={ YES | NO }
PROVENANCE={ MULTI_SOURCE | SINGLE_SOURCE }
SAMPLEINFORMATION=end2end multisamples
SAMPLENAMES=multisample1 multisample2  ## etcetera, as many words as there are samples
                                       ## these names must match those in samplenames file
SAMPLEID=sample_id_tag
SAMPLELB=hg19
SAMPLEPL=illumina
SAMPLEPU=sample_pu_tag
SAMPLESM=sample_sm_tag
SAMPLECN=Mayo
TYPE={ exome | whole_genome }
DISEASE=cancer
GROUPNAMES=NA
LABINDEX=-:-
LANEINDEX=1:2
## tools to be used
JAVAMODULE=java-1.6
ALIGNER={ NOVOALIGN | BWA }
SORTMERGETOOL={ NOVOSORT | PICARD }
SNV_CALLER=GATK
SOMATIC_CALLER=SOMATICSNIPPER
## preparatory block
BAM2FASTQFLAG={ YES | NO }        ## NO if input if fastq, or cases 2, 3 or 5 are invoked
BAM2FASTQPARMS=INCLUDE_NON_PF_READS=true
REVERTSAM={ 1 | 0 }
FASTQCFLAG={ YES | NO }
FASTQCPARMS=-t 15 -q
## alignment block parameters
BWAPARAMS=-l 32 -t 16
NOVOPARAMS=-g 60 -x 2 -i PE 425,80 -r Random --hdrhd off -v 120 -c 16
BLATPARAMS=-w 50 -m 70 -t 90
## realign/recalibrate block parameters
REALIGNPARMS=
MARKDUP=YES
REMOVE_DUP=NO
REORDERSAM=NO
REALIGNORDER=1

## regions of interest
CHRINDEX=1:2:3:4         ## or whatever regions of the reference genome are of interest
## variant calling block parameters
PEDIGREE=NA
VARIANT_TYPE=BOTH
UNIFIEDGENOTYPERPARMS=-maxAlleles 5
SNVMIX2PARMS=
SNVMIX2FILTER=-p 0.8
## other parameters - DO NOT EDIT
GENOMEBUILD=hg19
EMIT_ALL_SITES=YES
DEPTH_FILTER=0
TARGETTED=NO
## paths to input output and tools - DO NOT EDIT
REFGENOMEDIR=/projects/mayo/reference
REFGENOME=mayo_novo/allchr.fa
DBSNP=mayo_dbsnp/hg19/dbsnp_135.hg19.vcf.gz
KGENOME=kGenome/hg19/kgenome.hg19.vcf
ONTARGET=/projects/mayo/reference/agilentOnTarget
NOVOINDEX=mayo_novo/allchr.nix
BWAINDEX=mayo_novo/allchr.fa
NOVODIR=/projects/mayo/builds/novocraft
BWADIR=/projects/mayo/builds/bwa-0.5.9
PICARDIR=/projects/mayo/builds/picard-tools-1.77
GATKDIR=/projects/mayo/builds/GATK-1.6-9
SAMDIR=/projects/mayo/builds/samtools-0.1.18
FASTQCDIR=/projects/mayo/builds/FastQC
SCRIPTDIR=/projects/mayo/scripts
SNVMIXDIR=/projects/mayo/builds/SNVMix2-0.11.8-r5
DELIVERYFOLDER=delivery
IGVDIR=IGV_BAM
## pbs resources - DO NOT EDIT
PBSPROJECTID=bf0
PBSNODES=8
PBSTHREADS=16
PBSQUEUEEXOME=normal
PBSQUEUEWGEN=long
PBSCPUALIGNWGEN=240:00:00
PBSCPUALIGNEXOME=48:00:00
PBSCPUOTHERWGEN=240:00:00
PBSCPUOTHEREXOME=48:00:00
Samplenames file format


Note: the sample names must match those in runfile in all cases.
Single bam input
BAM:some_meaningul_samplename=/Full/path/to/inputfilename.bam
The .bam extension is obligatory.
Example:    /projects/mayo/scripts/config/BamInput.samplenames

Single fastq input
FASTQ:samplename=/Full/path/to/reads1.fastq /Full/path/to/reads2.fastq
File names for left reads and right reads are separated by a space.
Example:    /projects/mayo/scripts/config/FastqInput_SingleSample.samplenames

Multiple bam inputs
BAM:some_meaningul_samplename1=/Full/path/to/inputfilename1.bam
BAM:some_meaningul_samplename2=/Full/path/to/inputfilename2.bam
..   etcetera
The sample names field specified here will be used as sampletags during realign/recalibration, unless the user specifies BAM2FASTQFLAG=YES and PROVENANCE=SINGLE_SOURCE. In this case the tag will be whatever already exists in @RG lines of the bam file. 
Examples:    /projects/mayo/scripts/config/BamInput_AlignedOnly_Multisamples.samplenames
        /projects/mayo/scripts/config/BamInput_Multisample.samplenames

Multiple fastq inputs
FASTQ:samplename1=/Full/path/to/reads11.fastq /Full/path/to/reads12.fastq
FASTQ:samplename2=/Full/path/to/reads21.fastq /Full/path/to/reads22.fastq
..   etcetera
File names for left reads and right reads are separated by a space. The order matters: specify left reads immediately after the .=. sign, and write the right reads (if any) on the same line.
Example:    /projects/mayo/scripts/config/FastqInput.samplenames


Step-by-step instructions for setting up the runfile
Case 1: only alignment 
This case covers both single sample and multisample inputs, exome and whole-genome data, bam and fastq input file format. The output is one aligned bam file per sample.

##i/o
Step 1: copy file /projects/mayo/scripts/config/template.runfile to your home directory.
Step 2: provide full path to the folder where input files are located by editing fields INPUTDIR and SAMPLEDIR.
Step 3: provide full path to the samplenames file by editing field SAMPLEFILENAMES.
Step 4: provide full path to the output folder by editing fields OUTPUTDIR.
Step 5: set your email in the field EMAIL.
## choose the run case
Step 6: ANALYSIS=ALIGN.
Step 7: RESORTBAM=NO.
Step 8: leave blank SKIPVCALL=  
## input data
Step 9: specify whether data are paired-ended (PAIRED=1) or single-ended (PAIRED=0).
Step 10: specify read length.
Step 11: if supplying multiple input files, are data independent (MULTISAMPLE=NO) or samples in the same experiment (MULTISAMPLE=YES)?
Step 12: if the data are multisample, do they come from the same source and have the same sample names among the input files (PROVENANCE=SINGLE_SOURCE), or not (PROVENANCE=MULTI_SOURCE)? The option PROVENANCE=SINGLE_SOURCE in the runfile will  cause the pipeline to derive the sample names directly from the @RG tags of the bams, if bam2fastq conversion is performed.
Step 13: specify SAMPLEINFORMATION
Step 14: list sample names separated by space in the field SAMPLENAMES. These must match the names in the samplenames file. If the data are from multiple sources and the sample names do not match among the input files, then just use multisample1, etc, like in the template.
Step 15: edit other fields in this section as is appropriate for the experiment.
## tools to be used
Step 16: choose aligner tool and sort/merge tool, as requested by the PI.
## preparatory block
Step 17: 
if the input files are bam, then set BAM2FASTQFLAG=YES and choose whether to perform picard revertsam before the conversion (REVERTSAM=1) or not (REVERTSAM=0)
if the input files are fastq, then set BAM2FASTQFLAG=NO and leave blank  REVERTSAM=

 
Example:    /projects/mayo/scripts/config/FastqInput_AlignOnly.runfile

Case 2: only realignment /recalibration

##i/o
Step 1: copy file /projects/mayo/scripts/config/template.runfile to your home directory.
Step 2: provide full path to the folder where input files are located by editing fields INPUTDIR and SAMPLEDIR
Step 3: provide full path to the samplenames file by editing field SAMPLEFILENAMES
Step 4: provide full path to the output folder by editing fields OUTPUTDIR
Step 5: set your email in the field EMAIL
## choose the run case
Step 6: ANALYSIS=REALIGN_ONLY
Step 7: SKIPVCALL=YES  
Step 8: set whether to resort the input bam (RESORTBAM=YES) or not (RESORTBAM=NO) 
## input data
Step 9: specify whether data are paired-ended (PAIRED=1) or single-ended (PAIRED=0).
Step 10: specify read length.
Step 11: if supplying multiple input files, are data independent (MULTISAMPLE=NO) or samples in the same experiment (MULTISAMPLE=YES)?
Step 12: leave blank PROVENANCE=
Step 13: specify SAMPLEINFORMATION
Step 14: list sample names separated by space in the field SAMPLENAMES. These must match the names in the samplenames file. If the data are from multiple sources and the sample names do not match among the input files, then just use multisample1, etc, like in the template.
Step 15: edit other fields in this section as is appropriate for the experiment.
## tools to be used
Step 16: choose aligner tool and sort/merge tool, as requested by the PI.
## preparatory block
Step 17: 
leave blank BAM2FASTQFLAG=
if the input bam files have already been realigned and recalibrated, then  REVERTSAM=1; otherwise, REVERTSAM=0
## regions of interest
Step 18: list the regions of interest (CHRINDEX=1:2:3:4 or whatever).


Example:     /projects/mayo/scripts/config/BamInput_RealignOnly_onAlignedOnly_Multisamples.runfile

Case 3: only variant calling

At present, the variant calling block treats all input files as independent, producing .vcf per chromosome, for each input file. This seemed the most meaningful way to separate out the variant calling block. Mayo's original pipeline merges samples after alignment, and from that point on analyzes a single bam file. If we receive multiple bam files, the presumption is that they are
   a) either samples from different experiments, and thus
      should not be variant-called together,
   b) or they are post-alignment bams from samples of
      the same experiment, and they need to be realigned/recalibrated
      to obtain a single bam.

##i/o
Step 1: copy file /projects/mayo/scripts/config/template.runfile to your home directory.
Step 2: provide full path to the folder where input files are located by editing fields INPUTDIR and SAMPLEDIR
Step 3: provide full path to the samplenames file by editing field SAMPLEFILENAMES
Step 4: provide full path to the output folder by editing fields OUTPUTDIR
Step 5: set your email in the field EMAIL
## choose the run case
Step 6: ANALYSIS=VCALL_ONLY
Step 7: SKIPVCALL=NO  
Step 8: leave blank RESORTBAM= 
## input data
Step 9: specify whether data are paired-ended (PAIRED=1) or single-ended (PAIRED=0).
Step 10: specify read length.
Step 11: leave blank MULTISAMPLE=
Step 12: leave blank PROVENANCE=
Step 13: specify SAMPLEINFORMATION
Step 14: list sample names separated by space in the field SAMPLENAMES. These must match the names in the samplenames file. 
Step 15: edit other fields in this section as is appropriate for the experiment.
## tools to be used
Step 16: leave blank ALIGNER= and SORTMERGETOOL=
## preparatory block
Step 17: 
leave blank BAM2FASTQFLAG=
leave blank REVERTSAM=
## regions of interest
Step 18: list the regions for variant calling as appropriate (CHRINDEX=1:2:3:4 or whatever).


Example:     /projects/mayo/scripts/config/BamInput_VariantcallOnly_Multisample.runfile


Case 4: alignment + realignment /recalibration

##i/o
Step 1: copy file /projects/mayo/scripts/config/template.runfile to your home directory.
Step 2: provide full path to the folder where input files are located by editing fields INPUTDIR and SAMPLEDIR
Step 3: provide full path to the samplenames file by editing field SAMPLEFILENAMES
Step 4: provide full path to the output folder by editing fields OUTPUTDIR
Step 5: set your email in the field EMAIL
## choose the run case
Step 6: ANALYSIS=REALIGN
Step 7: SKIPVCALL=YES  
Step 8: RESORTBAM=NO 
## input data
Step 9: specify whether data are paired-ended (PAIRED=1) or single-ended (PAIRED=0).
Step 10: specify read length.
Step 11: if supplying multiple input files, are data independent (MULTISAMPLE=NO) or samples in the same experiment (MULTISAMPLE=YES)?
Step 12: if the data are multisample, do they come from the same source and have the same sample names among the input files (PROVENANCE=SINGLE_SOURCE), or not (PROVENANCE=MULTI_SOURCE)? The option PROVENANCE=SINGLE_SOURCE in the runfile will  cause the pipeline to derive the sample names directly from the @RG tags of the bams, if bam2fastq conversion is performed.
Step 13: specify SAMPLEINFORMATION
Step 14: list sample names separated by space in the field SAMPLENAMES. These must match the names in the samplenames file. If the data are from multiple sources and the sample names do not match among the input files, then just use multisample1, etc, like in the template.
Step 15: edit other fields in this section as is appropriate for the experiment.
## tools to be used
Step 16: choose aligner tool and sort/merge tool, as requested by the PI
## preparatory block
Step 17: 
if the input files are bam, then set BAM2FASTQFLAG=YES and choose whether to perform picard revertsam before the conversion (REVERTSAM=1) or not (REVERTSAM=0)
if the input files are fastq, then set BAM2FASTQFLAG=NO and leave blank  REVERTSAM=
## regions of interest
Step 18: list the regions for variant calling as appropriate (CHRINDEX=1:2:3:4 or whatever).


Example:    /projects/mayo/scripts/config/FastqInput_AlignRealign.runfile

Case 5: realignment /recalibration + variant calling

##i/o
Step 1: copy file /projects/mayo/scripts/config/template.runfile to your home directory.
Step 2: provide full path to the folder where input files are located by editing fields INPUTDIR and SAMPLEDIR
Step 3: provide full path to the samplenames file by editing field SAMPLEFILENAMES
Step 4: provide full path to the output folder by editing fields OUTPUTDIR
Step 5: set your email in the field EMAIL
## choose the run case
Step 6: ANALYSIS=REALIGN_ONLY
Step 7: SKIPVCALL=NO  
Step 8: set whether to resort the input bam (RESORTBAM=YES) or not (RESORTBAM=NO) 
## input data
Step 9: specify whether data are paired-ended (PAIRED=1) or single-ended (PAIRED=0).
Step 10: specify read length.
Step 11: if supplying multiple input files, are data independent (MULTISAMPLE=NO) or samples in the same experiment (MULTISAMPLE=YES)?
Step 12: leave blank PROVENANCE=
Step 13: specify SAMPLEINFORMATION
Step 14: list sample names separated by space in the field SAMPLENAMES. These must match the names in the samplenames file. If the data are from multiple sources and the sample names do not match among the input files, then just use multisample1, etc, like in the template.
Step 15: edit other fields in this section as is appropriate for the experiment.
## tools to be used
Step 16: choose aligner tool and sort/merge tool, as requested by the PI.
## preparatory block
Step 17: 
leave blank BAM2FASTQFLAG=
if the input bam files have already been realigned and recalibrated, then  REVERTSAM=1; otherwise, REVERTSAM=0
## regions of interest
Step 18: list the regions for variant calling as appropriate (CHRINDEX=1:2:3:4 or whatever).


Example:    /projects/mayo/scripts/config/BamInput_RealignVariantcall.runfile


Case 6: entire pipeline = alignment + realignment /recalibration + variant calling

##i/o
Step 1: copy file /projects/mayo/scripts/config/template.runfile to your home directory.
Step 2: provide full path to the folder where input files are located by editing fields INPUTDIR and SAMPLEDIR.
Step 3: provide full path to the samplenames file by editing field SAMPLEFILENAMES.
Step 4: provide full path to the output folder by editing fields OUTPUTDIR.
Step 5: set your email in the field EMAIL.
## choose the run case
Step 6: ANALYSIS=REALIGN.
Step 7: RESORTBAM=NO.
Step 8: SKIPVCALL=NO  
## input data
Step 9: specify whether data are paired-ended (PAIRED=1) or single-ended (PAIRED=0).
Step 10: specify read length.
Step 11: if supplying multiple input files, are data independent (MULTISAMPLE=NO) or samples in the same experiment (MULTISAMPLE=YES)?
Step 12: if the data are multisample, do they come from the same source and have the same sample names among the input files (PROVENANCE=SINGLE_SOURCE), or not (PROVENANCE=MULTI_SOURCE)? The option PROVENANCE=SINGLE_SOURCE in the runfile will  cause the pipeline to derive the sample names directly from the @RG tags of the bams, if bam2fastq conversion is performed.
Step 13: specify SAMPLEINFORMATION
Step 14: list sample names separated by space in the field SAMPLENAMES. These must match the names in the samplenames file. If the data are from multiple sources and the sample names do not match among the input files, then just use multisample1, etc, like in the template.
Step 15: edit other fields in this section as is appropriate for the experiment.
## tools to be used
Step 16: choose aligner tool and sort/merge tool, as requested by the PI.
## preparatory block
Step 17: 
if the input files are bam, then set BAM2FASTQFLAG=YES and choose whether to perform picard revertsam before the conversion (REVERTSAM=1) or not (REVERTSAM=0)
if the input files are fastq, then set BAM2FASTQFLAG=NO and leave blank  REVERTSAM=
## variant calling block parameters
Step 18: list the regions for variant calling as appropriate (CHRINDEX=1:2:3:4 or whatever).


Example:    /projects/mayo/scripts/config/FastqInput_AlignRealignVariantcall.runfile