---
title: Variant Calling without a Reference
layout: page
aside:
  toc: true
---

In this tutorial, I use [GBS-SNP-CROP](https://github.com/halelab/GBS-SNP-CROP)
to identify variants in a population of tetraploid individuals without a
reference genome, starting with the FASTQ files.  I then export the allelic
read depth to VCF for genotype calling downstream.

## Preparing the sequencing data

This section shows how I prepared the data for this tutorial.  I did it a little
differently from how you might, partly because my lab uses custom adapters and
partly because I only wanted to use a few samples in order to create a small
dataset that would process quickly.  You might instead follow the GBS-SNP-CROP
[tutorial](https://github.com/halelab/GBS-SNP-CROP/wiki/GBS-SNP-CROP-User-Manual-(v.4.0))
directly for your own data.

My RAD-seq library was sequenced using single-end reads on a HiSeq 4000, with
inline barcodes (*i.e.* the barcode is at the beginning of each read, followed
by the remainder of the restriction cut site).  There are many tools available
for demultiplexing such data, including
[Fastx Toolkit](http://hannonlab.cshl.edu/fastx_toolkit/),
[Stacks](http://catchenlab.life.illinois.edu/stacks/), and
[GBS-SNP-CROP](https://github.com/halelab/GBS-SNP-CROP).  I used my own software
[TagDigger](https://github.com/lvclark/tagdigger) because it has the custom
adapters that my lab uses built into it, and will trim them out while performing
the demultiplexing.  The CSV passed to the `-b` argument
lists input file, barcode, and output file for every sample.  The argument `-a`
indicates what adapter set was used.  TagDigger and all of the other tools
mentioned generate one FASTQ file for each barcode in the original file, with
the barcode sequence removed.

I ran the code below using Windows Command Prompt, but it would also work in bash.

``` bash
python barcode_splitter_script.py -b 190419remainingMsa_barcodekey_Hiseq4000.csv -a PstI-MspI-Clark
```

Out of the files that were generated, we will use the following ones for further
analysis.  These are all tetraploid, non-hybrid *Miscanthus sacchariflorus*
accessions.  If you wish to download the FASTQ files in order to reproduce this
entire tutorial, the run accession numbers on NCBI-SRA are provided.

| File                    | Sample       | Provenance  | SRA        |
|-------------------------|--------------|-------------|------------|
| DOEMsa56a_CCTCCAGA.fq   | JM-2014-H-28 | Japan       | SRR8997766 |
| DOEMsa56a_CCATAAG.fq    | JM-2014-K-8  | Japan       | SRR8997767	|
| DOEMsa56a_GGTGGCCA.fq   | JM-2014-M-5  | Japan       | SRR8997511 |
| DOEMsa56a_GCGGTCCA.fq   | JM-2014-S-5  | Japan       | SRR8997617 |
| DOEMsa56a_TTGTTGTCTA.fq | JY134        | China       | SRR8997716	|
| DOEMsa56a_CCACCATCAG.fq | JY185        | China       | SRR8997496 |
| DOEMsa56a_GGTGCCA.fq    | KMS257       | South Korea | SRR8997613 |
| DOEMsa56a_TTCTGACA.fq   | RU2012-055   | Russia      | SRR8991820 |

The sequences then need to all be trimmed to the same length, in particular
because the barcodes were variable length and we don't want to call SNPs at the
end of the read where there will be a lot of missing data.  We also want to get
rid of any low-quality base calls to reduce the number of sequencing errors
called as SNPs.
[Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) can accomplish
this trimming, and we'll use it here also because it is what is used internally
by GBS-SNP-CROP.  We will trim the sequences to a maximum of 80 nucleotides,
then clip of low-quality sequence at the beginning or end, then discard reads
with fewer than 36 nucleotides.

Below is a bash script designed to run on a Linux cluster using Lmod to manage
software.  On your own Linux server, you might have to run
`java -jar /path/to/trimmomatic-0.38.jar` rather than just `trimmomatic`.

``` bash
#!/bin/bash

module load Trimmomatic

infiles=("DOEMsa56a_CCTCCAGA.fq" "DOEMsa56a_CCATAAG.fq"
         "DOEMsa56a_GGTGGCCA.fq" "DOEMsa56a_GCGGTCCA.fq"
         "DOEMsa56a_TTGTTGTCTA.fq" "DOEMsa56a_CCACCATCAG.fq"
         "DOEMsa56a_GGTGCCA.fq" "DOEMsa56a_TTCTGACA.fq")
outfiles=("JM-2014-H-28.R1.fq" "JM-2014-K-8.R1.fq"
          "JM-2014-M-5.R1.fq" "JM-2014-S-5.R1.fq"
          "JY134.R1.fq" "JY185.R1.fq" "KMS257.R1.fq" "RU2012-055.R1.fq")

for ((i=0;i<${#infiles[@]};++i)); do
    trimmomatic SE -threads 1 -phred64 ${infiles[i]} ${outfiles[i]} CROP:80 LEADING:3 TRAILING:3 MINLEN:36
done

gzip demultiplexed/*.R1.fq
```

The `gzip` line is necessary because GBS-SNP-CROP expects gzipped files.

## Assembling the mock reference

We'll make a tab-delimited text file, that we'll call `Msa_barcodes.txt`,
which will indicate the samples in this experiment.  The last column with `YES`
and `NO` indicates which samples to use for building the mock reference.  Two of
the files are smaller than the rest, which is why they are excluded.

```
CCTCCAGA	JM-2014-H-28	YES
CCATAAG	JM-2014-K-8	YES
GGTGGCCA	JM-2014-M-5	NO
GCGGTCCA	JM-2014-S-5	YES
TTGTTGTCTA	JY134	YES
CCACCATCAG	JY185	NO
GGTGCCA	KMS257	YES
TTCTGACA	RU2012-055	YES
```

Now we will use GBS-SNP-CROP to build the mock reference.

``` bash
#!/bin/bash

module load PEAR
module load VSEARCH
module load Perl

perl GBS-SNP-CROP-4.pl -pr pear -vs vsearch -d SE -b Msa_barcodes.txt -t 4 -rl 80 -MR Msa.MR
```

Again, we are loading `pear` and `vsearch` using Lmod, so we can use the program
names directly in place of specifying a path to them.  The argument `-d SE`
incidates that we have single end reads.  We will specify four processing cores
using `-t 4`, and that our reads are 80 nt long using `-rl 80`.  Finally, we
specify the file prefix for our mock reference genome using `-MR Msa.MR`.
("Msa" being an abbreviation that the Sacks lab commonly uses for
*Miscanthus sacchariflorus*.)

Once this is done running, there are some new files in your main directory,
including `Msa.MR.Genome.fa`, `Msa.MR.Clusters.fa`, and `PosToMask.txt`.
We can use `less` to inspect them.

``` bash
less Msa.MR.Genome.fa
```

![Beginning of Msa.MR.Genome.fa](calling_without_reference_images/Msa_MR_Genome_fa.png)

`Msa.MR.Genome.fa` is what will be used as the reference genome when we align to Msa.
It contains all of the tag clusters separated using strings of `A` for spacers.

``` bash
less Msa.MR.Clusters.fa
```

![Beginning of Msa.MR.Clusters.fa](calling_without_reference_images/Msa_MR_Clusters_fa.png)

`Msa.MR.Clusters.fa` isn't strictly needed for GBS-SNP-CROP, but is there in case
downstream software requires a reference genome.

``` bash
less PosToMask.txt
```

![Beginning of PosToMask.txt](calling_without_reference_images/PosToMask.png)

`PosToMask.txt` contains positions of all spacer nucleotides in `Msa.MR.Genome.fa`.

## Aligning reads to the mock reference and identifying SNPs

The next script in GBS-SNP-CROP uses the alignment software BWA to align every
sequence read to the mock reference that we just created.  It then uses SAMtools
to sort and index the alignments, then identify potential SNPs.  The `-q`, `-Q`,
`-f`, `-F`, and `-Opt` parameters get passed to SAMtools.

``` bash
#!/bin/bash

module load BWA
module load SAMtools
module load Perl

perl GBS-SNP-CROP-5.pl -bw bwa -st samtools -d SE -b Msa_barcodes.txt -ref Msa.MR.Genome.fa -Q 30 -q 0 -f 0 -F 2308 -t 4 -Opt 0
```

The most important output from this script is the set of files ending in
`.mpileup`, which contain information about how reads from each sample correspond
to alleles and loci.

![Mpileup output for JM-2014-H-28](calling_without_reference_images/mpileup.png)

At position 1 in the mock reference, where the reference base is a T, there were
8000 reads that aligned for JM-2014-H-28.  The rest of the information is not
particularly human-readable, however.

Another script then converts the information from the mpileup files into a
more readable format for further processing.  For the sake of having a smaller
dataset to look at in this tutorial, we will only look at SNPs and not indels
(`-p snp`).

``` bash
#!/bin/bash

module load Perl

perl GBS-SNP-CROP-6.pl -b Msa_barcodes.txt -out Msa.MasterMatrix.txt -p snp -t 4
```

![Count output for JM-2014-H-28](calling_without_reference_images/count.png)

Here we see how many reads corresponded to A, C, G, and T at each position.
Scrolling through the file we can see how depth varied from tag to tag.

![Count output for JM-2014-H-28](calling_without_reference_images/count2.PNG)

Additionally, one file was created combining this information from all samples.

![Master matrix of counts](calling_without_reference_images/mastermatrix.PNG)

## Filtering markers

Now we will filter the markers and output them to a genotype matrix.  Although
GBS-SNP-CROP only makes diploid gentoype calls, which we will not be using, we
still want to tweak the parameters so that it can make these calls, because it
will throw out SNPs below a given call rate.  Here are all the parameters and
why I set them as I did:

* `-mnHoDepth0 20`: The default value of `5` assumes diploidy, where a
heterozygote will have alleles in a 1:1 ratio, so the probability of a genotype
with five reads having only reads of the first allele is 3%.  In tetraploids
we will frequently see a 3:1 ratio, so we would need twelve reads to have that
same 3% probability.  Since we seem to have pretty high depth in this dataset,
I increased the number from 12 to 20.
* `-mnHoDepth1 50`: When there is just one read of an allele, that read might
be due to contamination, or it might indicate a true heterozygote.  Again,
because we expect heterozygotes that are 3:1 rather than 1:1, we'll set a
higher threshold than the default for determining that a read represents
contamination.
* `-mnHetDepth 3`: I didn't see a particular reason to change this from the
default.  If we see at least three reads, we can start to believe that the
allele is really there.
* `-altStrength 0.8`: Another one that I kept at the default.  If there are
multiple alternative alleles, we want most of them to belong to one allele
so that GBS-SNP-CROP can convert it to a biallelic marker.
* `-mnAlleleRatio 0.125`: This is used for catching paralogous loci that
were collapsed into single loci.  In a diploid system, we expect a depth
ratio of 0.5, but because it can vary due to random chance, the default is to
allow it to be as low as 0.25.  Since we have a tetraploid and are expecting
0.25, I'll lower it to 0.125 to allow sampling error.  If we see allele depth
ratios much below that, it could indicate that a locus isn't really just one
locus.
* `-mnCall 0.75`: The default is fine; wanting 75% of samples to have calls for
a given SNP is typical.
* `-mnAvgDepth 20 `: Since we have very high depth, we will set this value higher
than the default of `3`.
* `-mxAvgDepth 3000`: Since we have very high depth, we will set this value
higher than the default of `200`.

``` bash
#!/bin/bash

module load Perl

perl GBS-SNP-CROP-7.pl -in Msa.MasterMatrix.txt -out Msa.GenoMatrix.txt -p snp -mnHoDepth0 20 -mnHoDepth1 50 -mnHetDepth 3 -altStrength 0.8 -mnAlleleRatio 0.125 -mnCall 0.75 -mnAvgDepth 20 -mxAvgDepth 3000
```

We can see a log of what happened, showing that we got almost 38K SNPs.

![Log from step 7](calling_without_reference_images/genomat_log.PNG)

This is what the output genotype matrix looks like.  It contains genotype calls
and allelic read depth.

![Genotype matrix](calling_without_reference_images/GenoMatrix.png)

## Exporting to VCF

Finally, we'll convert the genotype matrix to a VCF.

``` bash
#!/bin/bash

module load Perl

perl GBS-SNP-CROP-8.pl -in variants/Msa.GenoMatrix.txt -out Msa_GSCdemo -b Msa_barcodes.txt -formats vcf
```

![Genotypes as VCF](calling_without_reference_images/vcf_output.png)

This is the file that we will keep and use for further analysis with other software.