# Preparing directories and downloading data

## Objectives

1. Create directories for storing your data
2. Download the reference genome of your choice
3. Download the SRA file for your species
4. Prepare the data for mapping

## Downloading data

This example uses the Golden eagle genome. Change the details according to your species. You can use a text editor, such as Notepad, to edit the script and save it.

### Choosing a Reference Genome on NCBI

1. Go to [NCBI Genome](https://www.ncbi.nlm.nih.gov/genome/) and search for your species of interest.
2. You may see several genome assemblies available. Look for:
   - Green badge → indicates the official Reference Genome.
   - RefSeq assemblies → preferred over GenBank, as they are curated and standardized by NCBI.
   - Chromosome-level assemblies → if available, these are higher quality than scaffold- or contig-level assemblies.
3. If multiple options exist, choose the most recent and well-annotated RefSeq assembly at chromosome level (if possible).

#### Create a directory for your species (your working directory)

```
mkdir golden_eagle
```

#### Downloading the reference genome

1. Click on the chosen assembly and navigate to the FTP page.
2. Right-click to copy the link to the file ending in **_genomic.fna.gz**
3. In the terminal, use the `wget` command to download the file through the link

```
# If you are trying out SwarmGenomics and using a single genome you can give the reference genome a new name, as below
wget -O /vol/storage/swarmGenomics/golden_eagle/ac_reference.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/496/995/GCF_900496995.4_bAquChr1.4/GCF_900496995.4_bAquChr1.4_genomic.fna.gz

# Or use the actual reference genome e.g.:
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/496/995/GCF_900496995.4_bAquChr1.4/GCF_900496995.4_bAquChr1.4_genomic.fna.gz
```

### Choosing and downloading SRA file from https://www.ncbi.nlm.nih.gov/sra

1. Go to https://www.ncbi.nlm.nih.gov/sra and search for your species
2. Filter the results:
   - Type: genome
   - Library Layout: paired
   - Platform: Illumina
   - **Avoid metagenomics data**
3. Also look into the BioSample data to get information on:
   - Location (geographical information, but also whether it is a wild or zoo sample)
   - Tissue (blood, muscle etc.)
   - Sex
4. Choose a large file (more reads) to ensure high coverage for more reliable results
5. Use the accession number to download the file
   - This may take a long time due to the large size of the file
   - Once the file has downloaded, you should have one .sra file in the directory

```
prefetch ERR3316068
```

If you get an error because of the file size, raise `--max-size` to allow for bigger files:

```
prefetch --verbose --force all accession_number --max-size 70G
```

## Starting preprocessing

You can run through each individual script manually as below, or download and edit the [**preprocessing.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/preprocessing.bash) script to automatically perform stages 1-3. See the instructions for that in [001. Fast-track SwarmGenomics](https://github.com/AureKylmanen/Swarmgenomics/blob/main/001.%20Fast-track%20SwarmGenomics.md)

The next steps will:

1. Convert the SRA file into two FASTQ files
2. Perform quality checks
3. Trim the low quality reads and adapter regions

#### Converting into fastq

Begin by converting the SRA file into more usable FASTQ files.

```
fasterq-dump --outdir /path/to/working_dir/fastq --split-files -e 13 /path/to/working_dir/SRA/file.SRA

# Example:
# fasterq-dump --outdir /vol/storage/swarmGenomics/golden_eagle/fastq --split-files -e 13 /vol/storage/swarmGenomics/golden_eagle/ERR3316068/ERR3316068.sra
```

#### Quality control

Change the number of threads (`-t`) according to how many you have available.

```
fastqc -t 9 -o /vol/storage/swarmGenomics/golden_eagle -f fastq /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_1.fastq /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_2.fastq
```

#### Check output

First copy the FastQC output to your local storage. You can also use FileZilla (https://filezilla-project.org/)

```
scp -r -P 30189 -i '/path/to/your/private_key.txt' ubuntu@134.176.27.78:/vol/storage/swarmGenomics/golden_eagle/ERR3316068_1_fastqc* .
```

Then open the html file and have a look at the output. You can look up information on FastQC to see how to interpret the results.

#### Trimming

Trimmomatic removes low quality reads and adapter sequences.

```
trimmomatic PE -threads 13 \
  /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_1.fastq \
  /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_2.fastq \
  /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_1_paired.fastq.gz \
  /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_1_unpaired.fastq.gz \
  /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_2_paired.fastq.gz \
  /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_2_unpaired.fastq.gz \
  SLIDINGWINDOW:4:20 MINLEN:25 \
  ILLUMINACLIP:/vol/storage/software/trimmomatics/adapters.fa:2:30:10:2:keepBothReads
```

Have a look at http://www.usadellab.org/cms/index.php?page=trimmomatic if you're curious about the different steps.

#### Second quality control

Adapt the command from the first quality control to run it on the output files from Trimmomatic, and then copy the results to your local storage for checking as you did before.

#### Create a fastqc directory and move your fastqc files there

Commands you need:

```
mkdir
mv
```
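As a rough illustration of how those two commands could be combined, here is a minimal sketch. It assumes that the FastQC reports follow the usual `*_fastqc.html` / `*_fastqc.zip` naming and sit directly in your working directory; the function name `collect_fastqc` is hypothetical and not part of the SwarmGenomics scripts.

```shell
# Hypothetical helper: gather FastQC reports into a fastqc/ subdirectory.
# Usage: collect_fastqc /path/to/working_dir
collect_fastqc() {
    local workdir="$1"
    local f

    # Create the target directory; -p avoids an error if it already exists.
    mkdir -p "$workdir/fastqc"

    # Move every FastQC report (HTML and zip) found in the working directory.
    for f in "$workdir"/*_fastqc.html "$workdir"/*_fastqc.zip; do
        # Skip the literal pattern when no file matches it.
        [ -e "$f" ] || continue
        mv "$f" "$workdir/fastqc/"
    done
}
```

For the example in this guide, the call would be `collect_fastqc /vol/storage/swarmGenomics/golden_eagle`; adjust the path to your own working directory.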