# Running SwarmGenomics

SwarmGenomics offers two ways to process whole-genome data, depending on your workflow preference and time constraints:

**1. Step-by-Step Execution** – Run each module and step separately, allowing for greater flexibility in selecting specific analyses. This is ideal for customized workflows or for focusing on particular aspects of the genome.

**2. Fast-Track Execution** – Use pre-prepared scripts to automate multiple steps in a streamlined process. This option is designed to be easier and more efficient.

Choose the method that best suits your research needs, depending on whether you prefer full control over each step or a faster, automated approach.

Below, you'll find detailed instructions and scripts for the fast-track execution. Otherwise, go to step **01. Getting started** for the slower step-by-step approach.

## Fast-Track SwarmGenomics

The fast-track version consists of multiple Bash scripts that perform different population-genomic and assembly-based analyses (preprocessing, idxstats, heterozygosity, RoHs, PSMC, Circos, mitogenomes, etc.).

All scripts are controlled through **a single shared parameter file (params.txt)**. You generally **do not edit the scripts themselves**.

### The params.txt file

#### Purpose

params.txt defines:

- Where data live
- Which tools are used
- How analyses are parameterized

It is sourced by every script using:

```
source params.txt
```

#### Sections:

##### Computational resources:

```
THREADS=10
```

Controls how many CPU threads are used by tools that support multithreading.

You can check how many CPU cores (threads) are available with:

```
nproc
```

##### Working directories

```
WORKING_DIR="/vol/storage/swarmgenomics"
TOOL_DIR="/vol/storage/software"
SCRIPTS="/vol/storage/scripts"
```

- `WORKING_DIR`: Root directory for all species and results
- `TOOL_DIR`: Where third-party software is installed
- `SCRIPTS`: Location of helper scripts (R, Python, plotting)

These must be edited before running anything.
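One way to confirm the edited paths before launching a long job is a small check along the lines of the sketch below. The function `check_params` is hypothetical and not part of SwarmGenomics; it only assumes that params.txt defines `WORKING_DIR`, `TOOL_DIR` and `SCRIPTS` as plain shell variables, as in the example above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of SwarmGenomics): report whether the three
# base directories set in params.txt exist on this machine.
# The function body runs in a subshell, so sourcing params.txt does not
# change variables in the calling shell.
check_params() (
    params_file="$1"
    missing=0

    # Load the variables exactly as the pipeline scripts do.
    source "$params_file" || exit 2

    for var in WORKING_DIR TOOL_DIR SCRIPTS; do
        dir="${!var:-}"            # indirect expansion: value of the variable named in $var
        if [ -z "$dir" ]; then
            echo "UNSET   $var"
            missing=1
        elif [ ! -d "$dir" ]; then
            echo "MISSING $var=$dir"
            missing=1
        else
            echo "OK      $var=$dir"
        fi
    done

    exit "$missing"
)

# Usage (run from the directory that contains params.txt):
#   check_params params.txt
```

The function returns 0 when all three directories exist and 1 otherwise, so it can be chained with `&&` in front of a script call if that suits your workflow.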
##### Species-specific directories (automatic)

```
FASTQ_DIR="${WORKING_DIR}/${SPECIES}/fastq"
VCF_DIR="${WORKING_DIR}/${SPECIES}/vcf"
RESULTS_DIR="${WORKING_DIR}/${SPECIES}/results"
```

You **do not edit** these manually.

`${SPECIES}` is supplied as a command-line argument to each script. Directories are created automatically if they don't exist.

##### Tool paths

```
BCFTOOLS="${TOOL_DIR}/bcftools-1.19"
SRATOOLS="${TOOL_DIR}/sratoolkit.3.1.1-centos_linux64/bin"
PSMC="${TOOL_DIR}/psmc/psmc"
```

Each script checks that the required tools exist before running. If a tool path is wrong, the script will fail early with a clear error.

##### Analysis-specific parameters

Each module has its own section, for example:

- Preprocessing
  - Trimming thresholds
  - Variant calling filters
- Heterozygosity
  - Number of scaffolds to plot
- RoHs
  - Minimum genotype quality
- PSMC
  - Iterations, time pattern, mutation rate

### Running the scripts

The commands to run all the scripts follow this pattern:

```
./script_name.bash <species> <input(s)> params.txt
```

Examples of inputs:

- Reference genome (.fna or .fna.gz)
- BAM file
- VCF file
- SRA accession

The **species name is always the first argument** and determines the working directory.

#### Making scripts executable

Before running any script:

```
chmod +x *.bash
```

If scripts were edited on Windows (which should not normally be necessary):

```
dos2unix *.bash
```

#### Running long jobs

Many analyses are computationally intensive and should be run in the background:

```
nohup ./script_name.bash ... params.txt &
```

Monitor progress with:

```
tail -f nohup.out
```

## Preprocessing

By now you should already have all the software installed as instructed in **0. Installations**. Before starting the [**preprocessing.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/preprocessing.bash) script, you should download the reference genome in FASTA format and the FASTQ files.

### Downloading data

This is an example with the golden eagle genome.
Change the details according to your species. You can use a text editor, such as Notepad, to edit the [**preprocessing.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/preprocessing.bash) script and save it.

#### Create a directory for your species in your working directory

When running SwarmGenomics, it's important to be in the right folder (directory), known as your working directory. This is where the commands create files, download data, and look for input files. If you're not in the correct directory, things may not work as expected, or files might end up in the wrong place.

In the SwarmGenomics course the working directory will be /vol/storage/swarmgenomics/your_name/

```
mkdir golden_eagle
```

#### Change to that directory

```
cd golden_eagle
```

#### Download the reference genome from https://www.ncbi.nlm.nih.gov/genome/

When you find your species, go to FTP and copy the link address of the file that ends with *.fna.gz*

```
# General form: wget -O ./reference_genome.fna.gz <copied link address>

# Golden eagle example (output name matches the one passed to preprocessing.bash later)
wget -O /vol/storage/swarmgenomics/golden_eagle/golden_eagle_reference.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/496/995/GCF_900496995.4_bAquChr1.4/GCF_900496995.4_bAquChr1.4_genomic.fna.gz
```

#### Choose and download the SRA file by accession number from https://www.ncbi.nlm.nih.gov/sra

Copy the run ID

```
prefetch ERR3316068
```

If you get an error due to file size, use:

```
prefetch --verbose --force all accession_number --max-size 50G
```

### Run preprocessing.bash with params.txt

1. Download the following files from the repository:
   - [**preprocessing.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/preprocessing.bash)
   - [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt)

   You should not edit the script itself. All paths and settings are controlled via params.txt.

2. Edit params.txt

   Open params.txt and update the required paths and resources.
```
# Check how many CPU cores (threads) are available
nproc
```

Upload the edited [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) and [**preprocessing.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/preprocessing.bash).

```
# Make the script executable with:
chmod +x preprocessing.bash

# If the script was written or edited on Windows and is then run on a Unix system, run:
dos2unix preprocessing.bash
```

The script takes around 24 h to run, depending on available resources, so you should use nohup and & to keep it running in the background even if the terminal is closed.

```
nohup ./preprocessing.bash "species" "golden_eagle_reference.fna.gz" "SRA_file_ID" "params.txt" &

# E.g. with golden eagle
nohup ./preprocessing.bash "golden_eagle" "/vol/storage/swarmgenomics/golden_eagle/golden_eagle_reference.fna.gz" "ERR3316068" "params.txt" &
```

## Genomic analysis modules

The genomic analysis modules follow the same principle and use the same parameters file. Each section, starting at [04. Genome features](https://github.com/AureKylmanen/Swarmgenomics/blob/main/04.%20Genome%20features.md), includes instructions on how to download the required tools and how to use the scripts.
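Since the modules are launched in the same way as preprocessing.bash, it can help to give each background job its own log file and to keep its process ID, rather than letting everything share nohup.out. The sketch below is one possible way to do that; `run_in_background` and `is_running` are hypothetical helper names, not part of SwarmGenomics, and the log file name in the usage comment is only an example.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of SwarmGenomics) for launching and checking
# long-running jobs.

# Start a command with nohup, send stdout and stderr to the given log file,
# and print the process ID of the background job.
run_in_background() {
    local log="$1"
    shift
    nohup "$@" > "$log" 2>&1 &
    echo "$!"
}

# Succeed if a process with the given ID still exists.
# Signal 0 only tests for existence; it does not affect the job.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Usage, reusing the golden eagle command from above:
#   pid=$(run_in_background preprocessing.log ./preprocessing.bash "golden_eagle" \
#       "/vol/storage/swarmgenomics/golden_eagle/golden_eagle_reference.fna.gz" \
#       "ERR3316068" "params.txt")
#   is_running "$pid" && echo "still running"
#   tail -f preprocessing.log
```

Because the output is redirected explicitly, nohup does not create nohup.out in this case, so progress should be followed in the chosen log file instead.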