# Genome Visualisation with Circos
This module generates a circular visualization of the genome’s structural and sequence features using Circos. By default, the 20 longest scaffolds are plotted to give an overview of key genomic attributes: coverage, GC content, repeat density, and heterozygosity.

## What you need
Input data
- **Reference genome** (.fna) with index file (.fai)
- **BAM file** of mapped reads (for coverage calculation)
- **VCF file** for heterozygosity
- **Repeat annotation file** (repeats.bed). This is generated earlier in the pipeline, but instructions for creating it are also given below.
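
Before starting, it can help to confirm that every input listed above is present and non-empty. The following is a minimal sketch only: `check_inputs` is a hypothetical helper that is not part of the pipeline, and the file names in the usage comment are placeholders taken from the examples later in this document.

```shell
# Hypothetical helper (not part of the pipeline):
# report which of the given input files are missing or empty.
check_inputs() {
    local missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (placeholder file names; adjust to your data):
# check_inputs reference.fna reference.fna.fai bwa.sorted.bam output.vcf.gz repeats.bed
#
# If the .fai index is missing and samtools is installed, it can be created with:
# samtools faidx reference.fna
```

The function returns a non-zero status if any file is missing, so it can be used as a guard at the top of your own wrapper script.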
  
### Installations
You will need to download the [**circos.conf**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos.conf) file for plotting and install the required software.

Circos software: 
```
# Change to the directory where you want to install Circos
# E.g. /vol/storage/software
cd /vol/storage/software

# Get the latest circos download from https://circos.ca/software/download/
wget --no-check-certificate https://circos.ca/distribution/circos-0.69-9.tgz
tar -xvzf circos-0.69-9.tgz
```

For more instructions on installation visit https://circos.ca/software/installation/
```
# Install dependencies
# Install libraries
sudo apt-get install -y libgd-dev

# Install Perl GD module with Conda
conda install -c conda-forge perl-gd

# Install perl-Params-Validate module with Conda
conda install -c conda-forge perl-params-validate

```
Install Perl modules
```
# Enter CPAN shell 
cpan

# Install modules
install Readonly
install Font::TTF::Font
install Math::Bezier
install Math::Round
install Config::General
install GD
install Set::IntSpan
install List::MoreUtils
install GD::Polyline
install Math::VecStat
install SVG
install Params::Validate
install Regexp::Common
install Text::Format
install Statistics::Basic

# To exit
exit 
```
You will also need Bedtools and Bcftools, which you should already have installed (otherwise check [00.Installations](https://github.com/AureKylmanen/Swarmgenomics/blob/main/0.%20Installations.md) for instructions), as well as mosdepth:
```
conda install bioconda::mosdepth
```
## Running circos.bash
You will need to download and edit the [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, and download [**circos.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos.bash), [**circos.conf**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos.conf), [**legend**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/legend.png), and [**overlay.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/overlay.py).

For the Circos plot, you can choose:
- the bin size
- the number of scaffolds visualised
- the naming method
	- ```LG``` to name the scaffolds LG1, LG2, LG3, etc.
	- ```SCAFFOLD``` to keep the original scaffold names
	- ```PREFIX``` to use a preferred prefix such as chr, giving chr1, chr2, chr3 (note that numbering follows size order and may not match actual chromosome names)
	- ```CUSTOM``` to supply your own file with custom names
		- Specified with ```CHR_NAME_MAP=""```
- the colour of the tracks
	- Please note that only the default colours match the provided legend, but you will also receive a version of the plot without a legend.
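
To illustrate how the `LG`, `SCAFFOLD` and `PREFIX` modes differ, the sketch below builds karyotype lines in the same format as the manual example later in this document. `make_karyotype` is a hypothetical helper written for illustration; it is not necessarily how circos.bash implements the naming, and `CUSTOM` is left out because the format of the mapping file is not described here.

```shell
# Hypothetical helper (illustration only, not taken from circos.bash):
# build Circos karyotype lines from a .fai-style file (name<TAB>length).
# Arguments: fai file, mode (LG | SCAFFOLD | PREFIX), prefix, colour.
make_karyotype() {
    local fai="$1" mode="$2" prefix="${3:-chr}" colour="${4:-e4daed}"
    awk -v mode="$mode" -v prefix="$prefix" -v col="$colour" '{
        if (mode == "LG")          label = "LG" NR     # LG1, LG2, ...
        else if (mode == "PREFIX") label = prefix NR   # chr1, chr2, ...
        else                       label = $1          # original scaffold name
        print "chr -", $1, label, 0, $2, col
    }' "$fai"
}

# Example usage:
# make_karyotype top20_scaffolds.fai PREFIX chr > karyotype.txt
```

Because the label is derived from the row number, the input should already be sorted by scaffold length, which is why `PREFIX` numbering reflects size order rather than true chromosome names.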

Please also ensure that the path to the circos software is correct.


```
# ============================
# CIRCOS PLOT
# ============================

BIN_SIZE=100000

# Number of plotted scaffolds
TOP_SCAFFOLDS=20

# Options: LG | SCAFFOLD | PREFIX | CUSTOM
CHR_NAMING_MODE=LG

# Used only if CHR_NAMING_MODE=PREFIX
CHR_PREFIX=chr

# Used only if CHR_NAMING_MODE=CUSTOM
# Example: CHR_NAME_MAP=/path/to/chromosome_map.tsv
CHR_NAME_MAP=""

# CIRCOS colours
__KARYO_COL__="e4daed"
__GC_COL__="purple"
__COV_COL__="red"
__HET_COL__="greens-3-seq"
__REPEAT_COL__="blues-3-seq"

CIRCOS_BIN="/vol/storage/software/circos-0.69-9/bin/circos"
CIRCOS_CONF_TEMPLATE="${SCRIPTS}/circos.conf"

CIRCOS_OVERLAY_SCRIPT="${SCRIPTS}/overlay.py"
CIRCOS_LEGEND_IMAGE="${SCRIPTS}/legend.png"

REPEATS_BED="${WORKING_DIR}/${SPECIES}/repeats.bed"

# Circos include paths
CIRCOS_COLORS_CONF="${TOOL_DIR}/circos-0.69-9/etc/colors.conf"
CIRCOS_BREWER_CONF="${TOOL_DIR}/circos-0.69-9/etc/brewer.all.conf"
CIRCOS_FONTS_CONF="${TOOL_DIR}/circos-0.69-9/etc/fonts.conf"
CIRCOS_IMAGE_CONF="${TOOL_DIR}/circos-0.69-9/etc/image.conf"
CIRCOS_HOUSEKEEPING_CONF="${TOOL_DIR}/circos-0.69-9/etc/housekeeping.conf"
```

The script takes min and max values of GC content, coverage, heterozygosity and repeat density and uses those as the absolute min and max for the plotting.

Remember to ```chmod +x circos.bash```. Then run with:

```
./circos.bash "species" "path/to/reference_genome.fna.gz" "params.txt"
```

The results will be copied into the results directory within the species directory. If you wish to replot existing results, just run the script again. If you also need to regenerate the input files, remove the existing bins.bed, karyotype.txt, coverage.txt, gc_content.txt, repeat_density.txt, and heterozygosity_density.txt files before running the script again.
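
Removing the intermediate files can be wrapped in a small function. This is a minimal sketch: `clean_circos_inputs` is a hypothetical helper, and the directory you pass to it is an assumption, since it depends on where circos.bash wrote the files on your system.

```shell
# Hypothetical helper (not part of the pipeline):
# delete the intermediate Circos input files from a directory (default: current directory).
clean_circos_inputs() {
    local dir="${1:-.}"
    for f in bins.bed karyotype.txt coverage.txt gc_content.txt \
             repeat_density.txt heterozygosity_density.txt; do
        rm -f -- "$dir/$f"
    done
}

# Example usage (placeholder path):
# clean_circos_inputs /path/to/directory/with/input/files
```

Only the six named files are removed; anything else in the directory is left untouched.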

## Preparing data for plotting
In this section we will prepare the data files, which are used for plotting.
```
# Get top 20 longest scaffolds or a number of your choosing
sort -k2,2nr -k1,1 reference.fna.fai | head -n 20 > top20_scaffolds.fai

# Generate karyotype file for Circos (adjust colors as needed)
# This names each scaffold as LG1, LG2 etc.
# Change "e4daed" to any colour of your choosing
awk '{print "chr -", $1, "LG" NR, 0, $2, "e4daed"}' top20_scaffolds.fai > karyotype.txt

# If you want to keep scaffold names use:
# awk '{print "chr -", $1, $1, 0, $2, "e4daed"}' top20_scaffolds.fai > karyotype.txt

# Generate 100kb bins and sort them
bedtools makewindows -g top20_scaffolds.fai -w 100000 | sort -k1,1 -k2,2n > bins.bed

# Get mean coverage per 100 kb window from the BAM file, e.g.:
mosdepth --by 100000 -t 4 sample bwa.sorted.bam
zcat sample.regions.bed.gz > coverage.txt

# Get GC content from the reference genome
bedtools nuc -fi reference.fna -bed bins.bed | awk 'NR>1 {print $1, $2, $3, $5}' > gc_content.txt

# Get the repeat density (number of repeat intervals per bp in each bin) from the repeats.bed file generated earlier
bedtools intersect -a bins.bed -b repeats.bed -c | awk '{bin_size = $3 - $2; density = $4 / bin_size; printf "%s\t%d\t%d\t%.6f\n", $1, $2, $3, density}' > repeat_density.txt

# If repeats.bed not generated before use
# perl -lne 'if(/^(>.*)/){ $head=$1 } else { $fa{$head} .= $_ } END{ foreach $s (sort(keys(%fa))){ print "$s\n$fa{$s}\n" }}' reference.fna | perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]+)/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }'  > repeats.bed

# Get heterozygosity from the vcf file
bcftools view -g het -Ov output.vcf.gz > heterozygous.vcf
bedtools intersect -a bins.bed -b heterozygous.vcf -c > heterozygosity_counts.txt
awk '{bin_size = $3 - $2; density = $4 / bin_size; printf "%s\t%d\t%d\t%.6f\n", $1, $2, $3, density}' heterozygosity_counts.txt > heterozygosity_density.txt
```
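
Note that `mosdepth --by 100000` bins every scaffold in the BAM file, so coverage.txt should also contain rows for scaffolds outside the selected top 20. If you prefer the data files to contain only the plotted scaffolds, a filter along these lines can be used. This is an optional sketch: `filter_to_scaffolds` is a hypothetical helper and the output file name is a placeholder.

```shell
# Hypothetical helper (optional, not part of the pipeline):
# keep only rows whose first column is a scaffold listed in the .fai file.
filter_to_scaffolds() {
    local fai="$1" data="$2"
    # First file: remember scaffold names. Second file: print rows for remembered names.
    awk 'NR == FNR { keep[$1]; next } $1 in keep' "$fai" "$data"
}

# Example usage:
# filter_to_scaffolds top20_scaffolds.fai coverage.txt > coverage.top.txt
```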
Download the [**circos_editable.conf**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos_editable.conf) file and edit the paths to the software if it is installed somewhere other than /vol/storage/software. You may also change the colours of the different data tracks; for the available options, see https://circos.ca/documentation/tutorials/configuration/colors/lesson 

The circos.conf file has pre-set min and max values for the plots, but you can change them according to your data as described below. To do this, first extract the actual values from your processed files. **For coverage**, it’s often best not to use the absolute maximum, as extreme outliers can skew the scale and make the rest of the data appear flat. Instead, choose a reasonable max value close to the typical genome-wide coverage of your data, such as 50 or 60.
```
# Coverage min and max values
awk '{print $4}' coverage.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'

# GC content min and max values 
awk '{print $4}' gc_content.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'

# Heterozygosity min and max values
awk '{print $4}' heterozygosity_density.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'

# Repeat min and max values
awk '{print $4}' repeat_density.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'
```
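
To help pick a coverage maximum that is not dominated by outliers, you can look at the mean and an upper percentile of the binned coverage instead of the absolute maximum. This is a minimal sketch: `coverage_summary` is a hypothetical helper, and the choice of the 99th percentile in the usage comment is an assumption rather than a pipeline default.

```shell
# Hypothetical helper (not part of the pipeline):
# print the mean and a nearest-rank percentile of column 4 of a BED-like file.
# Arguments: file, percentile (default 99).
coverage_summary() {
    local file="$1" pct="${2:-99}"
    awk '{print $4}' "$file" | sort -n | awk -v pct="$pct" '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1
            idx = int((NR * pct + 99) / 100)   # ceiling of NR * pct / 100
            if (idx < 1) idx = 1
            printf "Mean: %.2f P%s: %s\n", sum / NR, pct, v[idx]
        }'
}

# Example usage:
# coverage_summary coverage.txt 99
```

A max value somewhere between the mean and the reported percentile is usually a sensible starting point, which you can then adjust after inspecting the plot.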
## Running circos_editable.conf

```
# Edit the path to the software accordingly
/vol/storage/software/circos-0.69-9/bin/circos -conf circos_editable.conf
```
To add a legend to the Circos plot, download the [legend](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/legend.png) and the [overlay.py](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/overlay.py) script.
```
# Run overlay.py script
python overlay.py
```
## Output
The final output is a PNG of the Circos plot. It provides a circular visualization of the genome’s top 20 longest scaffolds, displaying multiple genomic features as concentric tracks. It includes GC content (purple line) showing the proportion of G/C bases per 100 kb bin, coverage (red histogram) representing sequencing depth with a configurable max value for better visualization, heterozygosity density (green heatmap) indicating genetic variation levels, and repeat density (blue heatmap) highlighting repetitive elements. These tracks help summarize complex genomic data in an intuitive and visually appealing way.