# Genome Visualisation with Circos

This module generates a comprehensive circular visualisation of the genome’s structural and sequence features using Circos. We focus on the 20 longest scaffolds to provide an informative overview of key genomic attributes, including coverage, GC content, repeat density, and heterozygosity.

## What you need

Input data
- **Reference genome** (.fna) with index file (.fai)
- **BAM file** of mapped reads (for coverage calculation)
- **VCF file** for heterozygosity
- **Repeat annotation file** (repeats.bed). Generated during the pipeline, but instructions are also given below

### Installations

You will need to download the [**circos.conf**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos.conf) file for plotting and install the required software.

Circos software:
```
# Change to the directory where you want to install Circos
# E.g. /vol/storage/software
cd /vol/storage/software

# Get the latest circos download from https://circos.ca/software/download/
wget --no-check-certificate https://circos.ca/distribution/circos-0.69-9.tgz
tar -xvzf circos-0.69-9.tgz
```

For more instructions on installation visit https://circos.ca/software/installation/

```
# Install dependencies
# Install libraries
sudo apt-get install -y libgd-dev

# Install Perl GD module with Conda
conda install -c conda-forge perl-gd

# Install perl-Params-Validate module with Conda
conda install -c conda-forge perl-params-validate
```

Install Perl modules
```
# Enter CPAN shell
cpan

# Install modules
install Readonly
install Font::TTF::Font
install Math::Bezier
install Math::Round
install Config::General
install GD
install Set::IntSpan
install List::MoreUtils
install GD::Polyline
install Math::VecStat
install SVG
install Params::Validate
install Regexp::Common
install Text::Format
install Statistics::Basic

# To exit
exit
```

You will also need Bedtools and Bcftools, which you should have from before (otherwise check
[00.Installations](https://github.com/AureKylmanen/Swarmgenomics/blob/main/0.%20Installations.md) for instructions) and mosdepth.
```
conda install bioconda::mosdepth
```

## Running circos.bash

You will need to download and edit the [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, and download [**circos.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos.bash), [**circos.conf**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos.conf), [**legend**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/legend.png), and [**overlay.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/overlay.py).

For the circos plot, you can choose:
- the bin size
- the number of scaffolds visualised
- the naming method
  - ```LG``` for naming the scaffolds LG1, LG2, LG3 etc.
  - ```SCAFFOLD``` to keep the original scaffold names
  - ```PREFIX``` to select a preferred prefix such as chr, for chr1, chr2, chr3 (note this is based on order by size and may not reflect actual chromosome names)
  - ```CUSTOM``` to include your own file with custom naming
    - Specified with ```CHR_NAME_MAP=""```
- the colour of the tracks
  - Please note that only the default colours work with the provided legend, but you will also receive a version of the plot without a legend.

Please also ensure that the path to the circos software is correct.
```
# ============================
# CIRCOS PLOT
# ============================
BIN_SIZE=100000

# Number of plotted scaffolds
TOP_SCAFFOLDS=20

# Options: LG | SCAFFOLD | PREFIX | CUSTOM
CHR_NAMING_MODE=LG

# Used only if CHR_NAMING_MODE=PREFIX
CHR_PREFIX=chr

# Used only if CHR_NAMING_MODE=CUSTOM
# Example: CHR_NAME_MAP=/path/to/chromosome_map.tsv
CHR_NAME_MAP=""

# CIRCOS colours
__KARYO_COL__="e4daed"
__GC_COL__="purple"
__COV_COL__="red"
__HET_COL__="greens-3-seq"
__REPEAT_COL__="blues-3-seq"

CIRCOS_BIN="/vol/storage/software/circos-0.69-9/bin/circos"
CIRCOS_CONF_TEMPLATE="${SCRIPTS}/circos.conf"
CIRCOS_OVERLAY_SCRIPT="${SCRIPTS}/overlay.py"
CIRCOS_LEGEND_IMAGE="${SCRIPTS}/legend.png"
REPEATS_BED="${WORKING_DIR}/${SPECIES}/repeats.bed"

# Circos include paths
CIRCOS_COLORS_CONF="${TOOL_DIR}/circos-0.69-9/etc/colors.conf"
CIRCOS_BREWER_CONF="${TOOL_DIR}/circos-0.69-9/etc/brewer.all.conf"
CIRCOS_FONTS_CONF="${TOOL_DIR}/circos-0.69-9/etc/fonts.conf"
CIRCOS_IMAGE_CONF="${TOOL_DIR}/circos-0.69-9/etc/image.conf"
CIRCOS_HOUSEKEEPING_CONF="${TOOL_DIR}/circos-0.69-9/etc/housekeeping.conf"
```

The script takes the minimum and maximum values of GC content, coverage, heterozygosity, and repeat density from your data and uses them as the absolute min and max for plotting.

Remember to ```chmod +x circos.bash```. Then run with:
```
./circos.bash "species" "path/to/reference_genome.fna.gz" "params.txt"
```

The results will be copied into the results directory within the species directory.

If you wish to replot existing results, just run the script again. If you also need to regenerate the input files, remove the existing bins.bed, karyotype.txt, coverage.txt, gc_content.txt, repeat_density.txt, and heterozygosity_density.txt files before running the script again.

## Preparing data for plotting

In this section we prepare the data files used for plotting.
```
# Get top 20 longest scaffolds or a number of your choosing
sort -k2,2nr -k1,1 reference.fna.fai | head -n 20 > top20_scaffolds.fai

# Generate karyotype file for Circos (adjust colors as needed)
# This names each scaffold as LG1, LG2 etc.
# Change "e4daed" to any colour of your choosing
awk '{print "chr -", $1, "LG" NR, 0, $2, "e4daed"}' top20_scaffolds.fai > karyotype.txt

# If you want to keep scaffold names use:
# awk '{print "chr -", $1, $1, 0, $2, "e4daed"}' top20_scaffolds.fai > karyotype.txt

# Generate 100kb bins and sort them
bedtools makewindows -g top20_scaffolds.fai -w 100000 | sort -k1,1 -k2,2n > bins.bed

# Get coverage from the BAM file, e.g.:
mosdepth --by 100000 -t 4 sample bwa.sorted.bam
zcat sample.regions.bed.gz > coverage.txt

# Get GC content from the reference genome
bedtools nuc -fi reference.fna -bed bins.bed | awk 'NR>1 {print $1, $2, $3, $5}' > gc_content.txt

# Get the repeat content from the repeats.bed file which was generated earlier
bedtools intersect -a bins.bed -b repeats.bed -c | awk '{bin_size = $3 - $2; density = $4 / bin_size; printf "%s\t%d\t%d\t%.6f\n", $1, $2, $3, density}' > repeat_density.txt

# If repeats.bed was not generated before, use:
# perl -lne 'if(/^(>.*)/){ $head=$1 } else { $fa{$head} .= $_ } END{ foreach $s (sort(keys(%fa))){ print "$s\n$fa{$s}\n" }}' reference.fna | perl -lne 'if(/^>(\S+)/){ $n=$1} else { while(/([a-z]+)/g){ printf("%s\t%d\t%d\n",$n,pos($_)-length($1),pos($_)) } }' > repeats.bed

# Get heterozygosity from the vcf file
bcftools view -g het -Ov output.vcf.gz > heterozygous.vcf
bedtools intersect -a bins.bed -b heterozygous.vcf -c > heterozygosity_counts.txt
awk '{bin_size = $3 - $2; density = $4 / bin_size; printf "%s\t%d\t%d\t%.6f\n", $1, $2, $3, density}' heterozygosity_counts.txt > heterozygosity_density.txt
```

Download the [**circos_editable.conf**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/circos_editable.conf) file and edit the paths to the software if it is installed in a different
location than /vol/storage/software. You may also change the colours of the different data tracks; to view the options, check https://circos.ca/documentation/tutorials/configuration/colors/lesson

The circos.conf file has pre-set min and max values for the plots, but you can change them according to your data as guided below. To do this, first extract the actual values from your processed files.

**For coverage**, it’s often best not to use the absolute maximum, as extreme outliers can skew the scale and make the rest of the data appear flat. Instead, choose a reasonable max value, such as 50 or 60, in line with the typical genome-wide average coverage of your sample.

```
# Coverage min and max values
awk '{print $4}' coverage.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'

# GC content min and max values
awk '{print $4}' gc_content.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'

# Heterozygosity min and max values
awk '{print $4}' heterozygosity_density.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'

# Repeat min and max values
awk '{print $4}' repeat_density.txt | sort -n | awk 'NR==1 {min=$1} {max=$1} END {print "Min:", min, "Max:", max}'
```

## Running circos_editable.conf

```
# Edit the path to the software accordingly
/vol/storage/software/circos-0.69-9/bin/circos -conf circos_editable.conf
```

To add a legend to the circos plot, download the [legend](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/legend.png) and the [overlay.py](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/overlay.py) script:
```
# Run overlay.py script
python overlay.py
```

## Output

The final output is a PNG of the Circos plot. It provides a circular visualisation of the genome’s top 20 longest scaffolds, displaying multiple genomic features as concentric tracks.
It includes the following tracks:
- GC content (purple line), showing the proportion of G/C bases per 100 kb bin
- coverage (red histogram), representing sequencing depth with a configurable max value for better visualisation
- heterozygosity density (green heatmap), indicating genetic variation levels
- repeat density (blue heatmap), highlighting repetitive elements

Together, these tracks summarise complex genomic data in a single, intuitive figure.
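
The coverage track reads best when its scale is capped below the absolute maximum, as noted in the section on min and max values. One way to pick such a cap from the data is a high percentile of the per-bin depth. The following is only a minimal sketch: the helper name `percentile_col4` is hypothetical, the 99th percentile is an illustrative choice rather than a value used by the pipeline, and it assumes the depth sits in column 4 of coverage.txt as in the files prepared above.

```shell
# Hypothetical helper (not part of circos.bash): print the value found at
# a given percentile of column 4 of a whitespace-separated file.
# Usage: percentile_col4 FILE PERCENT
percentile_col4() {
    # Sort the rows numerically on column 4, then pick the row at the requested rank.
    sort -k4,4n "$1" | awk -v p="$2" '
        { v[NR] = $4 }
        END {
            if (NR == 0) exit 1            # empty input: nothing to report
            idx = int(NR * p / 100)        # rank of the requested percentile
            if (idx < 1) idx = 1
            print v[idx]
        }'
}

# Example: suggest a coverage maximum from the 99th percentile,
# run only if coverage.txt is present in the current directory.
if [ -f coverage.txt ]; then
    percentile_col4 coverage.txt 99
fi
```

The printed value can then be compared with the min and max reported earlier and, if it looks sensible for your sample, used as the coverage maximum when editing the configuration by hand.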