# Genome features Understanding alignment quality and genome representation across chromosomes is essential in evaluating sequencing results. This analysis uses samtools idxstats output to visualize: - Chromosome length - Unmapped reads per megabase - Estimated coverage These summaries help identify uneven sequencing, poorly assembled scaffolds, or mapping issues that may impact downstream analyses like variant calling or population genomics. ## What you need The only input needed is **a BAM file**, either one you already have or one generated using the SwarmGenomics pipeline. ### Installations You only need samtools and R installed with the following packages: ```r install.packages(c("ggplot2", "dplyr", "gridExtra", "RColorBrewer", "patchwork")) ``` ## Running genome_features.bash Download [**genome_features.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/genome_features.bash), [**idxstats.R**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/idxstats.R) and [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, you need edit the params.txt to change the directories and you may change the number of scaffolds and the colours of your plots. ``` # ============================ # IDXSTATS PLOTTING # ============================ # idxstats.R location idxstats_PLOT="${SCRIPTS}/idxstats.R" # Number of scaffolds to plot IDXSTATS_TOP_SCAFFOLDS=20 # Color palette (RColorBrewer palette name) IDXSTATS_COLOR_PALETTE="Set2" # Optional: manually override colors (comma-separated hex or names) # Leave empty to use palette IDXSTATS_CUSTOM_COLORS="" ``` Remember to ```chmod +x genome_features.bash```. Then run with: ``` ./genome_features.bash "species" "params.txt" ``` The results will be copied into the results directory within the species directory. ## Running idxstats step-by-step Alternatively you may run the analysis step by step as instructed below. ``` # Perform analysis samtools idxstats bwa.sorted.bam > idxstats_output.txt ``` ``` # Estimate typical read length from the bam file # Take note of the length samtools view bwa.sorted.bam | head -n 1000 | \ awk '{if($10 != "*") print length($10)}' | \ sort | uniq -c | sort -nr | head -n1 | awk '{print $2}' ``` ``` # Convert idxstats output into csv format for easier plotting awk 'BEGIN {OFS=","; print "chrom,length,mapped,unmapped"} {print $1,$2,$3,$4}' \ idxstats_output.txt > idxstats_clean.csv ``` Download the script [**idxstats_editable.R**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/idxstats_editable.R) for plotting the results. The script plots 20 longest scaffolds. If you wish to change the number of plotted scaffolds, edit or remove this line within the script: ``` slice(1:20) %>% ``` To plot, use the read lenght obtained earlier. ``` # Change the read length according to your estimate Rscript idxstats_editable.R idxstats_clean.csv # For example Rscript idxstats_editable.R idxstats_clean.csv 150 ``` ## Output The script outputs a single PNG image showing the top 20 longest scaffolds with their lengths, unmapped reads per megabase, and estimated coverage, helping visualize mapping quality and assembly structure.