# NUMT Identification NUMTs, or Nuclear Mitochondrial DNA segments, are fragments of mitochondrial DNA (mtDNA) that have been inserted into the nuclear genome. These sequences arise through the transfer of mtDNA to the nucleus and can be found in many eukaryotic organisms. The identification is done using the mitochondrial sequences identified in the mitochondrial reconstruction step and blasting them against the reference genome. This will create a fasta file with identified NUMTs. ## What you need - **A reference nuclear genome** - **A mitogenome** (assembled in the previous step with GetOrganelle, or downloaded from a database) ### Installations Here we again use miniconda for installing R, Blast and Bedtools. You will also need to save the [**get_bed.R**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/get_bed2.R) script in your working directory. ``` # R conda create -n r_env r-essentials r-base conda activate r_env # Blast conda install bioconda::blast # Bedtools (you should already have this installed) conda install bioconda::bedtools # R installations # Activate R R # Install packages install.packages(c( "data.table", "dplyr", "stringr", "circlize" )) install.packages("BiocManager") BiocManager::install("Biostrings") ``` ## Running numt_detection.bash You will need to download and edit [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, and download [**numt_detection.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/numt_detection.bash) and [**numt_visual.04.R**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/numt_visual.04.R) You may change any of the parameter's below in the params.txt to run the NUMT detection. ``` # ============================ # NUMT DETECTION # ============================ NUMT_BLAST_OUT="${WORKING_DIR}/${SPECIES}/${SPECIES}.blast.out" NUMT_BLAST_CLEAN="${WORKING_DIR}/${SPECIES}/${SPECIES}.blast.clean.out" NUMT_SEP_FA="${WORKING_DIR}/${SPECIES}/${SPECIES}.sep.fa" # BLAST parameters for NUMT detection NUMT_WORD_SIZE=20 # Word size for BLAST search (lower = more sensitive) NUMT_EVALUE=1e-10 # E-value threshold (lower = more stringent) # R script for NUMT visualization NUMT_PLOT_SCRIPT="${SCRIPTS}/numt_visual.04.R" ``` Remember to ```chmod +x numt_detection.bash```. Then run with: ``` ./numt_detection.bash "species" "path/to/reference_genome.fna.gz" "params.txt" ``` The results will be copied into the results directory within the species directory. ## Numt detection step-by-step Make sure your get_bed.R sript has been saved in your working directory. You may also download the script [**runNUMTidentification2.3.cyc.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/runNUMTidentification2.3.cyc.py), edit the paths and run it. ``` # Initiate python python #Edit the following script and run it with python #Example import os,sys import csv import pandas as pd import re dir_work = '/vol/storage/swarmGenomics/golden_eagle/' file_mito = '/vol/storage/swarmGenomics/golden_eagle/mt/ERR3316068_mitogenome/animal_mt.K115.scaffolds.graph1.1.path_sequence.fasta' file_genome = '/vol/storage/swarmGenomics/golden_eagle/acreference.fna' file_bedR = '/vol/storage/numt/get_bed2.R' species = 'Aquila_Chrysaetos' os.system('makeblastdb -in ' + file_mito + ' -parse_seqids -dbtype nucl') os.system('makeblastdb -in ' + file_genome + ' -parse_seqids -dbtype nucl') os.system('blastn -db '+ file_genome + ' -query '+ file_mito +' -outfmt 7 -word_size 20 -num_threads 1 -out ' + dir_work + species +'.blast.out') os.system("grep -v '#' " + dir_work + species +".blast.out | cut -d '\t' -f2,3,4,9,10 > " + dir_work + species + ".blast.clean.out") os.system('ls -1 ' + dir_work + species + '.blast.clean.out > ' + dir_work + 'files_list.txt') os.system('cp ' + file_bedR + ' ' + dir_work) os.system('Rscript get_bed2.R') os.system('rm get_bed2.R') os.system('bedtools getfasta -fi '+ file_genome +' -bed ' + dir_work + species + '.blast.clean.out_AllFrags.bed -fo ' + dir_work + species + '.sep.fa') #Exit python exit() ``` ## Results ### Look up how many NUMTs The output is a file with identified NUMTs called species_name.sep.fa (e.g. Aquila_Chrysaetos.sep.fa). You can extract the number of NUMTs with grep: ``` #This counts the occurance of >, which is the start of each sequence grep -o ">" /vol/storage/swarmGenomics/golden_eagle/numt/Aquila_Chrysaetos.sep.fa | wc -l ``` ### Plot results These steps will plot a circos plot with mitogenome and numts. Prepare input file: ``` # Extract numts from .sep.fa file # Example: # grep '>' Ailuropoda_Melanoleuca.sep.fa | sed -e 's/>//g' > numt.lst grep '>' species.sep.fa | sed -e 's/>//g' > numt.lst # Extract numt locations # Example: # awk '!/^#/ {if ($10 > $9) {print $2 ":" $9 "-" $10, $7, $8} else if ($9 > $10) {print $2 ":" $10 "-" $9, $7, $8}}' Ailuropoda_Melanoleuca.blast.out> blast_chk.lst awk '!/^#/ {if ($10 > $9) {print $2 ":" $9 "-" $10, $7, $8} else if ($9 > $10) {print $2 ":" $10 "-" $9, $7, $8}}' species.blast.out > blast_chk.lst # sort and join outputs sort numt.lst -o numt.lst sort -u blast_chk.lst -o blast_chk.lst join -1 1 -2 1 -o 1.1 2.2 2.3 numt.lst blast_chk.lst > mt_loc.lst ``` Download and edit the **numt_visual.03.R** script to match your input files and working directory, you will need the names of your *.path_sequence.fasta and *.csv files (e.g. animal_mt.K115.complete.graph1.1.path_sequence.fasta and extended_K115.assembly_graph.fastg.extend-animal_mt.csv ``` ... library(Biostrings) library(circlize) library(stringr) dir_work <- '/path/to/working/dir/' # change this setwd(dir_work) # mito genes annotations mt_seq <- readDNAStringSet('*.path_sequence.fasta') # change *.path_sequence.fasta to e.g. animal_mt.K115.complete.graph1.1.path_sequence.fasta len <- width(mt_seq) mt_info <- read.table('*.csv', header = T, stringsAsFactors = F) # change *.csv to e.g. extended_K115.assembly_graph.fastg.extend-animal_mt.csv mt_region <- mt_info$details ... ``` Run the script ``` Rscript numt_visual.03.R ``` ## NUMT results This script produces a circular visualization of NUMTs mapped onto the mitochondrial genome. The outer ring represents the mitochondrial genome, showing annotated protein-coding genes, rRNAs, and non-coding regions in different colors, with gene names labeled around the circle. The inner track displays NUMTs as horizontal bars, each positioned according to its BLAST alignment coordinates on the mitochondrial genome and stacked vertically.