# NUMT Identification

NUMTs, or Nuclear Mitochondrial DNA segments, are fragments of mitochondrial DNA (mtDNA) that have been inserted into the nuclear genome. These sequences arise through the transfer of mtDNA to the nucleus and can be found in many eukaryotic organisms. 
The identification is done using the mitochondrial sequences identified in the mitochondrial reconstruction step and blasting them against the reference genome. This will create a fasta file with identified NUMTs.

## What you need
- **A reference nuclear genome**

- **A mitogenome** (assembled in the previous step with GetOrganelle, or downloaded from a database)

### Installations
Here we again use miniconda for installing R, Blast and Bedtools. You will also need to save the [**get_bed.R**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/get_bed2.R) script in your working directory.
```
# R 
conda create -n r_env r-essentials r-base
conda activate r_env

# Blast
conda install bioconda::blast

# Bedtools (you should already have this installed)
conda install bioconda::bedtools

# R installations
# Activate R
R

# Install packages
install.packages(c(
  "data.table",
  "dplyr",
  "stringr",
  "circlize"
))

install.packages("BiocManager")

BiocManager::install("Biostrings")
```

## Running numt_detection.bash
You will need to download and edit [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, and download [**numt_detection.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/numt_detection.bash) and [**numt_visual.04.R**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/numt_visual.04.R)

You may change any of the parameter's below in the params.txt to run the NUMT detection. 

```
# ============================
# NUMT DETECTION
# ============================
NUMT_BLAST_OUT="${WORKING_DIR}/${SPECIES}/${SPECIES}.blast.out"
NUMT_BLAST_CLEAN="${WORKING_DIR}/${SPECIES}/${SPECIES}.blast.clean.out"
NUMT_SEP_FA="${WORKING_DIR}/${SPECIES}/${SPECIES}.sep.fa"

# BLAST parameters for NUMT detection
NUMT_WORD_SIZE=20        # Word size for BLAST search (lower = more sensitive)
NUMT_EVALUE=1e-10        # E-value threshold (lower = more stringent)

# R script for NUMT visualization
NUMT_PLOT_SCRIPT="${SCRIPTS}/numt_visual.04.R"

```

Remember to ```chmod +x numt_detection.bash```. Then run with:
```
./numt_detection.bash "species" "path/to/reference_genome.fna.gz" "params.txt"
```
The results will be copied into the results directory within the species directory.

## Numt detection step-by-step
Make sure your get_bed.R sript has been saved in your working directory. You may also download the script [**runNUMTidentification2.3.cyc.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/runNUMTidentification2.3.cyc.py), edit the paths and run it.
```
# Initiate python
python

#Edit the following script and run it with python
#Example
import os,sys
import csv
import pandas as pd
import re

dir_work = '/vol/storage/swarmGenomics/golden_eagle/'
file_mito = '/vol/storage/swarmGenomics/golden_eagle/mt/ERR3316068_mitogenome/animal_mt.K115.scaffolds.graph1.1.path_sequence.fasta'
file_genome = '/vol/storage/swarmGenomics/golden_eagle/acreference.fna'
file_bedR = '/vol/storage/numt/get_bed2.R'
species = 'Aquila_Chrysaetos'

os.system('makeblastdb -in ' + file_mito + ' -parse_seqids -dbtype nucl')
os.system('makeblastdb -in ' + file_genome + ' -parse_seqids -dbtype nucl')

os.system('blastn -db '+ file_genome + ' -query '+ file_mito +' -outfmt 7 -word_size 20 -num_threads 1 -out ' + dir_work + species +'.blast.out')
os.system("grep -v '#' " + dir_work + species +".blast.out | cut -d '\t' -f2,3,4,9,10 > " + dir_work + species + ".blast.clean.out")

os.system('ls -1 ' + dir_work + species + '.blast.clean.out > ' + dir_work + 'files_list.txt')
os.system('cp ' + file_bedR + ' ' + dir_work)
os.system('Rscript get_bed2.R')
os.system('rm get_bed2.R')

os.system('bedtools getfasta -fi '+ file_genome +' -bed ' + dir_work + species + '.blast.clean.out_AllFrags.bed -fo ' + dir_work + species + '.sep.fa')

#Exit python
exit()
```
## Results

### Look up how many NUMTs
The output is a file with identified NUMTs called species_name.sep.fa (e.g. Aquila_Chrysaetos.sep.fa).
You can extract the number of NUMTs with grep:
```
#This counts the occurance of >, which is the start of each sequence
grep -o ">" /vol/storage/swarmGenomics/golden_eagle/numt/Aquila_Chrysaetos.sep.fa | wc -l
```

### Plot results
These steps will plot a circos plot with mitogenome and numts.

Prepare input file:
```
# Extract numts from .sep.fa file
# Example:
# grep '>' Ailuropoda_Melanoleuca.sep.fa  | sed -e 's/>//g' > numt.lst
grep '>' species.sep.fa  | sed -e 's/>//g' > numt.lst

# Extract numt locations
# Example:
# awk '!/^#/ {if ($10 > $9) {print $2 ":" $9 "-" $10, $7, $8} else if ($9 > $10) {print $2 ":" $10 "-" $9, $7, $8}}' Ailuropoda_Melanoleuca.blast.out> blast_chk.lst
awk '!/^#/ {if ($10 > $9) {print $2 ":" $9 "-" $10, $7, $8} else if ($9 > $10) {print $2 ":" $10 "-" $9, $7, $8}}' species.blast.out > blast_chk.lst

# sort and join outputs
sort numt.lst -o numt.lst
sort -u blast_chk.lst -o blast_chk.lst
join -1 1 -2 1 -o 1.1 2.2 2.3 numt.lst blast_chk.lst > mt_loc.lst
```

Download and edit the **numt_visual.03.R** script to match your input files and working directory, you will need the names of your *.path_sequence.fasta and *.csv files (e.g. animal_mt.K115.complete.graph1.1.path_sequence.fasta and extended_K115.assembly_graph.fastg.extend-animal_mt.csv

```
... 
library(Biostrings)
library(circlize)
library(stringr)

dir_work <- '/path/to/working/dir/' # change this
setwd(dir_work)

# mito genes annotations
mt_seq <- readDNAStringSet('*.path_sequence.fasta') # change *.path_sequence.fasta to e.g. animal_mt.K115.complete.graph1.1.path_sequence.fasta
len <- width(mt_seq)

mt_info <- read.table('*.csv', header = T, stringsAsFactors = F) # change *.csv to e.g. extended_K115.assembly_graph.fastg.extend-animal_mt.csv
mt_region <- mt_info$details

...
```
Run the script
```
Rscript numt_visual.03.R
```
## NUMT results
This script produces a circular visualization of NUMTs mapped onto the mitochondrial genome. The outer ring represents the mitochondrial genome, showing annotated protein-coding genes, rRNAs, and non-coding regions in different colors, with gene names labeled around the circle. The inner track displays NUMTs as horizontal bars, each positioned according to its BLAST alignment coordinates on the mitochondrial genome and stacked vertically.