# Mitochondrial Genome Reconstruction
The mitochondrial genome is a small, circular DNA found in mitochondria, distinct from the nuclear genome. It is maternally inherited and highly conserved, making it a valuable target for genetic studies.

The mitochondrial genome can be used, for example, in species identification, population genetics, forensic identification and biomedical research. Here we will use the assembled mitogenome to identify NUMTs (nuclear mitochondrial DNA segments) in the following section.

## What you need
For mitogenome assembly, the only inputs required are **the paired-end FASTQ files** you created in the previous steps. FASTQ files are used because they contain the raw sequencing reads with quality scores, which GetOrganelle uses to assemble the mitochondrial genome. If you only have a BAM file, you can convert it back to FASTQ.
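
One common way to recover paired reads from a BAM file is `samtools collate` piped into `samtools fastq`. The sketch below only builds the two commands and prints the resulting pipeline. It assumes samtools is installed, and the function name `bam_to_fastq_commands` and the file names are hypothetical. Note that the recovered reads are not trimmed, so the output names differ from the `*_paired.fastq.gz` files of the previous steps.

```python
import shlex

def bam_to_fastq_commands(bam, out_prefix):
    """Return the two samtools commands (collate, then fastq) as argument lists."""
    # collate groups read pairs together; -u = uncompressed output, -O = write to stdout
    collate = ["samtools", "collate", "-u", "-O", bam]
    # fastq reads the collated stream from stdin ("-");
    # -0 and -s send unpaired/other reads to /dev/null, -n keeps read names unchanged
    fastq = ["samtools", "fastq",
             "-1", f"{out_prefix}_1.fastq.gz",
             "-2", f"{out_prefix}_2.fastq.gz",
             "-0", "/dev/null", "-s", "/dev/null", "-n", "-"]
    return collate, fastq

# Hypothetical file names, for illustration only
collate, fastq = bam_to_fastq_commands("sample.bam", "sample")
print(shlex.join(collate), "|", shlex.join(fastq))
```

The printed line can be pasted into a shell to run the conversion.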

### Installations
For the assembly we will use GetOrganelle (https://github.com/Kinggerm/GetOrganelle).

```
# GetOrganelle installation
conda create -n getorganelle -c bioconda getorganelle
conda activate getorganelle
```
## Running mitogenome.bash
You will need to download and edit the [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, and download [**build_mt_database.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/build_mt_database.bash), which downloads and builds a database, [**mitogenome.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/mitogenome.bash) for running all the analyses, and [**phylotree.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/phylotree.py) for plotting the phylogenetic tree.

The following parameters in params.txt control how GetOrganelle and BLAST are run. They generally don't need changing unless you analyse a different organelle genome or need to increase the number of GetOrganelle read-extraction rounds.
```
# ============================
# GETORGANELLE
# ============================
GETORGANELLE_TYPE="animal_mt"  # Type of organelle genome to assemble
GETORGANELLE_READ_ROUNDS=10    # Number of iterative read-extraction rounds

# ============================
# BLAST SETTINGS
# ============================

N_BLAST_HITS=5
BLAST_MAX_TARGET_SEQS=10 
BLAST_WORD_SIZE=11

# ============================
# ALIGNMENT AND TREE
# ============================
MAFFT_OPTS="--auto" 
FASTTREE_OPTS="-nt" 

PHYLOTREE_SCRIPT="$SCRIPTS/phylotree.py" 
```
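Before launching a run that may take hours, it can help to sanity-check the edited parameters. The following is a minimal sketch, assuming params.txt contains only simple `KEY=value` shell assignments as in the excerpt above. The helper name `read_params` is hypothetical and is not part of the pipeline.

```python
import shlex

def read_params(text):
    """Parse simple KEY=value shell assignments into a dict of strings."""
    params = {}
    for line in text.splitlines():
        # shlex removes quotes and trailing "# ..." comments; $VARS are not expanded
        for token in shlex.split(line, comments=True):
            key, sep, value = token.partition("=")
            if sep and key.isidentifier():
                params[key] = value
    return params

example = '''
GETORGANELLE_TYPE="animal_mt"  # Type of organelle genome to assemble
GETORGANELLE_READ_ROUNDS=10    # Number of iterative read-extraction rounds
MAFFT_OPTS="--auto"
'''
params = read_params(example)
# Fail early on an obviously wrong value
assert int(params["GETORGANELLE_READ_ROUNDS"]) > 0, "read rounds must be positive"
print(params)
```

To check a real file, pass `open("params.txt").read()` to `read_params`.
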
Remember to make the scripts executable with ```chmod +x build_mt_database.bash mitogenome.bash```. Then build the database first:
```
./build_mt_database.bash "species" "params.txt"
```
The analyses may take hours, so run the actual analysis script in the background with nohup and &:

```
nohup ./mitogenome.bash "species" "params.txt" &
```
The results will be copied into the results directory within the species directory.

## Mitogenome assembly and phylogenetic tree visualisation step-by-step
The input files you will use are the fastq files you created in the previous steps. Change the number of threads as necessary.
```
# Activate getorganelle
conda activate getorganelle

# Download the animal_mt database (only needed once)
get_organelle_config.py -a animal_mt 

# Run get_organelle
# Change -t 26 to alter the number of threads
get_organelle_from_reads.py -1 DESTINATION_PATH/*_1_paired.fastq.gz \
   -2 DESTINATION_PATH/*_2_paired.fastq.gz -F animal_mt \
   -o DESTINATION_PATH/*_mitogenome -R 10 -t 26

# Example
get_organelle_config.py -a animal_mt 
get_organelle_from_reads.py -1 /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_1_paired.fastq.gz  \
   -2 /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_2_paired.fastq.gz -F animal_mt \
   -o /vol/storage/swarmGenomics/golden_eagle/ERR3316068_mitogenome -R 10 -t 26
```
The key output files include: \
``*.path_sequence.fasta``, each fasta file represents one type of genome structure \
``*.fastg``, the organelle related assembly graph to report for improvement and debug \
``*.selected_graph.gfa``, the organelle-only assembly graph \
``get_org.log.txt``, the log file 
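
A quick sanity check on the result is to list the length of each sequence in a `*.path_sequence.fasta` file. Vertebrate mitogenomes are typically around 16 to 17 kb, so a much shorter sequence suggests an incomplete assembly. The sketch below uses a hypothetical helper `fasta_lengths` and in-memory demo records rather than real GetOrganelle output.

```python
def fasta_lengths(lines):
    """Return a list of (header, sequence_length) tuples, one per FASTA record."""
    records = []
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: store the previous one first
            if header is not None:
                records.append((header, length))
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        records.append((header, length))
    return records

# Demo records, for illustration only
demo = [">path_1", "ACGTACGT", "ACGT", ">path_2", "ACGT"]
for name, n in fasta_lengths(demo):
    print(f"{name}\t{n} bp")
```

For a real file, pass an open file handle instead of the demo list, for example `fasta_lengths(open("mitogenome.fasta"))`.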

## Phylogenetic Tree Visualisation
This section is designed to help you place the generated mitogenome or a reference mitogenome into a phylogenetic context by comparing it to a reference database of other mitochondrial sequences.

### Installations
You will need blastn, makeblastdb, seqkit, mafft and FastTree.

For Python, you will need Biopython, toytree and toyplot.
```
# Installations
sudo apt install ncbi-blast+
sudo apt install seqkit
sudo apt install mafft
sudo apt install fasttree
pip install toytree toyplot biopython
```

### Sequence Alignment and Phylogenetic Tree Construction
First download [genome.fna.gz](https://tu-dortmund.sciebo.de/s/N9Ljnt2cbFfnbMD) and prepare the mitogenome database with makeblastdb.

```
# Unzip the genome.fna.gz
gunzip genome.fna.gz

# Make database
makeblastdb -in genome.fna -dbtype nucl -out mito_db
```
	
The next step is to align the sequences and to generate the tree.

Copy the full script below, edit the mitogenome FASTA file name (in the ``` seqkit sort ``` line) and the species name (in ``` sed 's/^>.*/>Your_species_query/' mito_longest.fasta > mito.fasta ```), and run it in the directory that contains the database you created and the mitogenome FASTA file.

```
# Find the longest (most complete) mitogenome in the fasta file
# Example:
# seqkit sort -l -r animal_mt.K115.scaffolds.graph1.1.path_sequence.fasta | seqkit head -n 1 > mito_longest.fasta
seqkit sort -l -r mitogenome.fasta | seqkit head -n 1 > mito_longest.fasta

# Rename the query FASTA header to a species-specific label
# This ensures the query is recognizable in the final tree (e.g. colored red)
# Example:
# sed 's/^>.*/>Ailuropoda_melanoleuca_query/' mito_longest.fasta > mito.fasta
sed 's/^>.*/>Your_species_query/' mito_longest.fasta > mito.fasta

# Run BLAST to find similar mitochondrial genomes
blastn -query mito.fasta -db mito_db \
  -outfmt "6 sseqid bitscore" \
  -max_target_seqs 50 -num_threads 4 > blast_hits.tsv

# Sort hits by score, keep top 10 unique hits
sort -k2,2nr blast_hits.tsv | awk '!seen[$1]++' | head -n 10 | cut -f1 > top10.ids

# Extract corresponding sequences from the reference FASTA used to build mito_db
# (adjust the path if genome.fna is stored elsewhere)
seqkit grep -f top10.ids genome.fna > top10_hits.fasta

# Add the renamed query sequence to the FASTA file
cat mito.fasta >> top10_hits.fasta

# Align all sequences with MAFFT
mafft --auto top10_hits.fasta > aligned.fasta

# Build an approximately-maximum-likelihood tree with FastTree (-nt = nucleotide alignment)
FastTree -nt aligned.fasta > tree.nwk
```
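The "sort hits by score, keep top 10 unique hits" step in the script above can be easier to follow in Python. This is a sketch of the same logic, assuming two tab-separated columns (`sseqid`, `bitscore`) as requested with `-outfmt`. The function name `top_unique_hits` and the demo rows are hypothetical.

```python
def top_unique_hits(rows, n=10):
    """Return up to n subject IDs, ranked by their best bit score."""
    best = {}
    for row in rows:
        row = row.strip()
        if not row:
            continue
        sseqid, bitscore = row.split("\t")[:2]
        score = float(bitscore)
        # A subject can appear several times (one row per HSP); keep its best score
        if sseqid not in best or score > best[sseqid]:
            best[sseqid] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [sid for sid, _ in ranked[:n]]

# Demo rows, for illustration only
rows = ["refA\t50", "refB\t200", "refA\t300", "refC\t100"]
print(top_unique_hits(rows, n=2))
```

For real output, pass the open `blast_hits.tsv` file handle as `rows`.
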
### Plotting the Phylogenetic Tree
Download or copy the [**phylotree_editable.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/phylotree_editable.py) script and run it to plot the tree. You may need to alter the size specifications to ensure the plot fits on the PNG.
```
python phylotree_editable.py
```
Alternatively, start a Python session and paste the following:
```
import toytree
import toyplot.png
from Bio import SeqIO

# 1. Make an ID -> full header dictionary for the database
db_fasta = "genome.fna"
id2header = {}
for record in SeqIO.parse(db_fasta, "fasta"):
    # Map only the first word after ">" to the full description
    id2header[record.id] = record.description

# 2. Add the query ID and header from mito.fasta
query_fasta = "mito.fasta"
for record in SeqIO.parse(query_fasta, "fasta"):
    query_id = record.id
    query_header = record.description
    id2header[query_id] = query_header

# 3. Load the tree and get its tip labels (IDs only)
tree = toytree.tree("tree.nwk")
tip_ids = tree.get_tip_labels()  # These are just the IDs

# 4. For each tip, use the full header as the label and color the query tip red
tip_labels = [id2header.get(tip, tip) for tip in tip_ids]
colors = ["red" if tip == query_id else "black" for tip in tip_ids]

canvas = tree.draw(
    width=1000,
    height=600,
    tip_labels=tip_labels,
    tip_labels_colors=colors,
)[0]

toyplot.png.render(canvas, "mito_tree.png")
```