# Mitochondrial Genome Reconstruction The mitochondrial genome is a small, circular DNA found in mitochondria, distinct from the nuclear genome. It is maternally inherited and highly conserved, making it a valuable target for genetic studies. The mitochrondial genome can be used, for example, in species identification, population genetics, forensic identification and biomedical purposes. Here we will use the constructed mitogenome in the identification of NUMTs in the following section. ## What you need For mitogenome assembly, the only inputs required are **the paired-end FASTQ files** you created in the previous steps. FASTQ files are used because they contain the raw sequencing reads with quality scores, which GetOrganelle uses to assemble the mitochondrial genome. You can also convert BAM file back to FASTQ if needed. ### Installations For the identification we will use GetOrganelle (https://github.com/Kinggerm/GetOrganelle). ``` # GetOrganelle installation conda create -n getorganelle -c bioconda getorganelle conda activate getorganelle ``` ## Running mitogenome_assembly.bash You will need to download and edit [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, and download [**build_mt_datase.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/build_mt_database.bash), which downloads and builds a database, [**mitogenome.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/mitogenome.bash) for running all the analyses, and [**phylotree.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/phylotree.py) for plotting the phylogenetic tree. The following parameters in params.txt control how getorganelle and blast are run, these generally don't need changing, unless you analyse a different organelle genome or need to increase the getorganelle read rounds. ``` # ============================ # GETORGANELLE # ============================ GETORGANELLE_TYPE="animal_mt" # Type of organelle genome to assemble GETORGANELLE_READ_ROUNDS=10 # Number of iterative read-extraction rounds # ============================ # BLAST SETTINGS # ============================ N_BLAST_HITS=5 BLAST_MAX_TARGET_SEQS=10 BLAST_WORD_SIZE=11 # ============================ # ALIGNMENT AND TREE # ============================ MAFFT_OPTS="--auto" FASTTREE_OPTS="-nt" PHYLOTREE_SCRIPT="$SCRIPTS/phylotree.py" ``` Remember to ```chmod +x build_mt_database.bash mitogenome.bash```. Then first run: ``` ./build_mt_database.bash "species" "params.txt" ``` Run the actual analysis script with nohup + & to run it in the background as the analyses may take hours to run: ``` nohup ./mitogenome.bash "species" "params.txt" & ``` The results will be copied into the results directory within the species directory. ## Mitogenome assembly and phylogenetic tree visualisation step-by-step The input files you will use are the fastq files you created in the previous steps. Change the number of threads as necessary. ``` # Activate getorganelle conda activate getorganelle # Run get_organelle # Change -t 26 to alter the number of threads get_organelle_config.py -a animal_mt $ get_organelle_from_reads.py -1 DESTINATION_PATH/*_1_paired.fastq.gz \ -2 DESTINATION_PATH/*_2_paired.fastq.gz -F animal_mt \ -o DESTINATION_PATH/*_mitogenome -R 10 -t 26 # Example get_organelle_config.py -a animal_mt get_organelle_from_reads.py -1 /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_1_paired.fastq.gz \ -2 /vol/storage/swarmGenomics/golden_eagle/fastq/ERR3316068_2_paired.fastq.gz -F animal_mt \ -o /vol/storage/swarmGenomics/golden_eagle/ERR3316068_mitogenome -R 10 -t 26 ``` The key output files include: \ ``*.path_sequence.fasta``, each fasta file represents one type of genome structure \ ``*.fastg``, the organelle related assembly graph to report for improvement and debug \ ``*.selected_graph.gfa``, the organelle-only assembly graph \ ``get_org.log.txt``, the log file ## Phylogenetic Tree Visualisation This section is designed to help you place the generated mitogenome or a reference mitogenome into a phylogenetic context by comparing it to a reference database of other mitochondrial sequences. ### Installations You will need blastn, makeblastdb, seqkit, mafft and FastTree. And for Python Bio, toytree and toyplot. ``` # Installations sudo apt install ncbi-blast+ sudo apt install seqkit sudo apt install mafft sudo apt install fasttree pip install toytree toyplot biopython ``` ### Sequence Alignment and Phylogenetic Tree Construction First download [genome.fna.gz](https://tu-dortmund.sciebo.de/s/N9Ljnt2cbFfnbMD) and prepare the mitogenome database with makeblastdb. ``` # Unzip the genome.fna.gz gunzip genome.fna.gz # Make database makeblastdb -in genome.fna -dbtype nucl -out mito_db ``` The next step is to align the sequences and to generate the tree. Copy the full script from below, and edit the species name and the mitogenome fasta (in ``` sed 's/^>.*/>species_name_query/' mitogenome.fasta > mito.fasta ```) and run it in a directory where you have the created database and the mitogenome fasta file. ``` # Find the longest (most complete) mitogenome in the fasta file # Example: # seqkit sort -l -r animal_mt.K115.scaffolds.graph1.1.path_sequence.fasta | seqkit head -n 1 > mito_longest.fasta seqkit sort -l -r mitogenome.fasta | seqkit head -n 1 > mito_longest.fasta # Rename the query FASTA header to a species-specific label # This ensures the query is recognizable in the final tree (e.g. colored red) # Example: # sed 's/^>.*/>Ailuropoda_melanoleuca_query/' mito_longest.fasta > mito.fasta sed 's/^>.*/>Your_species_query/' mito_longest.fasta > mito.fasta # Run BLAST to find similar mitochondrial genomes blastn -query mito.fasta -db mito_db \ -outfmt "6 sseqid bitscore" \ -max_target_seqs 50 -num_threads 4 > blast_hits.tsv # Sort hits by score, keep top 10 unique hits sort -k2,2nr blast_hits.tsv | awk '!seen[$1]++' | head -n 10 | cut -f1 > top10.ids # Extract corresponding sequences from reference genome database seqkit grep -f top10.ids /vol/storage/genome.fna > top10_hits.fasta # Add the renamed query sequence to the FASTA file cat mito.fasta >> top10_hits.fasta # Align all sequences with MAFFT mafft --auto top10_hits.fasta > aligned.fasta # Build a maximum-likelihood tree with FastTree FastTree -nt aligned.fasta > tree.nwk ``` ### Plotting the Phylogenetic Tree Download or copy the [**phylotree_editable.py**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/phylotree_editable.py) script and run it to plot the tree, you may need to alter the size specifications to ensure the plot fits on the PNG. ``` python phylotree_editable.py ``` Or activate python and paste ``` import toytree import toyplot.png from Bio import SeqIO # 1. Make an ID -> full header dictionary for the database db_fasta = "genome.fna" id2header = {} for record in SeqIO.parse(db_fasta, "fasta"): # Map only the first word after ">" to the full description id2header[record.id] = record.description # 2. Add the query ID and header from mito.fasta query_fasta = "mito.fasta" for record in SeqIO.parse(query_fasta, "fasta"): query_id = record.id query_header = record.description id2header[query_id] = query_header # 3. Load the tree and get its tip labels (IDs only) tree = toytree.tree("tree.nwk") tip_ids = tree.get_tip_labels() # These are just the IDs # 4. For each tip, use the full header as the label and color the query tip red tip_labels = [id2header.get(tip, tip) for tip in tip_ids] colors = ["red" if tip == query_id else "black" for tip in tip_ids] canvas = tree.draw( width=1000, height=600, tip_labels=tip_labels, tip_labels_colors=colors, )[0] toyplot.png.render(canvas, "mito_tree.png") ```