--- name: "prokka-genome-annotation" description: "Annotate prokaryotic genomes (bacteria, archaea, viruses) via Prokka's BLAST/HMM pipeline. Identifies CDS, rRNA, tRNA, tmRNA, signal peptides against Pfam, TIGRFAMs, RefSeq. Outputs GFF3, GenBank, FASTA, TSV. Use PGAP for NCBI GenBank submission; Bakta for faster NCBI-compatible annotation." license: "GPL-3.0" --- # Prokka Genome Annotation ## Overview Prokka is a command-line pipeline for rapid annotation of prokaryotic genomes (bacteria, archaea, and viruses). It uses a tiered search strategy: protein-coding genes (CDS) are predicted with Prodigal and searched first against a genus-specific database, then RefSeq proteins, then Pfam/TIGRFAMs HMMs. Non-coding RNA genes (rRNA, tRNA, tmRNA) are identified with Barrnap, Aragorn, and Infernal. Prokka processes a single FASTA assembly in minutes and outputs a comprehensive annotation in GFF3, GenBank, FASTA, and tabular formats. ## When to Use - Annotating a newly assembled bacterial or archaeal genome from Illumina, PacBio, or Nanopore assemblies - Getting functional protein annotations (CDS with product names, EC numbers, GO terms) from a draft or complete genome - Preparing annotation files for downstream comparative genomics (Roary pan-genome, OrthoFinder) - Annotating viral or phage genomes when kingdom-specific databases are important - Performing metagenome-assembled genome (MAG) annotation with the `--metagenome` flag - Parsing annotated outputs in Python with BioPython for downstream sequence or feature analysis - Use **PGAP** (NCBI Prokaryotic Genome Annotation Pipeline) instead when the goal is NCBI GenBank submission with standards compliance - Use **Bakta** instead for faster annotation with built-in NCBI-compatible outputs and a more regularly updated database ## Prerequisites - **Software**: Prokka ≥ 1.14, Perl 5, Prodigal, Barrnap, HMMER3, BLAST+, Aragorn, Infernal, tbl2asn - **Python packages** (for output parsing): `biopython`, `pandas`, `matplotlib` - **Input**: assembled genome in FASTA format (complete or draft with multiple contigs) - **Environment**: conda strongly recommended to handle the Perl and C dependency stack > **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v prokka` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run prokka` rather than bare `prokka`. ```bash # Install Prokka via conda/mamba (recommended) conda install -c conda-forge -c bioconda prokka # Or with mamba (faster) mamba install -c conda-forge -c bioconda prokka # Verify installation and database setup prokka --version # prokka 1.14.6 # Check that required tools are on PATH prokka --depends # prokka needs: awk, sed, grep, makeblastdb, blastp, hmmscan, ... # Install Python parsing dependencies pip install biopython pandas matplotlib ``` ## Quick Start ```bash # Annotate a bacterial genome assembly — results in results/ directory prokka genome.fasta \ --outdir results/ \ --prefix sample1 \ --kingdom Bacteria \ --cpus 4 # Check output summary cat results/sample1.txt # Organism: Genus species strain # Contigs: 1 # Bases: 4639675 # CDS: 4140 # rRNA: 22 # tRNA: 86 echo "Annotation complete. Key output files:" ls results/sample1.{gff,gbk,faa,ffn,tsv} ``` ## Workflow ### Step 1: Install and Verify Prokka Install Prokka and confirm all dependent tools are accessible in the current environment. ```bash # Create a dedicated conda environment conda create -n prokka_env -c conda-forge -c bioconda prokka python=3.10 -y conda activate prokka_env # Verify Prokka version and all tool dependencies prokka --version # prokka 1.14.6 prokka --depends # Checking that required tools are installed... # OK: makeblastdb is installed (2.13.0+) # OK: blastp is installed (2.13.0+) # OK: hmmscan is installed (3.3.2) # OK: prodigal is installed (2.6.3) # OK: barrnap is installed (0.9) # Check available genus-specific databases bundled with Prokka ls $(conda info --base)/envs/prokka_env/db/genus/ # Archaea Bacteria Mitochondria Viruses # Install Python parsing tools pip install biopython pandas matplotlib ``` ### Step 2: Prepare the Input Genome Clean and rename contigs to comply with Prokka's header requirements before annotation. ```python from Bio import SeqIO import re # Load and inspect assembly input_fasta = "genome.fasta" records = list(SeqIO.parse(input_fasta, "fasta")) print(f"Input assembly: {len(records)} contigs") total_bases = sum(len(r) for r in records) print(f"Total bases: {total_bases:,}") print(f"Largest contig: {max(len(r) for r in records):,} bp") print(f"N50 approx: see assembly stats tool") # Rename contigs to short IDs compatible with Prokka (max 37 chars) # Prokka requires: no spaces, no special characters in header cleaned = [] for i, rec in enumerate(records, 1): new_id = f"contig_{i:04d}" new_rec = rec.__class__(rec.seq, id=new_id, description=f"len={len(rec.seq)}") cleaned.append(new_rec) SeqIO.write(cleaned, "genome_clean.fasta", "fasta") print(f"\nWrote genome_clean.fasta with {len(cleaned)} renamed contigs") # genome_clean.fasta: contig_0001 through contig_NNNN ``` ```bash # Alternatively, clean headers with a simple bash one-liner awk '/^>/{print ">contig_" ++i; next}{print}' genome.fasta > genome_clean.fasta # Filter out short contigs (< 200 bp) to reduce annotation noise awk '/^>/{header=$0; next} length($0) >= 200 {print header; print}' \ genome_clean.fasta > genome_filtered.fasta echo "Filtered assembly ready: $(grep -c '>' genome_filtered.fasta) contigs" ``` ### Step 3: Run Basic Prokka Annotation Run Prokka with standard options for a bacterial genome, specifying genus/species for database selection. ```bash # Basic annotation with genus/species hint (uses genus-specific protein database first) prokka genome_clean.fasta \ --outdir annotation/ \ --prefix E_coli_K12 \ --kingdom Bacteria \ --genus Escherichia \ --species coli \ --strain K12 \ --cpus 8 \ --mincontiglen 200 # Expected runtime: 2–10 minutes for a typical 4–6 Mb bacterial genome echo "Prokka annotation output files:" ls annotation/ # E_coli_K12.err E_coli_K12.faa E_coli_K12.ffn # E_coli_K12.fna E_coli_K12.gbk E_coli_K12.gff # E_coli_K12.log E_coli_K12.sqn E_coli_K12.tbl # E_coli_K12.tsv E_coli_K12.txt ``` ### Step 4: Parse Annotation Summary (TSV) Load the TSV output for a quick overview of annotated features and their functional assignments. ```python import pandas as pd # Load the annotation TSV (tab-delimited feature table) tsv_file = "annotation/E_coli_K12.tsv" df = pd.read_csv(tsv_file, sep="\t") print(f"Total features: {len(df)}") print(f"Columns: {list(df.columns)}") # Columns: [locus_tag, ftype, length_bp, gene, EC_number, COG, product] # Feature type summary print("\nFeature type counts:") print(df["ftype"].value_counts().to_string()) # CDS 4140 # tRNA 86 # rRNA 22 # tmRNA 1 # Functional gene annotations (non-hypothetical CDS) cds_df = df[df["ftype"] == "CDS"].copy() hypothetical = cds_df["product"].str.contains("hypothetical", case=False, na=True) print(f"\nCDS with known function: {(~hypothetical).sum()}") print(f"Hypothetical proteins: {hypothetical.sum()}") # Genes with EC numbers (enzymes) ec_annotated = cds_df[cds_df["EC_number"].notna() & (cds_df["EC_number"] != "")] print(f"CDS with EC numbers: {len(ec_annotated)}") print(ec_annotated[["locus_tag", "gene", "EC_number", "product"]].head(5).to_string(index=False)) ``` ### Step 5: Parse GenBank Output with BioPython Read the GenBank file to access per-gene sequences, qualifiers, and feature coordinates. ```python from Bio import SeqIO import pandas as pd # Parse GenBank file gbk_file = "annotation/E_coli_K12.gbk" records = list(SeqIO.parse(gbk_file, "genbank")) print(f"Contigs in GenBank: {len(records)}") # Iterate over CDS features and extract details rows = [] for rec in records: for feat in rec.features: if feat.type != "CDS": continue qualifiers = feat.qualifiers rows.append({ "contig": rec.id, "locus_tag": qualifiers.get("locus_tag", ["?"])[0], "gene": qualifiers.get("gene", [""])[0], "product": qualifiers.get("product", ["hypothetical protein"])[0], "EC_number": qualifiers.get("EC_number", [""])[0], "protein_id": qualifiers.get("protein_id", [""])[0], "start": int(feat.location.start), "end": int(feat.location.end), "strand": feat.location.strand, "aa_length": len(qualifiers.get("translation", [""])[0]), }) features_df = pd.DataFrame(rows) print(f"CDS features extracted: {len(features_df)}") print(features_df.head(3).to_string(index=False)) # Retrieve protein sequence for a specific gene gene_name = "dnaA" gene_feat = features_df[features_df["gene"] == gene_name] if not gene_feat.empty: idx = gene_feat.index[0] print(f"\n{gene_name}: {features_df.loc[idx, 'aa_length']} aa") ``` ### Step 6: Visualize Annotation Statistics Generate a summary barplot of feature types and functional annotation coverage. ```python import pandas as pd import matplotlib.pyplot as plt import matplotlib.patches as mpatches tsv_file = "annotation/E_coli_K12.tsv" df = pd.read_csv(tsv_file, sep="\t") cds = df[df["ftype"] == "CDS"].copy() n_cds = len(cds) n_known = (~cds["product"].str.contains("hypothetical", case=False, na=True)).sum() n_hypo = n_cds - n_known n_ec = cds["EC_number"].notna().sum() n_rrna = (df["ftype"] == "rRNA").sum() n_trna = (df["ftype"] == "tRNA").sum() fig, axes = plt.subplots(1, 2, figsize=(11, 4)) # Left panel: feature type counts types = df["ftype"].value_counts() colors_left = ["#2980B9", "#27AE60", "#E74C3C", "#F39C12"] + ["#95A5A6"] * len(types) axes[0].bar(types.index, types.values, color=colors_left[:len(types)], edgecolor="white") axes[0].set_title("Annotated Feature Counts") axes[0].set_xlabel("Feature Type") axes[0].set_ylabel("Count") for i, (label, val) in enumerate(types.items()): axes[0].text(i, val + 5, str(val), ha="center", fontsize=9) # Right panel: CDS functional annotation breakdown labels = ["Known function", "Hypothetical protein", "Enzyme (EC number)"] sizes = [n_known, n_hypo, n_ec] colors_right = ["#2980B9", "#BDC3C7", "#E74C3C"] bars = axes[1].bar(labels, sizes, color=colors_right, edgecolor="white") axes[1].set_title(f"CDS Functional Annotation\n(n={n_cds} total CDS)") axes[1].set_ylabel("Count") axes[1].tick_params(axis="x", rotation=15) for bar, val in zip(bars, sizes): axes[1].text(bar.get_x() + bar.get_width() / 2, val + 10, str(val), ha="center", fontsize=9) plt.suptitle("Prokka Genome Annotation Summary", fontsize=12, fontweight="bold") plt.tight_layout() plt.savefig("prokka_annotation_summary.png", dpi=150, bbox_inches="tight") print(f"Saved prokka_annotation_summary.png") print(f"CDS: {n_cds} | Known function: {n_known} | Hypothetical: {n_hypo}") print(f"rRNA: {n_rrna} | tRNA: {n_trna} | EC-annotated CDS: {n_ec}") ``` ### Step 7: Batch Annotation Across Multiple Genomes Annotate multiple genome assemblies sequentially and collect summary statistics. ```bash #!/bin/bash # batch_prokka.sh — annotate all FASTA files in a directory INPUT_DIR="genomes/" OUTPUT_DIR="annotations/" mkdir -p "$OUTPUT_DIR" for FASTA in "$INPUT_DIR"/*.fasta; do SAMPLE=$(basename "$FASTA" .fasta) echo "Annotating: $SAMPLE" prokka "$FASTA" \ --outdir "${OUTPUT_DIR}/${SAMPLE}" \ --prefix "$SAMPLE" \ --kingdom Bacteria \ --cpus 4 \ --mincontiglen 200 \ --quiet echo " Done: ${OUTPUT_DIR}/${SAMPLE}/${SAMPLE}.txt" done echo "Batch annotation complete." ``` ```python # Collect summary statistics from all Prokka .txt files from pathlib import Path import pandas as pd annotation_dir = Path("annotations/") rows = [] for txt_file in sorted(annotation_dir.glob("*/*.txt")): sample = txt_file.stem stats = {} with open(txt_file) as f: for line in f: line = line.strip() if ": " in line: key, val = line.split(": ", 1) stats[key.strip()] = val.strip() rows.append({ "sample": sample, "contigs": int(stats.get("Contigs", 0)), "bases": int(stats.get("Bases", 0)), "CDS": int(stats.get("CDS", 0)), "rRNA": int(stats.get("rRNA", 0)), "tRNA": int(stats.get("tRNA", 0)), }) summary_df = pd.DataFrame(rows) print(f"Annotated genomes: {len(summary_df)}") print(summary_df.to_string(index=False)) summary_df.to_csv("batch_annotation_summary.csv", index=False) print("\nSaved: batch_annotation_summary.csv") ``` ### Step 8: Compare Annotations Between Strains Use the protein FASTA outputs for cross-strain comparison with identity-based clustering. ```python from Bio import SeqIO, pairwise2 import pandas as pd # Load protein sequences from two strains def load_proteins(faa_file): """Return dict of locus_tag -> protein sequence.""" return {rec.id: str(rec.seq) for rec in SeqIO.parse(faa_file, "fasta")} strain_a = load_proteins("annotation_A/strain_A.faa") strain_b = load_proteins("annotation_B/strain_B.faa") print(f"Strain A proteins: {len(strain_a)}") print(f"Strain B proteins: {len(strain_b)}") # Compare gene counts and size distributions import matplotlib.pyplot as plt import numpy as np len_a = [len(seq) for seq in strain_a.values()] len_b = [len(seq) for seq in strain_b.values()] fig, ax = plt.subplots(figsize=(8, 4)) bins = np.linspace(0, 1500, 50) ax.hist(len_a, bins=bins, alpha=0.6, label=f"Strain A (n={len(len_a)})", color="#2980B9") ax.hist(len_b, bins=bins, alpha=0.6, label=f"Strain B (n={len(len_b)})", color="#E74C3C") ax.set_xlabel("Protein Length (aa)") ax.set_ylabel("Count") ax.set_title("Protein Length Distribution by Strain") ax.legend() plt.tight_layout() plt.savefig("strain_protein_length_comparison.png", dpi=150, bbox_inches="tight") print("Saved strain_protein_length_comparison.png") # Find proteins unique to each strain by size/count difference print(f"\nSize difference: {abs(len(strain_a) - len(strain_b))} proteins") print("→ Use Roary or OrthoFinder for formal pan-genome analysis") ``` ## Key Parameters | Parameter | Default | Range / Options | Effect | |-----------|---------|-----------------|--------| | `--kingdom` | `Bacteria` | `Bacteria`, `Archaea`, `Viruses`, `Mitochondria` | Selects gene prediction model and default databases | | `--genus` | — | any genus name string | Prioritizes genus-specific protein database for CDS annotation | | `--species` | — | any species name string | Combined with `--genus` for locus_tag prefix and organism metadata | | `--strain` | — | any string | Added to organism metadata in GenBank output | | `--proteins` | — | path to FASTA file | Custom protein database prepended before RefSeq search | | `--hmms` | — | path to HMM file | Additional custom HMM database for specialized annotation | | `--evalue` | `1e-6` | `1e-4`–`1e-9` | E-value cutoff for BLAST and HMMER hits | | `--cpus` | `8` | `1`–CPU count | Parallel processes for BLAST and HMMER searches | | `--mincontiglen` | `1` | any integer | Skip contigs shorter than this length (bp) | | `--metagenome` | off | flag | Disables Prodigal training on this genome (for MAGs) | | `--rfam` | off | flag | Enable Infernal rRNA/ncRNA search against Rfam (slower) | | `--norrna` | off | flag | Skip rRNA prediction (use when assembly has no rRNA genes) | | `--notrna` | off | flag | Skip tRNA prediction | ## Common Recipes ### Recipe: Annotation with Custom Protein Database When to use: You have closely related reference proteins (e.g., characterized isolate) to improve annotation accuracy. ```bash # Use a custom protein database to enhance annotation of a novel strain # Custom proteins are searched first, before internal Prokka databases prokka genome.fasta \ --proteins reference_proteins.faa \ --outdir custom_annotation/ \ --prefix novel_strain \ --kingdom Bacteria \ --cpus 4 echo "Custom DB annotation complete:" grep "CDS" custom_annotation/novel_strain.txt ``` ### Recipe: Metagenome-Assembled Genome (MAG) Annotation When to use: Annotating a MAG where Prodigal cannot train on the full genome sequence. ```bash # --metagenome disables Prodigal model training (uses meta mode) # --mincontiglen 500 discards short, potentially chimeric contigs prokka MAG_bin_42.fasta \ --metagenome \ --kingdom Bacteria \ --outdir mag_annotation/ \ --prefix MAG_bin_42 \ --mincontiglen 500 \ --cpus 4 echo "MAG annotation complete:" cat mag_annotation/MAG_bin_42.txt ``` ### Recipe: Extract Specific Gene Sequences When to use: Retrieve nucleotide or protein sequences for a target gene or pathway from the annotation. ```python from Bio import SeqIO # Extract all sequences for a specific gene name from protein FASTA faa_file = "annotation/E_coli_K12.faa" target_gene = "dnaA" matches = [] for rec in SeqIO.parse(faa_file, "fasta"): # Prokka FASTA header: >locus_tag gene product if target_gene in rec.description: matches.append(rec) print(f"Proteins matching '{target_gene}': {len(matches)}") for rec in matches: print(f" {rec.id}: {len(rec.seq)} aa — {rec.description}") # Save matching sequences if matches: SeqIO.write(matches, f"{target_gene}_proteins.faa", "fasta") print(f"Saved: {target_gene}_proteins.faa") ``` ### Recipe: Convert GFF to Pandas DataFrame When to use: Work with genomic coordinates for feature overlap analysis or visualization. ```python import pandas as pd def parse_gff(gff_file): """Parse Prokka GFF3 file into a DataFrame (skips sequence section).""" rows = [] with open(gff_file) as f: for line in f: if line.startswith("##FASTA"): break if line.startswith("#") or not line.strip(): continue parts = line.rstrip("\n").split("\t") if len(parts) < 9: continue attr_dict = {} for item in parts[8].split(";"): if "=" in item: k, v = item.split("=", 1) attr_dict[k] = v rows.append({ "seqname": parts[0], "source": parts[1], "feature": parts[2], "start": int(parts[3]), "end": int(parts[4]), "score": parts[5], "strand": parts[6], "frame": parts[7], "locus_tag": attr_dict.get("ID", ""), "gene": attr_dict.get("gene", ""), "product": attr_dict.get("product", ""), }) return pd.DataFrame(rows) gff_df = parse_gff("annotation/E_coli_K12.gff") print(f"GFF features: {len(gff_df)}") print(gff_df[gff_df["feature"] == "CDS"].head(5)[ ["seqname", "start", "end", "strand", "gene", "product"] ].to_string(index=False)) ``` ## Expected Outputs | Output File | Format | Description | |-------------|--------|-------------| | `{prefix}.gff` | GFF3 | Genome annotation with feature coordinates and attributes; includes FASTA sequence at the end | | `{prefix}.gbk` | GenBank | Full GenBank-format annotation for use in Geneious, BioPython, and submission prep | | `{prefix}.faa` | FASTA | Predicted protein sequences for all CDS features | | `{prefix}.ffn` | FASTA | Nucleotide sequences for all annotated features (CDS, rRNA, tRNA) | | `{prefix}.fna` | FASTA | Nucleotide FASTA of the complete assembly (contigs) | | `{prefix}.tsv` | TSV | Tab-delimited summary table: locus_tag, ftype, length_bp, gene, EC_number, COG, product | | `{prefix}.txt` | Text | One-line summary counts: Contigs, Bases, CDS, rRNA, tRNA, tmRNA | | `{prefix}.tbl` | TBL | Feature table format for tbl2asn GenBank submission | | `{prefix}.sqn` | SQN | ASN.1 format for NCBI submission (generated by tbl2asn) | | `{prefix}.err` | Text | Warnings and errors from tbl2asn validation step | | `{prefix}.log` | Text | Full Prokka run log with timing per step | ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `FATAL: Can't find any tRNA genes` | No tRNA detected on short or highly fragmented assembly | Add `--notrna` flag; check if assembly is too fragmented (N50 < 1 kb) | | `Can't exec "makeblastdb"` | BLAST+ not on PATH | `conda install -c bioconda blast`; ensure prokka env is activated | | `Contig ID too long (>37 chars)` | Assembler produced long contig headers | Pre-process FASTA to shorten headers: `awk '/^>/{print ">contig_"++i; next}{print}'` | | Very high hypothetical protein rate (>60%) | Divergent organism with few database matches | Add `--proteins` with closely related strain FAA; consider `--genus` flag | | `ERROR: Argument --proteins: file does not exist` | Path to custom protein file is incorrect | Use absolute path; verify file exists with `ls -la custom.faa` | | `tbl2asn error: multiple /product` | Duplicate product qualifiers in annotation | Ignore if exporting for local use; for NCBI submission, use PGAP instead | | Annotation is very slow (>30 min for ~5 Mb) | `--cpus` not set or set to 1; `--rfam` enabled | Set `--cpus` to available thread count; disable `--rfam` for faster runs | | rRNA count is 0 for complete genome | `--norrna` flag was set, or barrnap threshold too strict | Remove `--norrna`; check barrnap is installed with `barrnap --version` | ## References - [Prokka GitHub: tseemann/prokka](https://github.com/tseemann/prokka) — source code, installation instructions, and detailed parameter reference - [Seemann T (2014) Bioinformatics 30(14):2068–2069](https://doi.org/10.1093/bioinformatics/btu153) — original Prokka paper with benchmarks vs. RAST - [Prodigal documentation](https://github.com/hyattpd/Prodigal/wiki) — gene prediction engine used by Prokka for CDS calling - [Barrnap GitHub: tseemann/barrnap](https://github.com/tseemann/barrnap) — rRNA prediction tool bundled with Prokka - [BioPython GenBank parsing tutorial](https://biopython.org/wiki/SeqIO) — reference for parsing Prokka GenBank output in Python