---
name: "plannotate-plasmid-annotation"
description: "Auto-annotate plasmids with features (promoters, terminators, resistance, origins, tags, fluorescent proteins) via BLAST against curated DBs (Addgene, fpbase, SnapGene). FASTA or raw sequence in; annotated GenBank, interactive HTML maps, CSV tables out. Handles circular topology. Use to verify synthetic constructs, prep Addgene submissions, share maps, or batch-annotate cloning libraries."
license: "GPL-3.0"
---

# pLannotate Plasmid Annotation

## Overview

pLannotate annotates plasmid sequences by running BLAST searches against a curated library of over 5,000 features sourced from Addgene, NCBI, and fpbase. It identifies promoters, terminators, antibiotic resistance genes, origins of replication, tags, and fluorescent proteins while correctly handling circular plasmid topology — avoiding split-feature artifacts that arise from naive linear alignment. Results are written as annotated GenBank files for downstream use in SnapGene, Benchling, or BioPython, as interactive HTML plasmid maps for sharing and review, and as CSV tables for programmatic filtering. Both a Python API and a command-line interface are provided; a Streamlit web app is also bundled for exploratory use.

## When to Use

- Annotating a plasmid sequence received from a collaborator or downloaded from Addgene with no accompanying map
- Verifying that all expected elements (promoter, insert, resistance marker, origin) are present after assembly or mutagenesis
- Preparing a GenBank submission or Addgene deposit that requires a complete feature table
- Batch-annotating a library of synthetic constructs produced by combinatorial cloning
- Generating a shareable interactive plasmid map (HTML) without requiring SnapGene or Benchling licenses
- Checking a de-novo synthesized gene block for unintended regulatory elements or cryptic ORFs before cloning
- Use **SnapGene** or **Benchling** instead when you need a full-featured GUI plasmid editor with primer design and cloning simulation workflows; pLannotate is best for automated, scriptable annotation
- Use **Prokka** instead when annotating a complete bacterial genome or a large linear chromosomal sequence; pLannotate is optimized for plasmid-sized sequences up to ~50 kb

## Prerequisites

- **Python packages**: `plannotate`, `biopython` (optional, for GenBank parsing)
- **System dependency**: BLAST+ must be available on PATH (installed automatically via conda; manual install needed for pip)
- **Input**: Plasmid sequence in FASTA format or as a plain Python string
- **Data requirements**: Sequences typically 1–20 kb; very large plasmids (>50 kb) may be slow

```bash
# Install via pip (requires BLAST+ on PATH)
pip install plannotate

# Install via conda (recommended — handles BLAST+ automatically)
conda install -c conda-forge -c bioconda plannotate

# Verify installation
plannotate --help
python -c "import plannotate; print('plannotate OK')"
```

## Quick Start

```python
from plannotate import annotate, write_genbank, create_bokeh_chart
from Bio import SeqIO

# Load plasmid from FASTA
record = next(SeqIO.parse("plasmid.fasta", "fasta"))
sequence = str(record.seq)

# Annotate (circular, against Addgene database)
results = annotate(sequence, linear=False, db="addgene")
print(f"Found {len(results)} features")
print(results[["Feature", "Feature_type", "pct_identity", "pct_query_cov"]].to_string())

# Export GenBank file
write_genbank(sequence, results, output_file="plasmid_annotated.gb")

# Generate interactive HTML map
create_bokeh_chart(sequence, results, output_file="plasmid_map.html")
print("Outputs: plasmid_annotated.gb, plasmid_map.html")
```

## Workflow

### Step 1: Load Plasmid Sequence

Load the plasmid sequence from a FASTA file, a GenBank file (stripping existing annotations for re-annotation), or a raw sequence string. Validate length and base composition before annotation.

```python
from Bio import SeqIO
import os

# Option A: Load from FASTA
def load_fasta(path):
    record = next(SeqIO.parse(path, "fasta"))
    seq = str(record.seq).upper()
    return seq, record.id

# Option B: Load from GenBank (strip annotations, keep sequence)
def load_genbank(path):
    record = next(SeqIO.parse(path, "genbank"))
    seq = str(record.seq).upper()
    return seq, record.id

# Option C: Raw sequence string
raw_seq = "ATGCGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTT"

# Validate sequence
def validate_plasmid(seq, name="plasmid"):
    valid_bases = set("ATGCNRYSWKMBDHV")
    invalid = set(seq.upper()) - valid_bases
    if invalid:
        raise ValueError(f"Invalid bases in {name}: {invalid}")
    if len(seq) < 100:
        raise ValueError(f"Sequence too short ({len(seq)} bp); minimum 100 bp")
    gc = (seq.count("G") + seq.count("C")) / len(seq) * 100
    print(f"{name}: {len(seq):,} bp, GC={gc:.1f}%")
    return seq

seq, plasmid_id = load_fasta("plasmid.fasta")
validate_plasmid(seq, plasmid_id)
```

### Step 2: Run BLAST-Based Annotation

Run annotation using the selected database. The `linear` flag controls whether the sequence is treated as circular (default for plasmids) or linear (for gene blocks and linear fragments).

```python
from plannotate import annotate

# Annotate circular plasmid against the Addgene database (most comprehensive for common vectors)
results = annotate(
    seq,
    linear=False,       # False = circular plasmid (default)
    db="addgene",       # Database: "addgene", "fpbase", or "snapgene"
)

print(f"Total features detected: {len(results)}")
print(f"\nColumns: {list(results.columns)}")

# Preview feature table
cols = ["Feature", "Feature_type", "start", "end", "strand", "pct_identity", "pct_query_cov"]
print(results[cols].sort_values("start").to_string(index=False))
```

### Step 3: Filter Features by Quality Thresholds

Review annotation confidence using BLAST identity and query coverage scores. High-confidence annotations have >95% identity and >90% coverage; partial hits may indicate truncated or mutated features.

```python
import pandas as pd

# Inspect hit quality distribution
print("Identity percentile summary:")
print(results["pct_identity"].describe().round(1))
print("\nCoverage percentile summary:")
print(results["pct_query_cov"].describe().round(1))

# Separate high- and low-confidence hits
high_conf = results[
    (results["pct_identity"] >= 95) &
    (results["pct_query_cov"] >= 90)
].copy()

low_conf = results[
    (results["pct_identity"] < 95) |
    (results["pct_query_cov"] < 90)
].copy()

print(f"\nHigh-confidence features (identity>=95%, coverage>=90%): {len(high_conf)}")
print(f"Low-confidence / partial features:                       {len(low_conf)}")

if not low_conf.empty:
    print("\nLow-confidence features (review manually):")
    print(low_conf[["Feature", "Feature_type", "pct_identity", "pct_query_cov"]].to_string(index=False))

# Save filtered table
results.to_csv("all_features.csv", index=False)
high_conf.to_csv("high_confidence_features.csv", index=False)
print("\nSaved: all_features.csv, high_confidence_features.csv")
```

### Step 4: Export Annotated GenBank File

Write the annotated sequence to GenBank format for import into plasmid editors (SnapGene, Benchling, Geneious, ApE) and for BioPython-based downstream analysis.

```python
from plannotate import write_genbank

# Write full annotation (all features)
write_genbank(seq, results, output_file="plasmid_annotated.gb")
print("Written: plasmid_annotated.gb")

# Write with high-confidence features only
write_genbank(seq, high_conf, output_file="plasmid_highconf.gb")
print("Written: plasmid_highconf.gb")

# Verify using BioPython
from Bio import SeqIO
record = next(SeqIO.parse("plasmid_annotated.gb", "genbank"))
print(f"\nGenBank verification:")
print(f"  Sequence length: {len(record.seq):,} bp")
print(f"  Features: {len(record.features)}")
for feat in record.features:
    label = feat.qualifiers.get("label", ["(unlabeled)"])[0]
    print(f"  [{feat.type:20s}] {label} @ {feat.location}")
```

### Step 5: Generate Interactive HTML Visualization

Create a Bokeh-based interactive plasmid map. The HTML file is self-contained and can be shared without any server infrastructure.

```python
from plannotate import create_bokeh_chart

# Generate interactive circular plasmid map
create_bokeh_chart(
    seq,
    results,
    output_file="plasmid_map.html",
)
print("Interactive map saved: plasmid_map.html")
print("Open in any browser — no server required")

# Tip: open automatically in the default browser
import webbrowser, os
webbrowser.open(f"file://{os.path.abspath('plasmid_map.html')}")
```

### Step 6: Parse GenBank Output with BioPython

Extract annotated features programmatically for downstream analysis — restriction site mapping, primer design, or construct verification reports.

```python
from Bio import SeqIO
from Bio.SeqFeature import FeatureLocation
import pandas as pd

record = next(SeqIO.parse("plasmid_annotated.gb", "genbank"))

# Build a feature DataFrame from the GenBank record
rows = []
for feat in record.features:
    label = feat.qualifiers.get("label", [""])[0]
    note  = feat.qualifiers.get("note",  [""])[0]
    rows.append({
        "type":   feat.type,
        "label":  label,
        "note":   note,
        "start":  int(feat.location.start),
        "end":    int(feat.location.end),
        "strand": feat.location.strand,
        "length": len(feat.location),
    })

feat_df = pd.DataFrame(rows)
print(feat_df.to_string(index=False))

# Example: find antibiotic resistance genes
resistance = feat_df[feat_df["label"].str.contains(
    r"AmpR|KanR|CmR|TetR|SpecR|HygR|ZeoR|BlastR|GentR",
    case=False, na=False, regex=True
)]
print(f"\nAntibiotic resistance markers found: {len(resistance)}")
print(resistance[["label", "start", "end", "length"]].to_string(index=False))
```

### Step 7: Batch Annotate Multiple Plasmids

Annotate an entire cloning library from a multi-FASTA file or a directory of individual FASTA files and aggregate results into a single summary table.

```python
from plannotate import annotate, write_genbank, create_bokeh_chart
from Bio import SeqIO
import pandas as pd
import os

input_dir  = "plasmids/"          # directory of *.fasta files
output_dir = "annotated_results/"
os.makedirs(output_dir, exist_ok=True)

summary_rows = []

for fasta_file in sorted(f for f in os.listdir(input_dir) if f.endswith(".fasta")):
    plasmid_name = fasta_file.replace(".fasta", "")
    fasta_path   = os.path.join(input_dir, fasta_file)

    record = next(SeqIO.parse(fasta_path, "fasta"))
    seq    = str(record.seq).upper()

    print(f"Annotating {plasmid_name} ({len(seq):,} bp)...", end=" ")
    results = annotate(seq, linear=False, db="addgene")
    print(f"{len(results)} features")

    # Save per-plasmid outputs
    write_genbank(
        seq, results,
        output_file=os.path.join(output_dir, f"{plasmid_name}.gb")
    )
    create_bokeh_chart(
        seq, results,
        output_file=os.path.join(output_dir, f"{plasmid_name}.html")
    )
    results.to_csv(os.path.join(output_dir, f"{plasmid_name}_features.csv"), index=False)

    # Accumulate for summary
    results["plasmid"] = plasmid_name
    summary_rows.append(results)

# Consolidated summary table
summary = pd.concat(summary_rows, ignore_index=True)
summary.to_csv(os.path.join(output_dir, "all_plasmids_features.csv"), index=False)
print(f"\nBatch complete. {summary['plasmid'].nunique()} plasmids annotated.")
print(f"Summary table: {output_dir}all_plasmids_features.csv")
```

## Key Parameters

| Parameter | Default | Range / Options | Effect |
|-----------|---------|-----------------|--------|
| `linear` | `False` | `True`, `False` | Treat sequence as linear (`True`) or circular (`False`); circular mode handles split features at the origin correctly |
| `db` | `"addgene"` | `"addgene"`, `"fpbase"`, `"snapgene"` | Feature database to search; `addgene` is broadest (promoters, resistance genes, origins, tags); `fpbase` adds fluorescent protein variants; `snapgene` includes SnapGene-curated features |
| `min_len` | `0` | `0`–`500` bp | Minimum feature length in bp; increase to suppress short spurious matches |
| `blast_identity_threshold` | `95` | `70`–`100` % | Minimum BLAST % identity to report a hit; lower values detect diverged homologs but increase false positives |
| `--html` (CLI) | off | flag | Generate interactive HTML plasmid map alongside GenBank output |
| `--csv` (CLI) | off | flag | Write CSV feature table to the output directory |
| `--linear` (CLI) | off | flag | Treat input as linear sequence (default is circular) |
| `--file` / `--input` (CLI) | required | FASTA path | Input plasmid sequence in FASTA format |

## Common Recipes

### Recipe: Launch Web App for Interactive Use

When to use: exploring a single plasmid interactively without writing code, or sharing with wet-lab collaborators who prefer a browser interface.

```bash
# Launch the pLannotate Streamlit web app (opens in browser at localhost:5000)
plannotate streamlit

# Or specify a custom port
plannotate streamlit --port 8501
```

### Recipe: CLI Batch Annotation

When to use: annotating multiple plasmids in a scripted workflow or on a remote server without Python scripting.

```bash
# Single plasmid
plannotate batch \
    --input plasmid.fasta \
    --output results/ \
    --html \
    --csv

# Multiple plasmids via shell glob
for f in plasmids/*.fasta; do
    name=$(basename "$f" .fasta)
    plannotate batch \
        --input "$f" \
        --output "annotated/${name}/" \
        --html --csv
done
echo "Done. Outputs in annotated/"
```

### Recipe: Compare Annotations Before and After Mutagenesis

When to use: verifying that a site-directed mutagenesis or insertion did not disrupt existing features or introduce unintended ones.

```python
from plannotate import annotate
from Bio import SeqIO
import pandas as pd

def annotation_diff(seq_before, seq_after, db="addgene"):
    res_before = annotate(seq_before, linear=False, db=db)
    res_after  = annotate(seq_after,  linear=False, db=db)

    features_before = set(res_before["Feature"])
    features_after  = set(res_after["Feature"])

    gained = features_after  - features_before
    lost   = features_before - features_after
    shared = features_before & features_after

    print(f"Shared features: {len(shared)}")
    print(f"Gained features: {gained if gained else 'none'}")
    print(f"Lost features:   {lost   if lost   else 'none'}")
    return res_before, res_after

seq_wt  = str(next(SeqIO.parse("plasmid_wt.fasta",  "fasta")).seq)
seq_mut = str(next(SeqIO.parse("plasmid_mut.fasta", "fasta")).seq)
res_before, res_after = annotation_diff(seq_wt, seq_mut)
```

### Recipe: Export Feature Table to Excel with Conditional Formatting

When to use: sharing annotation results with collaborators who prefer spreadsheets over GenBank files.

```python
from plannotate import annotate
from Bio import SeqIO
import pandas as pd

seq = str(next(SeqIO.parse("plasmid.fasta", "fasta")).seq)
results = annotate(seq, linear=False, db="addgene")

# Tidy column selection and renaming
export = results[[
    "Feature", "Feature_type", "start", "end", "strand",
    "pct_identity", "pct_query_cov", "database"
]].copy()
export.columns = [
    "Feature Name", "Type", "Start (bp)", "End (bp)", "Strand",
    "Identity (%)", "Coverage (%)", "Database"
]
export["Length (bp)"] = export["End (bp)"] - export["Start (bp)"]
export = export.sort_values("Start (bp)")

# Write to Excel with conditional formatting
with pd.ExcelWriter("plasmid_features.xlsx", engine="openpyxl") as writer:
    export.to_excel(writer, sheet_name="Features", index=False)

print(f"Exported {len(export)} features to plasmid_features.xlsx")
```

## Expected Outputs

| Output File | Format | Description |
|-------------|--------|-------------|
| `plasmid_annotated.gb` | GenBank | Sequence with annotated features; importable into SnapGene, Benchling, Geneious, ApE, BioPython |
| `plasmid_map.html` | HTML | Self-contained interactive circular plasmid map (Bokeh); shareable without a server |
| `all_features.csv` | CSV | Tabular feature list with columns: Feature, Feature_type, start, end, strand, pct_identity, pct_query_cov, database |
| `high_confidence_features.csv` | CSV | Filtered subset with identity >= 95% and coverage >= 90% |
| `all_plasmids_features.csv` | CSV | Batch mode: aggregated features across all plasmids with a `plasmid` column |

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `FileNotFoundError: blastn not found` | BLAST+ not on PATH | Install via conda: `conda install -c bioconda blast`; or via package manager: `brew install blast` (macOS) / `apt install ncbi-blast+` (Linux) |
| `No features detected` | Sequence is too short, wrong database, or non-standard bases | Verify sequence length >= 500 bp; try a different `db` (e.g., `"fpbase"` for fluorescent protein vectors); check for ambiguous bases with `validate_plasmid()` |
| Annotations wrap incorrectly at position 0 | Sequence treated as linear when it is circular | Set `linear=False` (default); this enables circular BLAST to catch features that span the sequence origin |
| HTML map renders blank | `bokeh` version mismatch | Upgrade: `pip install --upgrade bokeh`; pLannotate requires Bokeh >=2.4 |
| Low identity hits for known features | Feature sequence has been mutated or codon-optimized | Lower `blast_identity_threshold` to 85–90%; add a note that these are diverged homologs |
| `MemoryError` or very slow annotation | Sequence > 50 kb or BLAST database not indexed | Split large sequences into sub-regions; ensure the internal pLannotate database index exists (reinstall if needed) |
| GenBank file not parsed by SnapGene | Non-standard feature type labels | Open in Geneious or BioPython first; check for special characters in feature qualifiers |

## References

- [pLannotate GitHub: mmcguffi/pLannotate](https://github.com/mmcguffi/pLannotate) — source code, issue tracker, and installation instructions
- [McGuffi M et al. (2021) Nucleic Acids Research 49(W1): W516–W522](https://doi.org/10.1093/nar/gkab374) — original pLannotate paper describing the BLAST-based annotation pipeline and curated feature database
- [pLannotate PyPI: plannotate](https://pypi.org/project/plannotate/) — package metadata, version history, and pip install info
- [Addgene plasmid repository](https://www.addgene.org/) — primary source for the curated feature library used by pLannotate
- [BioPython SeqIO documentation](https://biopython.org/wiki/SeqIO) — parsing and manipulating GenBank output files