---
name: biopython
description: Python toolkit for computational biology. Use when asked to "parse FASTA", "read GenBank", "query NCBI", "run BLAST", "analyze protein structure", "build phylogenetic tree", or work with biological sequences. Handles sequence I/O, database access, alignments, structure analysis, and phylogenetics.
---

# Biopython: Python Tools for Computational Biology

## Summary

Biopython (v1.85+) is a comprehensive Python library for biological data analysis. It requires Python 3 and NumPy, and it provides modular components for sequences, alignments, database access, BLAST, structures, and phylogenetics.

## Applicable Scenarios

This skill applies when you need to:

| Task Category | Examples |
|---------------|----------|
| Sequence Operations | Create, modify, translate DNA/RNA/protein sequences |
| File Format Handling | Parse or convert FASTA, GenBank, FASTQ, PDB, mmCIF |
| NCBI Database Access | Query GenBank, PubMed, Protein, Gene, Taxonomy |
| Similarity Searches | Execute BLAST locally or via NCBI, parse results |
| Alignment Work | Pairwise or multiple sequence alignments |
| Structural Analysis | Parse PDB files, compute distances, DSSP assignment |
| Tree Construction | Build, manipulate, visualize phylogenetic trees |
| Motif Discovery | Find and score sequence patterns |
| Sequence Statistics | GC content, molecular weight, melting temperature |

## Module Organization

| Module | Purpose | Reference |
|--------|---------|-----------|
| Bio.Seq / Bio.SeqIO | Sequence objects and file I/O | `references/sequence-io.md` |
| Bio.Align / Bio.AlignIO | Pairwise and multiple alignments | `references/alignment.md` |
| Bio.Entrez | NCBI database programmatic access | `references/databases.md` |
| Bio.Blast | BLAST execution and result parsing | `references/blast.md` |
| Bio.PDB | 3D structure manipulation | `references/structure.md` |
| Bio.Phylo | Phylogenetic tree operations | `references/phylogenetics.md` |
| Bio.motifs, Bio.SeqUtils, etc. | Motifs, utilities, restriction sites | `references/advanced.md` |

## Setup

Install with pip (shown here through `uv`; plain `pip install biopython` also works):

```bash
uv pip install biopython
```

Configure NCBI access (mandatory for Entrez operations):

```python
from Bio import Entrez

Entrez.email = "researcher@institution.edu"
Entrez.api_key = "your_ncbi_api_key"  # Optional: increases rate limit to 10 req/s
```

## Quick Reference

### Parse Sequences

```python
from Bio import SeqIO

records = SeqIO.parse("data.fasta", "fasta")
for rec in records:
    print(f"{rec.id}: {len(rec)} bp")
```

### Translate DNA

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
protein = dna.translate()
```
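
`translate()` takes a few options that change the result. The sketch below reuses the sequence above to show the default call, `to_stop=True` (stop at the first in-frame stop codon), and an alternative genetic code selected with `table`. The variable names are illustrative only.

```python
from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Default: standard genetic code, stop codons rendered as "*"
full = dna.translate()

# Stop at the first in-frame stop codon and omit the "*"
until_stop = dna.translate(to_stop=True)

# Use a different NCBI genetic code (here TGA codes for W instead of stop)
mito = dna.translate(table="Vertebrate Mitochondrial")

print(full)        # MAIVMGR*KGAR*
print(until_stop)  # MAIVMGR
print(mito)        # MAIVMGRWKGAR*
```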

### Query NCBI

```python
from Bio import Entrez

Entrez.email = "researcher@institution.edu"
handle = Entrez.esearch(db="nucleotide", term="insulin[Gene] AND human[Organism]")
results = Entrez.read(handle)
handle.close()
```

### Run BLAST

```python
from Bio.Blast import NCBIWWW, NCBIXML

result = NCBIWWW.qblast("blastp", "swissprot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQQIAAALEHHHHHH")
record = NCBIXML.read(result)
```

### Parse Protein Structure

```python
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("protein", "structure.pdb")
for atom in structure.get_atoms():
    print(atom.name, atom.coord)
```

### Build Phylogenetic Tree

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read("aligned.fasta", "fasta")
calc = DistanceCalculator("identity")
dm = calc.get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
Phylo.draw_ascii(tree)
```
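
### Compute Sequence Statistics

The task table lists GC content, molecular weight, and melting temperature, so here is a minimal sketch of those three using `Bio.SeqUtils`. The primer sequence is an arbitrary example, and the melting-temperature calls use default parameters, which may not match your experimental conditions.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils import MeltingTemp as mt

primer = Seq("AGTCTGGGACGGCGCGGCAATCGCA")  # arbitrary example primer

gc = gc_fraction(primer)                       # fraction between 0 and 1
mw = molecular_weight(primer, seq_type="DNA")  # single-stranded DNA, in Da
tm_wallace = mt.Tm_Wallace(primer)             # rule of thumb: 4*(G+C) + 2*(A+T)
tm_nn = mt.Tm_NN(primer)                       # nearest-neighbor model, default settings

print(f"GC: {gc:.1%}")
print(f"Molecular weight: {mw:.1f} Da")
print(f"Tm (Wallace): {tm_wallace:.1f} C")
print(f"Tm (nearest neighbor): {tm_nn:.1f} C")
```

`Tm_Wallace` is only a rough estimate for short oligos; `Tm_NN` accepts salt and concentration arguments when more realistic values are needed.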

## Reference Files

| File | Contents |
|------|----------|
| `references/sequence-io.md` | Bio.Seq objects, SeqIO parsing/writing, large file handling, format conversion |
| `references/alignment.md` | Pairwise alignment, BLOSUM matrices, AlignIO, external aligners |
| `references/databases.md` | NCBI Entrez API, esearch/efetch/elink, batch downloads, search syntax |
| `references/blast.md` | Remote/local BLAST, XML parsing, result filtering, batch queries |
| `references/structure.md` | Bio.PDB, SMCRA hierarchy, DSSP, superimposition, spatial queries |
| `references/phylogenetics.md` | Tree I/O, distance matrices, tree construction, consensus, visualization |
| `references/advanced.md` | Motifs, SeqUtils, restriction enzymes, population genetics, GenomeDiagram |

## Implementation Patterns

### Retrieve and Analyze GenBank Record

```python
from Bio import Entrez, SeqIO
from Bio.SeqUtils import gc_fraction

Entrez.email = "researcher@institution.edu"

handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"Organism: {record.annotations['organism']}")
print(f"Length: {len(record)} bp")
print(f"GC: {gc_fraction(record.seq):.1%}")
```

### Batch Sequence Processing

```python
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

output_records = []
for record in SeqIO.parse("input.fasta", "fasta"):
    if len(record) >= 200 and gc_fraction(record.seq) > 0.4:
        output_records.append(record)

SeqIO.write(output_records, "filtered.fasta", "fasta")
```

### BLAST with Result Filtering

```python
from Bio.Blast import NCBIWWW, NCBIXML

query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
result_handle = NCBIWWW.qblast("blastp", "nr", query, hitlist_size=20)
record = NCBIXML.read(result_handle)

for alignment in record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 1e-10:
            identity_pct = (hsp.identities / hsp.align_length) * 100
            print(f"{alignment.accession}: {identity_pct:.1f}% identity, E={hsp.expect:.2e}")
```

### Phylogeny from Alignment

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
import matplotlib.pyplot as plt

alignment = AlignIO.read("sequences.aln", "clustal")
# "blosum62" is a protein model; use "identity" or a DNA model for nucleotide alignments
calculator = DistanceCalculator("blosum62")
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
tree.root_at_midpoint()
tree.ladderize()

fig, ax = plt.subplots(figsize=(12, 8))
# do_show=False keeps Phylo.draw from calling plt.show() before the figure is saved
Phylo.draw(tree, axes=ax, do_show=False)
fig.savefig("phylogeny.png", dpi=150)
```

## Guidelines

**Imports**: Use explicit imports
```python
from Bio import SeqIO, Entrez
from Bio.Seq import Seq
```

**File Handling**: Always close handles or use context managers
```python
with open("sequences.fasta") as f:
    for record in SeqIO.parse(f, "fasta"):
        process(record)
```

**Memory Efficiency**: Use iterators for large datasets
```python
# Correct: a generator yields matching records one at a time
def matching_records(path):
    for record in SeqIO.parse(path, "fasta"):
        if meets_criteria(record):
            yield record

# Avoid: loading entire file
all_records = list(SeqIO.parse("huge.fasta", "fasta"))
```

**Error Handling**: Wrap network operations
```python
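from urllib.error import HTTPError

from Bio import Entrez, SeqIO

Entrez.email = "researcher@institution.edu"
accession = "NM_001301717"
try:
    # rettype/retmode request GenBank flat-file text, which SeqIO can parse
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
except HTTPError as e:
    print(f"Fetch failed: {e.code}")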
```

**NCBI Compliance**: Set email, respect rate limits, cache downloads locally
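
One way to combine local caching with request spacing is sketched below. `PoliteFetcher` and `fetch_genbank` are hypothetical helpers written for this example and are not part of Biopython. The 0.34 s default interval assumes the 3 requests/second limit that applies without an API key. `Bio.Entrez` applies its own throttling as well, so treat the delay here as an extra, conservative guard.

```python
import time
from pathlib import Path


class PoliteFetcher:
    """Hypothetical helper: serve from a local cache, otherwise fetch with spacing."""

    def __init__(self, fetch, cache_dir, min_interval=0.34):
        self.fetch = fetch                  # callable: accession -> text
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.min_interval = min_interval    # seconds between remote calls
        self._last_call = None

    def get(self, accession):
        cached = self.cache_dir / f"{accession}.gb"
        if cached.exists():
            return cached.read_text()       # cache hit: no network traffic
        if self._last_call is not None:
            wait = self.min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)            # space out consecutive requests
        text = self.fetch(accession)
        self._last_call = time.monotonic()
        cached.write_text(text)
        return text


def fetch_genbank(accession):
    """Fetch one GenBank record as text (requires Entrez.email to be set)."""
    from Bio import Entrez  # imported here so the cache class works without Biopython

    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()
```

Typical use would be `PoliteFetcher(fetch_genbank, "ncbi_cache").get("NM_001301717")`; repeated calls for the same accession then read from disk instead of contacting NCBI.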

## Troubleshooting

| Issue | Resolution |
|-------|------------|
| `UserWarning`: "Email address is not specified" from `Bio.Entrez` | Set `Entrez.email` before any queries |
| HTTP 400 from NCBI | Verify accession/ID format is correct |
| "ValueError: EOF" during parse | Confirm file format matches format string |
| Alignment length mismatch | Sequences must be pre-aligned for AlignIO |
| Slow BLAST queries | Use local BLAST for large-scale searches |
| PDB parser warnings | Use `PDBParser(QUIET=True)` or check structure quality |
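
For the format-mismatch row above, a quick check of the first non-blank line often reveals which format string to pass to `SeqIO.parse`. `guess_seq_format` is a hypothetical helper using a simple heuristic; it recognises only FASTA, FASTQ, and GenBank, and it is not a Biopython function.

```python
def guess_seq_format(path):
    """Guess a SeqIO format name from the first non-blank line, or None if unknown."""
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue                 # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            if line.startswith("LOCUS"):
                return "genbank"
            return None                  # first real line matched nothing known
    return None                          # empty file
```

If the guess disagrees with the format string in your code, fix the format string before looking for other causes.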

## External Resources

- Biopython Documentation: https://biopython.org/docs/latest/
- Biopython Tutorial: https://biopython.org/docs/latest/Tutorial/
- GitHub Repository: https://github.com/biopython/biopython