---
name: "biopython-sequence-analysis"
description: "Biopython sequence analysis: parse FASTA/FASTQ/GenBank/GFF (SeqIO), NCBI Entrez (esearch/efetch/elink), remote/local BLAST, pairwise/MSA alignment (PairwiseAligner, MUSCLE/ClustalW), phylogenetic trees (Phylo). Use for gene family studies, phylogenomics, comparative genomics, NCBI pipelines. For PCR/restriction/cloning use biopython-molecular-biology; for SAM/BAM use pysam."
license: "Biopython License (BSD-like)"
---

# Biopython: Sequence Analysis Toolkit

## Overview

Biopython provides a comprehensive suite of modules for sequence-centric bioinformatics: reading and writing every major biological file format (FASTA, FASTQ, GenBank, GFF), querying NCBI databases programmatically, running BLAST searches and parsing results, aligning sequences pairwise or in multiple-sequence alignments, and building and visualizing phylogenetic trees. This skill focuses on analysis workflows — from NCBI data retrieval through alignment to phylogenetic inference.

For PCR primer design, restriction enzyme digestion, cloning simulation, protein structure analysis (Bio.PDB), and molecular weight/Tm calculations, see **biopython-molecular-biology**.
## When to Use

- Download a gene family from NCBI Nucleotide/Protein, align sequences, and construct a phylogenetic tree
- Parse GenBank or GFF3 annotation files and extract CDS sequences for a set of features
- Run a BLAST search against NCBI `nt` or `nr`, filter significant hits, and fetch their full sequences
- Compute pairwise sequence identities or score alignments with BLOSUM62/PAM250 matrices
- Index a large multi-FASTA or FASTQ file with `SeqIO.index()` for random-access retrieval without loading all sequences into RAM
- Convert between sequence formats (FASTA ↔ GenBank ↔ FASTQ ↔ PHYLIP) in a single call
- Traverse, root, prune, and annotate a Newick or Nexus phylogenetic tree programmatically
- Use **pysam** instead when working with SAM/BAM/CRAM alignment files and mapped reads
- Use **scikit-bio** instead when you need ecological diversity metrics (UniFrac, beta diversity) or ordination methods such as PCoA
- Use **gget** instead for quick gene lookups and one-liner NCBI queries without writing an Entrez pipeline

## Prerequisites

- **Python packages**: `biopython`, `numpy`, `matplotlib`
- **Optional tools**: MUSCLE or ClustalW installed locally (invoked via `subprocess` in the workflows below)
- **NCBI access**: Set `Entrez.email` before any E-utilities call; obtain a free API key at https://www.ncbi.nlm.nih.gov/account/ for 10 req/s (default is 3 req/s)
- **Local BLAST**: BLAST+ installed separately (`conda install -c bioconda blast`) for offline searches

> **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v python` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run python` rather than bare `python`.
```bash
pip install biopython numpy matplotlib
conda install -c bioconda blast  # optional, for local BLAST
```

## Quick Start

```python
from Bio import SeqIO, Entrez
from Bio.SeqUtils import gc_fraction

# Fetch a GenBank record and display basic stats
Entrez.email = "your.email@example.com"
handle = Entrez.efetch(db="nucleotide", id="NM_007294", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(f"ID: {record.id}")
print(f"Length: {len(record.seq)} bp")
print(f"GC content: {gc_fraction(record.seq)*100:.1f}%")
print(f"Features: {len(record.features)}")
print(f"First feature: {record.features[0].type} at {record.features[0].location}")
# ID: NM_007294.4
# Length: 7207 bp
# GC content: 47.3%
# Features: 9
```

## Core API

### Module 1: SeqIO — File Parsing and Format Conversion

`Bio.SeqIO` reads and writes every major sequence format (FASTA, FASTQ, GenBank, EMBL, PHYLIP, Nexus). `SeqIO.parse()` returns an iterator of `SeqRecord` objects; `SeqIO.index()` builds an on-disk or in-memory dictionary for large files.
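Because `SeqIO.write()` accepts any iterator of `SeqRecord` objects, a filter can be streamed into the writer without materializing a list. A minimal self-contained sketch (the record IDs, the 80 bp cutoff, and the in-memory handle are invented for illustration; in practice the records would come from `SeqIO.parse()` and the output would be a file path):

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Two toy records; real code would use SeqIO.parse("input.fasta", "fasta")
records = [
    SeqRecord(Seq("ATGC" * 30), id="long_read", description=""),
    SeqRecord(Seq("ATGC" * 10), id="short_read", description=""),
]

# Generator expression: records stream through the filter one at a time
keep = (r for r in records if len(r.seq) >= 80)

out = StringIO()  # any file-like handle works here
n_written = SeqIO.write(keep, out, "fasta")
print(f"Wrote {n_written} of {len(records)} records")  # only long_read passes
```

The same pattern scales to multi-gigabyte files, since at most one record is held in memory at a time.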
```python
from Bio import SeqIO

# Parse FASTA and FASTQ
fasta_records = list(SeqIO.parse("sequences.fasta", "fasta"))
print(f"FASTA: {len(fasta_records)} sequences")

# FASTQ: access quality scores
for rec in SeqIO.parse("reads.fastq", "fastq"):
    quals = rec.letter_annotations["phred_quality"]
    avg_q = sum(quals) / len(quals)
    print(f"  {rec.id}: {len(rec.seq)} bp, mean Q={avg_q:.1f}")
    break  # show one example

# Parse GenBank with feature access
for rec in SeqIO.parse("chromosome.gb", "genbank"):
    cdss = [f for f in rec.features if f.type == "CDS"]
    print(f"{rec.id}: {len(cdss)} CDS features")
    for cds in cdss[:3]:
        gene = cds.qualifiers.get("gene", ["unknown"])[0]
        print(f"  {gene}: {cds.location}")
```

```python
from Bio import SeqIO

# SeqIO.index() for random access without loading all records
# Useful for large reference FASTA files (genomes, nr database subsets)
idx = SeqIO.index("large_genome.fasta", "fasta")
print(f"Index contains {len(idx)} sequences")

# Retrieve specific sequences by ID in O(1)
target = idx["chr1"]
print(f"chr1: {len(target.seq):,} bp")
region = target.seq[1_000_000:1_001_000]
print(f"Region [1M-1M+1kb]: {region[:60]}...")

# Format conversion: GenBank → FASTA in one call
n = SeqIO.convert("annotation.gb", "genbank", "sequences.fasta", "fasta")
print(f"Converted {n} records to FASTA")
```

### Module 2: Seq and SeqRecord — Sequence Objects and Feature Annotations

`Bio.Seq` represents an immutable biological sequence with methods such as `reverse_complement()`, `transcribe()`, and `translate()` (use `MutableSeq` for in-place edits). `Bio.SeqRecord` wraps a `Seq` with an ID, a description, an `annotations` dictionary, and a list of `SeqFeature` objects. Features use `FeatureLocation` with strand (+1, -1).
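Multi-part (spliced) features combine several `FeatureLocation` parts in a `CompoundLocation`, and `extract()` concatenates the parts in order. A minimal sketch — the sequence, coordinates, and two-exon layout are invented for illustration:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# A toy "genomic" sequence: exon1 (0-6), intron (6-12), exon2 (12-18)
genomic = Seq("ATGAAA" + "GTTTAG" + "CCCTAA")

# Join two exon parts into one spliced CDS location
spliced = CompoundLocation([
    FeatureLocation(0, 6, strand=+1),
    FeatureLocation(12, 18, strand=+1),
])
cds = SeqFeature(spliced, type="CDS")

# extract() skips the intron and concatenates the exons
mrna = cds.location.extract(genomic)
print(mrna)              # ATGAAACCCTAA
print(mrna.translate())  # MKP*
```

On the minus strand, each part is reverse-complemented and the parts are assembled in transcription order, so the same `extract()` call stays strand-correct.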
```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, MeltingTemp

dna = Seq("ATGAAACCCGGGTTTTAA")
print(f"GC content: {gc_fraction(dna)*100:.1f}%")
print(f"Reverse complement: {dna.reverse_complement()}")
print(f"Transcript: {dna.transcribe()}")
print(f"Translation: {dna.translate()}")
print(f"Translation (to stop): {dna.translate(to_stop=True)}")

# Tm calculation for a short primer
primer = Seq("ATGAAACCCGGG")
tm = MeltingTemp.Tm_Wallace(primer)
print(f"Tm (Wallace): {tm:.1f}°C")
```

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Build a SeqRecord with annotated features
gene_seq = Seq("ATGAAACCCGGGTTTTAAATCGATCG" * 10)
record = SeqRecord(gene_seq, id="MY_GENE_001", description="synthetic example gene")

# Annotate a CDS feature
cds = SeqFeature(
    FeatureLocation(0, 18, strand=+1),
    type="CDS",
    qualifiers={"gene": ["myGene"], "product": ["hypothetical protein"]}
)
record.features.append(cds)

# Extract and translate the feature
cds_seq = cds.location.extract(record.seq)
protein = cds_seq.translate(to_stop=True)
print(f"CDS: {cds_seq}")
print(f"Protein: {protein}")

# Slice record preserving feature annotations
sub = record[0:60]
print(f"Subrecord: {len(sub.seq)} bp, {len(sub.features)} features")
```

### Module 3: Entrez — Programmatic NCBI Database Access

`Bio.Entrez` wraps the NCBI E-utilities API: `esearch` finds records matching a query, `efetch` retrieves full records, `elink` finds related records across databases, and `esummary` returns document summaries. Always set `Entrez.email` and respect the 3 req/s rate limit (10 req/s with an API key).
```python
from Bio import Entrez, SeqIO
import time

Entrez.email = "your.email@example.com"
# Entrez.api_key = "YOUR_API_KEY"  # for 10 req/s

# esearch: find IDs matching a query
handle = Entrez.esearch(db="nucleotide",
                        term="BRCA1[Gene] AND Homo sapiens[Organism] AND mRNA[Filter]",
                        retmax=10)
search_results = Entrez.read(handle)
handle.close()
print(f"Total hits: {search_results['Count']}")
ids = search_results["IdList"]
print(f"Retrieved IDs: {ids}")

# efetch: download full GenBank records
for acc_id in ids[:3]:
    handle = Entrez.efetch(db="nucleotide", id=acc_id, rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    print(f"  {record.id}: {len(record.seq)} bp — {record.description[:60]}")
    time.sleep(0.4)  # stay within rate limit
```

```python
from Bio import Entrez

Entrez.email = "your.email@example.com"

# elink: cross-database links (e.g., PubMed article → related nucleotide sequences)
handle = Entrez.elink(dbfrom="pubmed", db="nucleotide", id="29087512")
link_results = Entrez.read(handle)
handle.close()

linked_ids = []
for linkset in link_results:
    for db_links in linkset.get("LinkSetDb", []):
        if db_links["DbTo"] == "nucleotide":
            linked_ids = [lnk["Id"] for lnk in db_links["Link"]]
            break
print(f"Nucleotide sequences linked to PubMed 29087512: {len(linked_ids)}")
print(f"First IDs: {linked_ids[:5]}")

# esummary: lightweight metadata without downloading full records
if linked_ids:
    handle = Entrez.esummary(db="nucleotide", id=",".join(linked_ids[:5]))
    summaries = Entrez.read(handle)
    handle.close()
    for doc in summaries:
        print(f"  {doc['AccessionVersion']}: {doc['Title'][:70]}")
```

### Module 4: BLAST — Remote and Local Sequence Similarity Search

`Bio.Blast.NCBIWWW.qblast()` submits queries to NCBI BLAST servers and returns XML handles. `Bio.Blast.NCBIXML.parse()` (or `.read()` for single queries) yields `Blast` objects with `alignments` and `hsps`. For large-scale searches, use local BLAST+ via `subprocess`.
```python
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Seq import Seq

# Remote BLASTP against Swiss-Prot (small, reviewed database)
query = Seq("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY")
result_handle = NCBIWWW.qblast("blastp", "swissprot", str(query), hitlist_size=10)

# Parse results
blast_record = NCBIXML.read(result_handle)
print(f"Query: {blast_record.query}")
print(f"Database: {blast_record.database}")
print(f"Hits: {len(blast_record.alignments)}")

E_VALUE_THRESH = 1e-5
for alignment in blast_record.alignments[:5]:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            identity_pct = hsp.identities / hsp.align_length * 100
            print(f"\n  Hit: {alignment.title[:70]}")
            print(f"  E-value: {hsp.expect:.2e}, Identity: {identity_pct:.1f}%, Score: {hsp.score}")
            print(f"  Query: {hsp.query[:60]}")
            print(f"  Match: {hsp.match[:60]}")
            print(f"  Sbjct: {hsp.sbjct[:60]}")
```

```python
import subprocess
from Bio.Blast import NCBIXML

# Local BLAST+ via subprocess (faster for large batches)
# Requires: makeblastdb and blastp installed (conda install -c bioconda blast)

# Step 1: Build a local database from a FASTA file
subprocess.run(
    ["makeblastdb", "-in", "ref_proteins.fasta", "-dbtype", "prot", "-out", "ref_db"],
    check=True
)

# Step 2: Run blastp against the local database
subprocess.run(
    ["blastp", "-query", "query.fasta", "-db", "ref_db",
     "-outfmt", "5",  # XML output for NCBIXML parsing
     "-evalue", "1e-5", "-num_threads", "4",
     "-out", "blast_results.xml"],
    check=True
)

# Step 3: Parse XML results
with open("blast_results.xml") as fh:
    for blast_rec in NCBIXML.parse(fh):
        print(f"Query: {blast_rec.query_id}")
        for aln in blast_rec.alignments[:3]:
            hsp = aln.hsps[0]
            print(f"  Hit: {aln.hit_id}, E={hsp.expect:.2e}, Id={hsp.identities}/{hsp.align_length}")
```

### Module 5: Phylo — Tree Parsing, Manipulation, and Visualization

`Bio.Phylo` reads Newick, Nexus, PhyloXML, and NeXML formats. Trees are represented as `Clade` objects with nested children.
Key methods: `find_clades()` (generator over nodes), `common_ancestor()`, `distance()`, `root_with_outgroup()`, and `prune()`. Visualization uses matplotlib.

```python
from Bio import Phylo
import io

# Parse a Newick tree from a string
newick = "((Homo_sapiens:0.01, Pan_troglodytes:0.012):0.08, (Mus_musculus:0.15, Rattus_norvegicus:0.14):0.12, Drosophila_melanogaster:0.85);"
tree = Phylo.read(io.StringIO(newick), "newick")
print(f"Number of terminals: {tree.count_terminals()}")
print(f"Total branch length: {tree.total_branch_length():.3f}")
print(f"Terminals: {[t.name for t in tree.get_terminals()]}")

# Root with outgroup and compute distances
tree.root_with_outgroup("Drosophila_melanogaster")
human = tree.find_any("Homo_sapiens")
chimp = tree.find_any("Pan_troglodytes")
mouse = tree.find_any("Mus_musculus")
print(f"\nDistance Human–Chimp: {tree.distance(human, chimp):.4f}")
print(f"Distance Human–Mouse: {tree.distance(human, mouse):.4f}")

# Traverse all internal nodes
for clade in tree.find_clades(order="level"):
    if not clade.is_terminal():
        children = [c.name or "internal" for c in clade.clades]
        print(f"  Internal node → {children}")
```

```python
from Bio import Phylo
import matplotlib.pyplot as plt
import io

newick = "((Homo_sapiens:0.01, Pan_troglodytes:0.012):0.08, (Mus_musculus:0.15, Rattus_norvegicus:0.14):0.12, Drosophila_melanogaster:0.85);"
tree = Phylo.read(io.StringIO(newick), "newick")
tree.root_with_outgroup("Drosophila_melanogaster")
tree.ladderize()  # sort clades by size for a clean layout

# Annotate bootstrap support (if present in node labels)
for clade in tree.find_clades():
    if clade.confidence is not None:
        clade.name = f"{clade.confidence:.0f}"

fig, ax = plt.subplots(figsize=(8, 5))
Phylo.draw(tree, axes=ax, do_show=False, show_confidence=True)
ax.set_title("Primate + Outgroup Phylogeny")
plt.tight_layout()
plt.savefig("phylogeny.png", dpi=150, bbox_inches="tight")
print("Saved phylogeny.png")
```

### Module 6: PairwiseAligner — Pairwise Sequence Alignment

`Bio.Align.PairwiseAligner` replaces the legacy `pairwise2` module (deprecated since Biopython 1.80). It supports local and global alignment, gap open/extend penalties, and any substitution matrix from `Bio.Align.substitution_matrices`.

```python
from Bio.Align import PairwiseAligner, substitution_matrices

# Global protein alignment with BLOSUM62
aligner = PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -11
aligner.extend_gap_score = -1

seq1 = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY"
seq2 = "MTEYKLVVVGAVGVGKSALTIQLIQNHFVDEYDPTIEDSY"  # G12V variant

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print(f"Score: {best.score:.1f}")
print(f"Score only (skips traceback, faster): {aligner.score(seq1, seq2):.1f}")
print(best)  # formatted alignment string

aligned1, aligned2 = best[0], best[1]  # gapped strings for the two rows
identity = sum(a == b for a, b in zip(aligned1, aligned2)) / best.shape[1]
print(f"Identity: {identity*100:.1f}%")
```

```python
from Bio.Align import PairwiseAligner

# Local DNA alignment
aligner = PairwiseAligner()
aligner.mode = "local"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -5
aligner.extend_gap_score = -0.5

query = "ACGTACGTACGT"
subject = "TTTTACGTACGTACGTTTTT"
alignments = list(aligner.align(query, subject))
print(f"Local alignments found: {len(alignments)}")
best = alignments[0]
print(f"Score: {best.score}")
print(f"Aligned region in subject: [{best.coordinates[1][0]}:{best.coordinates[1][-1]}]")
print(best)

# Batch pairwise identity matrix
seqs = ["ACGTACGT", "ACGTATGT", "TTGTACGT", "ACGTACGG"]
n = len(seqs)
aligner2 = PairwiseAligner(mode="global", match_score=1, mismatch_score=-1)
print("\nPairwise identity matrix:")
for i in range(n):
    row = []
    for j in range(n):
        score = aligner2.score(seqs[i], seqs[j])
        max_len = max(len(seqs[i]), len(seqs[j]))
        row.append(f"{score/max_len:.2f}")
    print("  " + " ".join(row))
```

## Key Concepts

### SeqRecord and Feature Coordinates

`SeqRecord` stores sequence metadata and a list of `SeqFeature` objects. Features use zero-based, half-open `FeatureLocation(start, end, strand)` — the same convention as Python slicing. Use `feature.location.extract(record.seq)` to obtain the strand-correct subsequence. Compound locations (e.g., spliced exons) use `CompoundLocation`.

```python
from Bio import SeqIO

# Extract all CDS protein translations from a GenBank record
for rec in SeqIO.parse("gene.gb", "genbank"):
    for feat in rec.features:
        if feat.type == "CDS":
            gene = feat.qualifiers.get("gene", ["?"])[0]
            product = feat.qualifiers.get("product", ["unknown"])[0]
            cds_nt = feat.location.extract(rec.seq)
            protein = cds_nt.translate(to_stop=True)
            print(f"{gene} ({product}): {len(protein)} aa — {protein[:10]}...")
```

### Phylo Clade Objects

A `Clade` is both a node and the subtree rooted at that node. Terminal clades (leaves) have `.name` set; internal clades may have `.confidence` (bootstrap) and `.branch_length`. Use `tree.find_clades()` with `terminal=True/False` to filter. `Phylo.read()` returns a `Tree` wrapper; `tree.root` gives the root `Clade`.

## Common Workflows

### Workflow 1: Gene Family Phylogeny — NCBI Download → MUSCLE Alignment → Tree

**Goal**: Download all BRCA1 orthologs from Vertebrata, align with MUSCLE, and build a neighbor-joining tree.
```python
import subprocess
import time

import matplotlib.pyplot as plt
from Bio import Entrez, SeqIO, Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

Entrez.email = "your.email@example.com"

# Step 1: Search NCBI Protein for BRCA1 orthologs
handle = Entrez.esearch(
    db="protein",
    term="BRCA1[Gene] AND Vertebrata[Organism] AND RefSeq[Filter]",
    retmax=20
)
results = Entrez.read(handle)
handle.close()
ids = results["IdList"]
print(f"Found {results['Count']} sequences, using {len(ids)}")

# Step 2: Fetch sequences in FASTA format
handle = Entrez.efetch(db="protein", id=",".join(ids), rettype="fasta", retmode="text")
with open("brca1_orthologs.fasta", "w") as fh:
    fh.write(handle.read())
handle.close()
print(f"Saved {len(ids)} sequences to brca1_orthologs.fasta")
time.sleep(1)

# Step 3: Align with MUSCLE (must be installed: conda install -c bioconda muscle)
subprocess.run(
    ["muscle", "-align", "brca1_orthologs.fasta", "-output", "brca1_aligned.fasta"],
    check=True
)
alignment = AlignIO.read("brca1_aligned.fasta", "fasta")
print(f"Alignment: {len(alignment)} sequences × {alignment.get_alignment_length()} columns")

# Step 4: Build a neighbor-joining tree from an identity distance matrix
calculator = DistanceCalculator("identity")
constructor = DistanceTreeConstructor(calculator, method="nj")
tree = constructor.build_tree(alignment)

# Step 5: Root and save
tree.root_at_midpoint()
tree.ladderize()
Phylo.write(tree, "brca1_nj.nwk", "newick")
print("Saved NJ tree to brca1_nj.nwk")

# Step 6: Visualize
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("BRCA1 Ortholog NJ Tree")
plt.tight_layout()
plt.savefig("brca1_tree.png", dpi=150, bbox_inches="tight")
print("Saved brca1_tree.png")
```

### Workflow 2: BLAST Pipeline — FASTA Query → Remote BLAST → Fetch Top Hits

**Goal**: Take a query FASTA, BLAST against NCBI `nr`, filter significant hits, retrieve full GenBank records for the top matches, and write a summary CSV.

```python
import csv
import time

from Bio import SeqIO, Entrez
from Bio.Blast import NCBIWWW, NCBIXML

Entrez.email = "your.email@example.com"

# Step 1: Load query sequence
query_record = SeqIO.read("query.fasta", "fasta")
print(f"Query: {query_record.id} ({len(query_record.seq)} aa)")

# Step 2: Remote BLAST against nr protein database
print("Submitting BLAST job (may take 1-5 minutes)...")
result_handle = NCBIWWW.qblast(
    "blastp", "nr", str(query_record.seq),
    hitlist_size=20,
    expect=1e-5,
    word_size=6,
    matrix_name="BLOSUM62"
)

# Step 3: Parse results and filter
E_THRESH = 1e-5
MIN_IDENTITY = 50.0
top_hits = []
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect > E_THRESH:
            continue
        identity_pct = hsp.identities / hsp.align_length * 100
        if identity_pct < MIN_IDENTITY:
            continue
        top_hits.append({
            "accession": alignment.accession,
            "title": alignment.title[:80],
            "length": alignment.length,
            "score": hsp.score,
            "evalue": hsp.expect,
            "identity_pct": round(identity_pct, 1),
            "coverage": round(hsp.align_length / blast_record.query_length * 100, 1),
        })
print(f"Filtered to {len(top_hits)} significant hits")

# Step 4: Fetch full GenBank records for top 5 hits
accessions = [h["accession"] for h in top_hits[:5]]
time.sleep(1)
handle = Entrez.efetch(db="protein", id=",".join(accessions), rettype="gb", retmode="text")
gb_records = list(SeqIO.parse(handle, "genbank"))
handle.close()
SeqIO.write(gb_records, "top_blast_hits.gb", "genbank")
print(f"Saved {len(gb_records)} full GenBank records to top_blast_hits.gb")

# Step 5: Write CSV summary
with open("blast_summary.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=top_hits[0].keys())
    writer.writeheader()
    writer.writerows(top_hits)
print(f"Saved blast_summary.csv ({len(top_hits)} hits)")
```

### Workflow 3: FASTQ Quality Filter and Format Pipeline

**Goal**: Read a FASTQ file, filter reads by mean quality and length, and output a FASTA for downstream alignment.

```python
from Bio import SeqIO

input_fastq = "raw_reads.fastq"
output_fasta = "filtered_reads.fasta"
MIN_QUAL = 20   # mean Phred quality
MIN_LEN = 100   # minimum read length
MAX_LEN = 300   # maximum read length

def passes_qc(record, min_q=MIN_QUAL, min_l=MIN_LEN, max_l=MAX_LEN):
    quals = record.letter_annotations["phred_quality"]
    mean_q = sum(quals) / len(quals)
    length = len(record.seq)
    return mean_q >= min_q and min_l <= length <= max_l

kept = 0
total = 0
with open(output_fasta, "w") as out_fh:
    for record in SeqIO.parse(input_fastq, "fastq"):
        total += 1
        if passes_qc(record):
            # Convert FASTQ → FASTA (drops quality scores)
            SeqIO.write(record, out_fh, "fasta")
            kept += 1

print(f"Total reads: {total:,}")
print(f"Passed QC: {kept:,} ({kept/total*100:.1f}%)")
print(f"Saved to: {output_fasta}")
```

## Key Parameters

| Parameter | Module / Function | Default | Range / Options | Effect |
|-----------|------------------|---------|-----------------|--------|
| `retmax` | `Entrez.esearch()` | `20` | `1`–`10000` | Max IDs returned per search; use history server for >10000 |
| `hitlist_size` | `NCBIWWW.qblast()` | `50` | `1`–`5000` | Max BLAST alignments returned |
| `expect` | `NCBIWWW.qblast()` | `10.0` | `1e-100`–`1000` | E-value cutoff for BLAST hit reporting |
| `matrix_name` | `NCBIWWW.qblast()` | `"BLOSUM62"` | `"BLOSUM45"`, `"BLOSUM80"`, `"PAM250"` | Substitution matrix; BLOSUM62 for general use |
| `mode` | `PairwiseAligner` | `"global"` | `"global"`, `"local"` | Needleman-Wunsch (global) vs Smith-Waterman (local) |
| `open_gap_score` | `PairwiseAligner` | `0` | Negative float | Penalty for opening a gap; increase magnitude to penalize more |
| `extend_gap_score` | `PairwiseAligner` | `0` | Negative float | Per-residue gap extension penalty |
| `method` | `DistanceTreeConstructor` | `"nj"` | `"nj"`, `"upgma"` | NJ (unrooted) vs UPGMA (rooted, assumes clock) |
| `format` | `Phylo.read/write()` | required | `"newick"`, `"nexus"`, `"phyloxml"` | Tree file format |
| `rettype` | `Entrez.efetch()` | required | `"fasta"`, `"gb"`, `"xml"` | Record format returned; pair with matching SeqIO format string |

## Best Practices

1. **Always set `Entrez.email`** and respect rate limits. NCBI blocks IPs that exceed 3 req/s without a key. Add `time.sleep(0.4)` between requests in loops, or use `Entrez.api_key` for 10 req/s.

   ```python
   from Bio import Entrez
   import time

   Entrez.email = "your.email@institution.edu"
   Entrez.api_key = "YOUR_API_KEY"  # from https://www.ncbi.nlm.nih.gov/account/
   # In any batch loop:
   time.sleep(0.11)  # ~9 req/s with key
   ```

2. **Use `SeqIO.index()` for large FASTA files** instead of `list(SeqIO.parse(...))`. Loading all records into a list consumes O(N) memory; `index()` reads only the byte offsets and fetches records on demand.

   ```python
   # Bad for large files: loads everything into RAM
   # records = {r.id: r for r in SeqIO.parse("genome.fasta", "fasta")}

   # Good: on-disk index, O(1) access
   from Bio import SeqIO
   idx = SeqIO.index("genome.fasta", "fasta")
   rec = idx["chr22"]  # fetches only this record
   ```

3. **Use `PairwiseAligner`, not the legacy `pairwise2`**. `Bio.pairwise2` is deprecated since Biopython 1.80 and will be removed. `PairwiseAligner` is faster, supports substitution matrices directly, and returns `Alignment` objects with coordinate arrays.

4. **Close Entrez handles immediately** after reading. Handles are HTTP connections; leaving them open risks timeouts and resource exhaustion.

   ```python
   handle = Entrez.efetch(db="nucleotide", id="NM_007294", rettype="gb", retmode="text")
   record = SeqIO.read(handle, "genbank")
   handle.close()  # do not skip this
   ```

5. **Prefer local BLAST for batch searches**. Remote `NCBIWWW.qblast()` is suitable for ad hoc queries but can queue for minutes on NCBI servers.
For screening >100 sequences, build a local BLAST+ database with `makeblastdb` and call `blastp`/`blastn` via `subprocess` with `-outfmt 5` (XML) for `NCBIXML` parsing.

6. **Root trees before measuring distances**. `Tree.distance()` sums the branch lengths along the path between two nodes. On an unrooted tree the path is still unique, but midpoint-rooting or outgroup-rooting makes biological sense for visualizations and clade assertions.

7. **Strip alignment gaps before building SeqRecord collections**. BLAST and alignment results may include gap characters (`-`). `Seq` operations like `.translate()` will raise errors on gapped sequences; strip with `seq.replace("-", "")` (which supersedes the deprecated `ungap()` method).

## Common Recipes

### Recipe: Batch Entrez Fetch with History Server

When to use: Downloading more than 500 records from NCBI — avoids URL length limits and keeps search results server-side.

```python
from Bio import Entrez, SeqIO
import time

Entrez.email = "your.email@example.com"

# Search and store results on NCBI history server
handle = Entrez.esearch(db="nucleotide",
                        term="16S rRNA[Gene] AND Bacteria[Organism]",
                        retmax=500, usehistory="y")
results = Entrez.read(handle)
handle.close()
webenv = results["WebEnv"]
query_key = results["QueryKey"]
count = int(results["Count"])
print(f"Total: {count} sequences")

# Fetch in batches of 200
batch_size = 200
all_records = []
for start in range(0, min(count, 1000), batch_size):
    handle = Entrez.efetch(
        db="nucleotide", rettype="fasta", retmode="text",
        retstart=start, retmax=batch_size,
        webenv=webenv, query_key=query_key
    )
    batch = list(SeqIO.parse(handle, "fasta"))
    handle.close()
    all_records.extend(batch)
    print(f"  Fetched {len(all_records)}/{min(count, 1000)}")
    time.sleep(0.5)

SeqIO.write(all_records, "16S_sequences.fasta", "fasta")
print(f"Saved {len(all_records)} sequences")
```

### Recipe: Parse GFF3 with Sequence Features

When to use: Extract gene/CDS sequences from a GFF3 annotation paired with a reference FASTA.
```python
from Bio import SeqIO

# Load genome FASTA into indexed dict
genome = SeqIO.to_dict(SeqIO.parse("genome.fasta", "fasta"))

# Parse GFF3 manually (Biopython does not have a native GFF3 parser;
# use the gffutils package for complex queries, or parse directly for simple cases)
genes = []
with open("annotation.gff3") as fh:
    for line in fh:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.strip().split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        chrom, _, _, start, end, _, strand, _, attrs = fields
        start, end = int(start) - 1, int(end)  # GFF3 is 1-based, inclusive
        if chrom not in genome:
            continue
        seq = genome[chrom].seq[start:end]
        if strand == "-":
            seq = seq.reverse_complement()
        attr_dict = dict(kv.split("=") for kv in attrs.strip().split(";") if "=" in kv)
        gene_id = attr_dict.get("gene_id", attr_dict.get("ID", "unknown"))
        genes.append((gene_id, seq, strand))

print(f"Extracted {len(genes)} CDS features")
for gid, seq, strand in genes[:3]:
    print(f"  {gid} ({strand}): {len(seq)} bp — protein: {seq.translate(to_stop=True)[:8]}...")
```

### Recipe: Compute a Pairwise Identity Matrix from a Multiple Alignment

When to use: Quickly assess sequence diversity within an alignment file (FASTA, PHYLIP, Clustal) and identify outlier sequences.
```python
import numpy as np
from Bio import AlignIO

alignment = AlignIO.read("aligned_sequences.fasta", "fasta")
n = len(alignment)
names = [rec.id for rec in alignment]

# Build identity matrix over columns where neither sequence has a gap
identity_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i == j:
            identity_matrix[i, j] = 1.0
            continue
        pairs = list(zip(str(alignment[i].seq), str(alignment[j].seq)))
        matches = sum(a == b and a != "-" for a, b in pairs)
        aligned_cols = sum(a != "-" and b != "-" for a, b in pairs)
        identity_matrix[i, j] = matches / aligned_cols if aligned_cols else 0.0

print("Pairwise identity matrix:")
print(f"{'':20s} " + " ".join(f"{name[:8]:>8s}" for name in names))
for i, name in enumerate(names):
    row = " ".join(f"{identity_matrix[i, j]*100:8.1f}" for j in range(n))
    print(f"{name[:20]:20s} {row}")

# Find most divergent pair (mask zero entries so they are not selected)
masked = np.where(identity_matrix > 0, identity_matrix, 1)
idx = np.unravel_index(np.argmin(masked), identity_matrix.shape)
print(f"\nMost divergent pair: {names[idx[0]]} vs {names[idx[1]]} ({masked[idx]*100:.1f}%)")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `urllib.error.HTTPError: 429 Too Many Requests` | Exceeding NCBI rate limit (3 req/s) | Add `time.sleep(0.4)` between calls, or register a free API key at https://www.ncbi.nlm.nih.gov/account/ for 10 req/s |
| `RuntimeError: Too many requests were made without pausing` | NCBI detects rapid-fire requests | Set `Entrez.email` (required), reduce loop frequency, batch IDs into comma-joined strings in a single `efetch` call |
| `Bio.Application.ApplicationError: blastall not found` | Legacy `blastall` replaced by BLAST+ | Replace `NcbiblastpCommandline` with `subprocess.run(["blastp", ...])` and BLAST+ tools |
| `AttributeError: 'PairwiseAligner' object has no attribute 'align'` | Biopython < 1.78 | Upgrade: `pip install --upgrade biopython` (`PairwiseAligner` requires ≥ 1.72; stable from 1.78) |
| `Bio.pairwise2` deprecation warning | Using the legacy pairwise2 API | Replace with `from Bio.Align import PairwiseAligner` (`pairwise2` is deprecated and slated for removal) |
| `ValueError: Sequence contains letters not in the alphabet` | Non-standard characters in the sequence before `translate()` | Strip or replace non-standard characters; `translate()` reports untranslatable ambiguous codons as `X` |
| Empty BLAST results / `StopIteration` from `NCBIXML.read()` | Empty or malformed XML; network timeout; query too short | Check query sequence length (>10 aa recommended); switch from `.read()` to `.parse()` and check `blast_record.alignments` length |
| `Phylo.draw()` hangs or produces blank plot | Missing matplotlib backend | Call `import matplotlib; matplotlib.use("Agg")` before importing `pyplot` in headless environments |
| `TreeConstruction` distance matrix dimension mismatch | Alignment contains duplicate IDs | Deduplicate SeqRecord IDs before constructing the alignment: `{r.id: r for r in records}.values()` |

## Related Skills

- **biopython-molecular-biology** — restriction digestion, PCR primer design, protein structure analysis (Bio.PDB), molecular weight and Tm calculations; complement to this skill
- **pysam-genomic-files** — SAM/BAM/CRAM alignment file access, pileup, and region queries; use when working with read alignments
- **scikit-bio** — ecological diversity statistics (UniFrac, Bray-Curtis), ordination (PCoA), and microbiome analysis; more specialized than Biopython's phylogenetics for diversity analyses
- **gget-genomic-databases** — one-liner gene lookups, BLAST, and database queries without setting up an Entrez pipeline
- **etetoolkit** — advanced phylogenetic tree visualization, annotation with NCBI taxonomy, and publication-quality tree rendering

## References

- [Biopython Tutorial and Cookbook](https://biopython.org/DIST/docs/tutorial/Tutorial.html) — comprehensive official tutorial (Chapters 2, 5, 6, 7, 9, 11)
- [Biopython API Reference](https://biopython.org/docs/latest/api/) — complete module and class documentation
- [GitHub: biopython/biopython](https://github.com/biopython/biopython) — source, changelog, and issue tracker
- [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25499/) — Entrez API reference for esearch, efetch, elink, esummary
- [PairwiseAligner Migration Guide](https://biopython.org/docs/latest/api/Bio.Align.html) — replacing the deprecated `Bio.pairwise2` module