--- name: scikit-bio description: Python bioinformatics library for sequence manipulation, alignments, phylogenetics, diversity metrics (Shannon, UniFrac), ordination (PCoA, CCA), statistical tests (PERMANOVA, Mantel), and biological file format I/O. --- # scikit-bio A Python library for biological data analysis spanning sequence handling, phylogenetics, microbial ecology, and multivariate statistics. ## When to Apply Use this skill when users need to: | Task Category | Examples | |--------------|----------| | Sequence work | DNA/RNA/protein manipulation, motif finding, translation | | File handling | FASTA, FASTQ, GenBank, Newick, BIOM I/O | | Alignments | Pairwise or multiple sequence alignment | | Phylogenetics | Tree construction, manipulation, distance calculations | | Diversity metrics | Alpha diversity (Shannon, Faith's PD), beta diversity (Bray-Curtis, UniFrac) | | Ordination | PCoA, CCA, RDA for dimensionality reduction | | Statistical tests | PERMANOVA, ANOSIM, Mantel tests | | Microbiome analysis | Feature tables, rarefaction, community comparisons | ## Installation ```bash uv pip install scikit-bio ``` ## Sequences Work with biological sequences through specialized `DNA`, `RNA`, and `Protein` classes. ```python import skbio # Load from file seq = skbio.DNA.read('gene.fasta') # Common operations complement = seq.reverse_complement() messenger = seq.transcribe() peptide = messenger.translate() # Pattern search hits = seq.find_with_regex('ATG[ACGT]{6}TAA') # Properties contains_ambiguous = seq.has_degenerates() clean_seq = seq.degap() ``` **Metadata types:** - Sequence-level: ID, description, source organism - Positional: Per-base quality scores (from FASTQ) - Interval: Feature annotations, gene boundaries ## Sequence Alignment Pairwise and multiple alignment using dynamic programming. ```python from skbio.alignment import local_pairwise_align_ssw, TabularMSA # Local alignment (Smith-Waterman) result = local_pairwise_align_ssw(query_seq, target_seq) # Load existing alignment alignment = TabularMSA.read('msa.fasta', constructor=skbio.DNA) # Derive consensus consensus_seq = alignment.consensus() ``` **Notes:** - `local_pairwise_align_ssw` provides fast SSW-based local alignment - `StripedSmithWaterman` handles protein sequences with substitution matrices - Affine gap penalties suit biological sequences best ## Phylogenetic Trees Construct and analyze evolutionary trees. ```python from skbio import TreeNode from skbio.tree import nj, upgma # Build from distances phylogeny = nj(distance_matrix) # Load existing tree phylogeny = TreeNode.read('species.nwk') # Extract subset clade = phylogeny.shear(['mouse', 'rat', 'human']) # Enumerate leaf nodes leaves = list(phylogeny.tips()) # Common ancestor ancestor = phylogeny.lowest_common_ancestor(['mouse', 'rat']) # Branch length between taxa branch_dist = phylogeny.find('mouse').distance(phylogeny.find('rat')) # Pairwise distances for all tips pairwise_dm = phylogeny.cophenetic_matrix() # Topology comparison rf_diff = phylogeny.robinson_foulds(other_tree) ``` **Tree construction methods:** | Method | Use case | |--------|----------| | `nj()` | Standard neighbor-joining | | `upgma()` | Assumes molecular clock | | `bme()` | Scalable for large datasets | ## Diversity Analysis Calculate ecological diversity metrics. ### Alpha Diversity (within-sample) ```python from skbio.diversity import alpha_diversity # Sample abundance matrix abundances = np.array([ [45, 12, 0, 8], [5, 0, 33, 17], [20, 20, 15, 10] ]) samples = ['gut_1', 'gut_2', 'gut_3'] # Richness and evenness metrics shannon_vals = alpha_diversity('shannon', abundances, ids=samples) simpson_vals = alpha_diversity('simpson', abundances, ids=samples) # Phylogenetic diversity (requires tree) faith_vals = alpha_diversity('faith_pd', abundances, ids=samples, tree=phylogeny, otu_ids=feature_names) ``` ### Beta Diversity (between-sample) ```python from skbio.diversity import beta_diversity # Distance matrices bray_dm = beta_diversity('braycurtis', abundances, ids=samples) unifrac_dm = beta_diversity('weighted_unifrac', abundances, ids=samples, tree=phylogeny, otu_ids=feature_names) ``` **Key points:** - Input must be integer counts, not proportions - Phylogenetic metrics require a tree matching feature IDs - `partial_beta_diversity()` computes specific sample pairs efficiently ## Ordination Project high-dimensional data to visualizable spaces. ```python from skbio.stats.ordination import pcoa, cca # PCoA from distance matrix coords = pcoa(bray_dm) axis1 = coords.samples['PC1'] axis2 = coords.samples['PC2'] variance_explained = coords.proportion_explained # CCA with environmental predictors constrained = cca(species_abundances, environmental_vars) ``` **Methods:** | Function | Input | Purpose | |----------|-------|---------| | `pcoa()` | Distance matrix | Unconstrained ordination | | `cca()` | Abundance + environment | Constrained ordination (unimodal) | | `rda()` | Abundance + environment | Constrained ordination (linear) | ## Statistical Tests Hypothesis testing for ecological data. ```python from skbio.stats.distance import permanova, anosim, mantel # Group comparison treatment_groups = ['control', 'control', 'treated', 'treated'] perm_result = permanova(bray_dm, treatment_groups, permutations=999) print(f"F = {perm_result['test statistic']:.3f}, p = {perm_result['p-value']:.4f}") # Alternative group test anos_result = anosim(bray_dm, treatment_groups, permutations=999) # Matrix correlation r, pval, n = mantel(genetic_dm, geographic_dm, method='spearman', permutations=999) print(f"r = {r:.3f}, p = {pval:.4f}") ``` **Test overview:** | Test | Purpose | Key output | |------|---------|------------| | PERMANOVA | Group differences | F-statistic, p-value | | ANOSIM | Group differences (alternative) | R-statistic, p-value | | PERMDISP | Dispersion homogeneity | Tests PERMANOVA assumption | | Mantel | Matrix correlation | Correlation coefficient, p-value | ## File I/O Read and write 19+ biological formats. ```python import skbio # Automatic format detection tree = skbio.TreeNode.read('phylogeny.nwk') # Memory-efficient iteration for record in skbio.io.read('reads.fastq', format='fastq', constructor=skbio.DNA): if record.positional_metadata['quality'].mean() > 30: process(record) # Format conversion records = skbio.io.read('sequences.fastq', format='fastq', constructor=skbio.DNA) skbio.io.write(records, format='fasta', into='sequences.fasta') ``` **Supported formats:** | Category | Formats | |----------|---------| | Sequences | FASTA, FASTQ, GenBank, EMBL, QSeq | | Alignments | Clustal, PHYLIP, Stockholm | | Trees | Newick | | Tables | BIOM (HDF5/JSON) | | Distances | Delimited matrices | ## Distance Matrices Store and manipulate pairwise distances. ```python from skbio import DistanceMatrix import numpy as np # Create from array distances = np.array([ [0.0, 0.3, 0.7], [0.3, 0.0, 0.5], [0.7, 0.5, 0.0] ]) dm = DistanceMatrix(distances, ids=['sp_A', 'sp_B', 'sp_C']) # Access elements pair_dist = dm['sp_A', 'sp_B'] all_from_a = dm['sp_A'] # Subset subset_dm = dm.filter(['sp_A', 'sp_C']) ``` ## Feature Tables (BIOM) Handle OTU/ASV abundance tables. ```python from skbio import Table # Load table tbl = Table.read('features.biom') # Inspect structure sample_names = tbl.ids(axis='sample') feature_names = tbl.ids(axis='observation') # Filter by abundance filtered = tbl.filter(lambda row, id_, md: row.sum() > 500, axis='sample') # Convert to pandas df = tbl.to_dataframe() ``` ## Protein Embeddings Bridge language model outputs with scikit-bio analysis. ```python from skbio.embedding import ProteinEmbedding # Load embeddings (from ESM, ProtTrans, etc.) emb = ProteinEmbedding(embedding_matrix, protein_ids) # Create distance matrix for downstream analysis emb_dm = emb.to_distances(metric='cosine') # Ordination visualization emb_pcoa = emb.to_ordination(metric='euclidean', method='pcoa') ``` ## Typical Workflows **Microbiome diversity study:** 1. Load BIOM table and phylogenetic tree 2. Calculate alpha diversity per sample 3. Compute beta diversity (UniFrac) 4. Ordinate with PCoA 5. Test group differences with PERMANOVA **Phylogenetic inference:** 1. Read sequences from FASTA 2. Perform multiple alignment 3. Calculate pairwise distances 4. Construct tree with neighbor-joining 5. Analyze clade relationships **Sequence processing:** 1. Read FASTQ with quality scores 2. Filter low-quality reads 3. Search for motifs 4. Translate to protein 5. Export as FASTA ## Performance Tips - Use generators for large sequence files - Prefer BIOM HDF5 over JSON for big tables - Apply `partial_beta_diversity()` when computing only specific pairs - Choose BME for very large phylogenies ## Ecosystem Integration | Library | Integration | |---------|-------------| | pandas | DataFrames from distance matrices, diversity results | | numpy | Array conversions throughout | | matplotlib/seaborn | Plot ordination results, heatmaps | | scikit-learn | Distance matrices as input | | QIIME 2 | Native BIOM, tree, distance matrix compatibility | ## Reference Files | File | Contents | |------|----------| | [references/api-reference.md](references/api-reference.md) | Complete method signatures, parameters, extended examples, and troubleshooting |