--- name: scikit-bio description: Biological data toolkit. Sequence analysis, alignments, phylogenetic trees, diversity metrics (alpha/beta, UniFrac), ordination (PCoA), PERMANOVA, FASTA/Newick I/O, for microbiome analysis. license: BSD-3-Clause license allowed-tools: Read Write Edit Bash compatibility: Requires Python 3.10+ and scikit-bio 0.7+ (uv pip install scikit-bio). NumPy 2.0+ is required. Optional matplotlib/seaborn/plotly for plotting; biom-format for BIOM tables; polars/anndata for table interoperability. metadata: version: "1.1" skill-author: K-Dense Inc. --- # scikit-bio ## Overview scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics. ## When to Use This Skill This skill should be used when the user: - Works with biological sequences (DNA, RNA, protein) - Needs to read/write biological file formats (FASTA, FASTQ, GenBank, Newick, BIOM, etc.) - Performs sequence alignments or searches for motifs - Constructs or analyzes phylogenetic trees - Calculates diversity metrics (alpha/beta diversity, UniFrac distances) - Performs ordination analysis (PCoA, CCA, RDA) - Runs statistical tests on biological/ecological data (PERMANOVA, ANOSIM, Mantel) - Analyzes microbiome or community ecology data - Works with protein embeddings from language models - Needs to manipulate biological data tables ## Core Capabilities ### 1. Sequence Manipulation Work with biological sequences using specialized classes for DNA, RNA, and protein data. **Key operations:** - Read/write sequences from FASTA, FASTQ, GenBank, EMBL formats - Sequence slicing, concatenation, and searching - Reverse complement, transcription (DNA→RNA), and translation (RNA→protein) - Find motifs and patterns using regex - Calculate distances (Hamming, k-mer based) - Handle sequence quality scores and metadata **Common patterns:** ```python import skbio # Read sequences from file seq = skbio.DNA.read('input.fasta') # Sequence operations rc = seq.reverse_complement() rna = seq.transcribe() protein = rna.translate() # Find motifs motif_positions = seq.find_with_regex('ATG[ACGT]{3}') # Check for properties has_degens = seq.has_degenerates() seq_no_gaps = seq.degap() ``` **Important notes:** - Use `DNA`, `RNA`, `Protein` classes for grammared sequences with validation - Use `Sequence` class for generic sequences without alphabet restrictions - Quality scores automatically loaded from FASTQ files into positional metadata - Metadata types: sequence-level (ID, description), positional (per-base), interval (regions/features) ### 2. Sequence Alignment Perform pairwise and multiple sequence alignments using the `pair_align` engine (introduced in scikit-bio 0.7.0), a versatile and efficient dynamic-programming aligner. **Key capabilities:** - Global, local, and semi-global alignment (free ends configurable) in one function - Convenience wrappers `pair_align_nucl` (BLASTN-like) and `pair_align_prot` (BLASTP-like) - Configurable scoring: match/mismatch tuple or named substitution matrix; linear or affine gap penalties - `PairAlignPath` results carry CIGAR strings and convert to aligned sequences - Multiple sequence alignment storage and manipulation with `TabularMSA` **Common patterns:** ```python from skbio import DNA, Protein from skbio.alignment import pair_align_nucl, pair_align_prot, pair_align, TabularMSA # Nucleotide alignment with BLASTN-like defaults seq1, seq2 = DNA('ACTACCAGATTACTTACGGATCAGG'), DNA('CGAAACTACTAGATTACGGATCTTA') aln = pair_align_nucl(seq1, seq2) aln.score # alignment score (float) path = aln.paths[0] # PairAlignPath (repr shows CIGAR) aligned_seqs = path.to_aligned((seq1, seq2)) # list of gapped strings # Build a TabularMSA from the alignment path + original sequences msa = TabularMSA.from_path_seqs(path, (seq1, seq2)) # Customize the algorithm via pair_align (default mode='global') aln = pair_align(seq1, seq2, mode='local') # Smith-Waterman aln = pair_align(seq1, seq2, sub_score=(2, -3), gap_cost=(5, 2)) # affine gaps aln = pair_align(seq1, seq2, sub_score='NUC.4.4', gap_cost=3) # substitution matrix, linear gap # Protein alignment (BLASTP-like, BLOSUM62) aln = pair_align_prot(Protein('HEAGAWGHEE'), Protein('PAWHEAE')) # Read a multiple alignment from file and summarize msa = TabularMSA.read('alignment.fasta', constructor=DNA) consensus = msa.consensus() ``` **Important notes:** - `pair_align` replaces the removed SSW wrapper (`local_pairwise_align_ssw`, `StripedSmithWaterman`) and the deprecated pure-Python aligners (`global_pairwise_align`, `local_pairwise_align_nucleotide`, etc.) - The result is a `PairAlignResult` that also unpacks as `score, paths, matrices` (use `keep_matrices=True` to retain the DP matrix) - `sub_score` accepts a `(match, mismatch)` tuple or a matrix name (e.g., `'NUC.4.4'`, `'BLOSUM62'`); `gap_cost` accepts a single number (linear) or `(open, extend)` tuple (affine) - Parse external CIGAR strings with `PairAlignPath.from_cigar('1I8M2D5M2I')`; score an existing alignment with `align_score(...)` and build a distance matrix from an MSA with `align_dists(...)` ### 3. Phylogenetic Trees Construct, manipulate, and analyze phylogenetic trees representing evolutionary relationships. **Key capabilities:** - Tree construction from distance matrices (UPGMA/WPGMA, Neighbor Joining, GME, BME) - Tree rearrangement with nearest neighbor interchange (`nni`) - Tree manipulation (pruning, rerooting, traversal) - Distance calculations (patristic via `cophenet`, Robinson-Foulds via `compare_rfd`) - ASCII visualization - Newick format I/O **Common patterns:** ```python from skbio import TreeNode from skbio.tree import nj, upgma, gme, bme, rf_dists # Read tree from file tree = TreeNode.read('tree.nwk') # Construct tree from distance matrix tree = nj(distance_matrix) # Tree operations subtree = tree.shear(['taxon1', 'taxon2', 'taxon3']) tips = [node for node in tree.tips()] lca = tree.lca(['taxon1', 'taxon2']) # Calculate distances patristic_dist = tree.find('taxon1').distance(tree.find('taxon2')) cophenetic_dm = tree.cophenet() # patristic distance matrix among tips # Compare two trees (Robinson-Foulds) rf_distance = tree.compare_rfd(other_tree) # Pairwise RF distances among many trees -> DistanceMatrix rf_dm = rf_dists([tree, other_tree, third_tree]) ``` **Important notes:** - Use `nj()` for neighbor joining (classic phylogenetic method) - Use `upgma()` for UPGMA/WPGMA (assumes molecular clock) - GME and BME are highly scalable for large trees; refine topology with `nni()` - `cophenet()` (formerly `tip_tip_distances`) returns the patristic distance matrix; `compare_rfd()` is the Robinson-Foulds method (`compare_wrfd`/`compare_cophenet` for weighted/cophenetic variants) - `lca()` is the lowest common ancestor; `lowest_common_ancestor` remains as an alias - Trees can be rooted or unrooted; some metrics require specific rooting ### 4. Diversity Analysis Calculate alpha and beta diversity metrics for microbial ecology and community analysis. **Key capabilities:** - Alpha diversity: richness (`sobs`, `observed_features`, `chao1`, `ace`), Shannon, Simpson, Hill numbers (`hill`), Faith's PD (`faith_pd`), generalized PD (`phydiv`), Pielou's evenness - Beta diversity: Bray-Curtis, Jaccard, weighted/unweighted UniFrac, Euclidean distances - Phylogenetic diversity metrics (require tree input) - Rarefaction and subsampling - Integration with ordination and statistical tests **Common patterns:** ```python from skbio.diversity import alpha_diversity, beta_diversity # Alpha diversity (phylogenetic metrics take taxa= for tip-name mapping) alpha = alpha_diversity('shannon', counts_matrix, ids=sample_ids) faith_pd = alpha_diversity('faith_pd', counts_matrix, ids=sample_ids, tree=tree, taxa=feature_ids) # Beta diversity bc_dm = beta_diversity('braycurtis', counts_matrix, ids=sample_ids) unifrac_dm = beta_diversity('unweighted_unifrac', counts_matrix, ids=sample_ids, tree=tree, taxa=feature_ids) # Get available metrics from skbio.diversity import get_alpha_diversity_metrics print(get_alpha_diversity_metrics()) ``` **Important notes:** - Counts must be integers representing abundances, not relative frequencies - The phylogenetic-metric argument is `taxa=` (renamed from `otu_ids` in 0.6.0; the old name is a deprecated alias); `observed_otus` is now `observed_features` (or `sobs`) - `counts_matrix` may be any table-like input (NumPy array, pandas/polars DataFrame, BIOM `Table`, or AnnData) via the dispatch system - Phylogenetic metrics (Faith's PD, UniFrac) require tree and taxa-to-tip mapping - Use `partial_beta_diversity()` for specific sample pairs, or `block_beta_diversity()` for large block-decomposed calculations - Alpha diversity returns a `pandas.Series`, beta diversity returns a `DistanceMatrix` ### 5. Ordination Methods Reduce high-dimensional biological data to visualizable lower-dimensional spaces. **Key capabilities:** - PCoA (Principal Coordinate Analysis) from distance matrices - CA (Correspondence Analysis) for contingency tables - CCA (Canonical Correspondence Analysis) with environmental constraints - RDA (Redundancy Analysis) for linear relationships - Biplot projection for feature interpretation **Common patterns:** ```python from skbio.stats.ordination import pcoa, cca import skbio # PCoA from distance matrix (limit dimensions for large matrices) pcoa_results = pcoa(distance_matrix, dimensions=3) pc1 = pcoa_results.samples['PC1'] pc2 = pcoa_results.samples['PC2'] # Built-in scatter plot colored by a metadata column fig = pcoa_results.plot(sample_metadata, column='bodysite') # CCA with environmental variables cca_results = cca(species_matrix, environmental_matrix) # Save/load ordination results pcoa_results.write('ordination.txt') results = skbio.OrdinationResults.read('ordination.txt') ``` **Important notes:** - PCoA works with any distance/dissimilarity matrix; pass `dimensions` as an int (count) or a float in (0, 1] (fraction of cumulative variance to retain) - `OrdinationResults` exposes pandas-based attributes: `samples`, `features`, `eigvals`, `proportion_explained`, `biplot_scores`, `sample_constraints` - CCA reveals environmental drivers of community composition - `OrdinationResults.plot()` produces a matplotlib figure; results also integrate with seaborn/plotly ### 6. Statistical Testing Perform hypothesis tests specific to ecological and biological data. **Key capabilities:** - PERMANOVA: test group differences using distance matrices - ANOSIM: alternative test for group differences - PERMDISP: test homogeneity of group dispersions - Mantel test: correlation between distance matrices - Bioenv: find environmental variables correlated with distances - Differential abundance: `ancom`, `dirmult_ttest`, and `dirmult_lme` (longitudinal mixed-effects) in `skbio.stats.composition` **Common patterns:** ```python from skbio.stats.distance import permanova, anosim, mantel # Test if groups differ significantly permanova_results = permanova(distance_matrix, grouping, permutations=999) print(f"p-value: {permanova_results['p-value']}") # ANOSIM test anosim_results = anosim(distance_matrix, grouping, permutations=999) # Mantel test between two distance matrices mantel_results = mantel(dm1, dm2, method='pearson', permutations=999) print(f"Correlation: {mantel_results[0]}, p-value: {mantel_results[1]}") # Differential abundance on a feature table (raw counts recommended) from skbio.stats.composition import dirmult_ttest da = dirmult_ttest(counts_table, grouping, treatment='caseA', reference='control') ``` **Important notes:** - Permutation tests provide non-parametric significance testing - Use 999+ permutations for robust p-values - PERMANOVA sensitive to dispersion differences; pair with PERMDISP - Mantel tests assess matrix correlation (e.g., geographic vs genetic distance) - Supply differential-abundance tests with raw counts, not pre-normalized proportions, to preserve magnitude information ### 7. File I/O and Format Conversion Read and write 19+ biological file formats with automatic format detection. **Supported formats:** - Sequences: FASTA, FASTQ, GenBank, EMBL, QSeq - Alignments: Clustal, PHYLIP, Stockholm - Trees: Newick - Tables: BIOM (HDF5 and JSON) - Distances: delimited square matrices - Analysis: BLAST+6/7, GFF3, Ordination results - Metadata: TSV/CSV with validation **Common patterns:** ```python import skbio # Read with automatic format detection seq = skbio.DNA.read('file.fasta', format='fasta') tree = skbio.TreeNode.read('tree.nwk') # Write to file seq.write('output.fasta', format='fasta') # Generator for large files (memory efficient) for seq in skbio.io.read('large.fasta', format='fasta', constructor=skbio.DNA): process(seq) # Convert formats seqs = list(skbio.io.read('input.fastq', format='fastq', constructor=skbio.DNA)) skbio.io.write(seqs, format='fasta', into='output.fasta') ``` **Important notes:** - Use generators for large files to avoid memory issues - Format can be auto-detected when `into` parameter specified - Some objects can be written to multiple formats - Support for stdin/stdout piping with `verify=False` ### 8. Distance Matrices Create and manipulate distance/dissimilarity matrices with statistical methods. **Key capabilities:** - Store symmetric (`DistanceMatrix`, hollow diagonal) or general pairwise (`PairwiseMatrix`) data - ID-based indexing and slicing - Integration with diversity, ordination, and statistical tests - Read/write delimited text format **Common patterns:** ```python from skbio import DistanceMatrix import numpy as np # Create from array data = np.array([[0, 1, 2], [1, 0, 3], [2, 3, 0]]) dm = DistanceMatrix(data, ids=['A', 'B', 'C']) # Access distances dist_ab = dm['A', 'B'] row_a = dm['A'] # Read from file dm = DistanceMatrix.read('distances.txt') # Use in downstream analyses pcoa_results = pcoa(dm) permanova_results = permanova(dm, grouping) ``` **Important notes:** - `DistanceMatrix` enforces symmetry and a zero (hollow) diagonal; it is a subclass of `SymmetricMatrix` - `PairwiseMatrix` (renamed from `DissimilarityMatrix`, which is kept as a deprecated alias) allows general/asymmetric values - IDs enable integration with metadata and biological knowledge - Compatible with pandas, numpy, and scikit-learn ### 9. Biological Tables Work with feature tables (OTU/ASV tables) common in microbiome research. **Key capabilities:** - BIOM format I/O (HDF5 and JSON) via the native `Table` class - Table dispatch system (0.7.0+): functions accept any `table_like` input — BIOM `Table`, pandas/polars DataFrame, NumPy array, or AnnData — without explicit conversion - Data augmentation techniques (`phylomix`, `mixup`, `aitchison_mixup`, `compos_cutmix`) - Sample/feature filtering and normalization - Metadata integration **Common patterns:** ```python from skbio import Table from skbio.diversity import beta_diversity # Read BIOM table table = Table.read('table.biom') # Access data sample_ids = table.ids(axis='sample') feature_ids = table.ids(axis='observation') counts = table.matrix_data # Filter filtered = table.filter(sample_ids_to_keep, axis='sample') # Pass table-like objects directly to scikit-bio drivers (dispatch system) import pandas as pd df = pd.read_table('data.tsv', index_col=0) # samples x features bdiv = beta_diversity('braycurtis', df) # no manual conversion needed ``` **Important notes:** - BIOM tables are standard in QIIME 2 workflows - Rows typically represent samples, columns represent features (OTUs/ASVs) - Supports sparse and dense representations - With the dispatch system, functions return the same format as their input, or a user-specified output format ### 10. Protein Embeddings Work with protein language model embeddings for downstream analysis. **Key capabilities:** - Store embeddings from protein language models (ESM, ProtTrans, etc.) - Convert embeddings to distance matrices - Generate ordination objects for visualization - Export to numpy/pandas for ML workflows **Common patterns:** ```python from skbio.embedding import ProteinEmbedding, ProteinVector # Create embedding from array embedding = ProteinEmbedding(embedding_array, sequence_ids) # Convert to distance matrix for analysis dm = embedding.to_distances(metric='euclidean') # PCoA visualization of embedding space pcoa_results = embedding.to_ordination(metric='euclidean', method='pcoa') # Export for machine learning array = embedding.to_array() df = embedding.to_dataframe() ``` **Important notes:** - Embeddings bridge protein language models with traditional bioinformatics - Compatible with scikit-bio's distance/ordination/statistics ecosystem - SequenceEmbedding and ProteinEmbedding provide specialized functionality - Useful for sequence clustering, classification, and visualization ## Best Practices ### Installation ```bash uv pip install scikit-bio ``` Requires Python 3.10+ and NumPy 2.0+. Pre-compiled wheels are published for each release since 0.7.0, so most platforms install without a compiler. Conda users can instead run `conda install -c conda-forge scikit-bio`. ### Performance Considerations - Use generators for large sequence files to minimize memory usage - For massive phylogenetic trees, prefer GME or BME over NJ - Beta diversity calculations can be parallelized with `partial_beta_diversity()` - BIOM format (HDF5) more efficient than JSON for large tables ### Integration with Ecosystem - Sequences interoperate with Biopython via standard formats - Tables integrate with pandas, polars, and AnnData - Distance matrices compatible with scikit-learn - Ordination results visualizable with matplotlib/seaborn/plotly - Works seamlessly with QIIME 2 artifacts (BIOM, trees, distance matrices) ### Common Workflows 1. **Microbiome diversity analysis**: Read BIOM table → Calculate alpha/beta diversity → Ordination (PCoA) → Statistical testing (PERMANOVA) 2. **Phylogenetic analysis**: Read sequences → Align → Build distance matrix → Construct tree → Calculate phylogenetic distances 3. **Sequence processing**: Read FASTQ → Quality filter → Trim/clean → Find motifs → Translate → Write FASTA 4. **Comparative genomics**: Read sequences → Pairwise alignment → Calculate distances → Build tree → Analyze clades ## Reference Documentation For detailed API information, parameter specifications, and advanced usage examples, refer to `references/api_reference.md` which contains comprehensive documentation on: - Complete method signatures and parameters for all capabilities - Extended code examples for complex workflows - Troubleshooting common issues - Performance optimization tips - Integration patterns with other libraries ## Additional Resources - Official documentation: https://scikit.bio/docs/latest/ - GitHub repository: https://github.com/scikit-bio/scikit-bio - Changelog: https://github.com/scikit-bio/scikit-bio/blob/main/CHANGELOG.md - Reference paper: "scikit-bio: a fundamental Python library for biological omic data," *Nature Methods* (2025), https://www.nature.com/articles/s41592-025-02981-z - Forum support: https://forum.qiime2.org (scikit-bio is part of QIIME 2 ecosystem)