---
name: lamindb-data-management
description: "Open-source FAIR biology data framework. Version artifacts (AnnData, DataFrame, Zarr), track lineage, validate via ontologies (Bionty), query datasets. Integrates with Nextflow, Snakemake, W&B, scVI. For scRNA-seq use scanpy; for ontology lookups use bionty."
license: Apache-2.0
---

# LaminDB — Biological Data Management

## Overview

LaminDB is an open-source data framework for biology that makes data queryable, traceable, and FAIR (Findable, Accessible, Interoperable, Reusable). It combines a data lakehouse architecture, lineage tracking, biological ontology validation, and a unified Python API for managing biological datasets from raw files to annotated, curated artifacts.

## When to Use

- Managing and versioning biological datasets (scRNA-seq, spatial, flow cytometry, multi-modal)
- Tracking computational lineage (which code produced which data)
- Validating and curating data against biological ontologies (cell types, genes, tissues, diseases)
- Building queryable data lakehouses across multiple experiments
- Ensuring reproducibility with automatic environment and provenance capture
- Integrating with workflow managers (Nextflow, Snakemake) or MLOps tools (W&B, MLflow)
- Standardizing metadata with ontology-based annotation (Bionty)
- For **single-cell analysis pipelines** (clustering, DE), use scanpy instead
- For **ontology lookups only** without data management, use bionty directly

## Prerequisites

```bash
pip install lamindb

# With extras for specific data types
pip install 'lamindb[bionty,zarr,fcs]'
```

**Setup**: Requires instance initialization before use:

```bash
lamin login
lamin init --storage ./my-data --name my-project
# Or with cloud storage:
# lamin init --storage s3://my-bucket --name my-project --db postgresql://...
```

**Instance types**: Local SQLite (development), Cloud + SQLite (small teams), Cloud + PostgreSQL (production).

## Quick Start

```python
import lamindb as ln
import pandas as pd

ln.track()  # Start lineage tracking

# Save an artifact
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.95, 0.87]})
artifact = ln.Artifact.from_df(df, key="results/gene_scores.parquet",
                               description="Gene importance scores")
artifact.save()
print(f"Saved: {artifact.uid}, size: {artifact.size}")

# Query artifacts
results = ln.Artifact.filter(key__startswith="results/").df()
print(f"Found {len(results)} artifacts")

ln.finish()
```

## Core API

### 1. Artifacts — Data Objects

Artifacts are versioned data objects (files, DataFrames, AnnData, arrays).

```python
import lamindb as ln
import pandas as pd
import anndata as ad

ln.track()

# From DataFrame
df = pd.DataFrame({"sample": ["A", "B"], "value": [1.5, 2.3]})
artifact = ln.Artifact.from_df(df, key="experiments/batch1.parquet").save()
print(f"ID: {artifact.uid}, Version: {artifact.version}")

# From AnnData
adata = ad.read_h5ad("counts.h5ad")
artifact = ln.Artifact.from_anndata(adata, key="scrna/batch1.h5ad",
                                    description="scRNA-seq batch 1").save()

# From file path
artifact = ln.Artifact("results/figure.png", key="figures/fig1.png").save()

# Load back
df_loaded = artifact.load()  # Returns DataFrame/AnnData/etc.
path = artifact.cache()      # Returns local file path
```

```python
# Versioning (df_updated is a modified copy of the original DataFrame)
artifact_v2 = ln.Artifact.from_df(df_updated, key="experiments/batch1.parquet",
                                  revises=artifact).save()
print(f"v1: {artifact.uid}, v2: {artifact_v2.uid}")
print(f"Latest version: {artifact_v2.is_latest}")

# Delete (archive first, then permanent)
artifact.delete(permanent=False)   # Archive
# artifact.delete(permanent=True)  # Permanent deletion
```
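Beyond checking `is_latest`, all versions that share a lineage can be listed through the `versions` accessor. A minimal sketch, assuming the standard lamindb versioning accessors (the key reuses the example above):

```python
import lamindb as ln

artifact = ln.Artifact.get(key="experiments/batch1.parquet")

# Overview of all versions: uid, version, created_at, ...
print(artifact.versions.df())

# Pin the latest version explicitly before loading
latest = artifact.versions.filter(is_latest=True).one()
df = latest.load()
```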
### 2. Lineage Tracking

Automatic provenance capture for reproducibility.

```python
import lamindb as ln

# Start tracking — captures notebook/script, environment, user
ln.track(params={"method": "PCA", "n_components": 50})

# All artifacts created within this block are linked to this run
input_data = ln.Artifact.get(key="raw/counts.h5ad")
adata = input_data.load()
# ... analysis code ...
output = ln.Artifact.from_anndata(adata, key="processed/pca.h5ad").save()

# View lineage graph
output.view_lineage()

ln.finish()  # Finalize tracking
```

### 3. Querying and Filtering

Search and filter artifacts by metadata, features, and annotations.

```python
import lamindb as ln

# Basic filtering
artifacts = ln.Artifact.filter(key__startswith="scrna/").df()
print(f"Found {len(artifacts)} scRNA-seq artifacts")

# Filter by metadata
recent = ln.Artifact.filter(
    created_at__gte="2026-01-01",
    size__gt=1_000_000
).df()

# Filter by annotated features
immune = ln.Artifact.filter(
    cell_types__name="T cell",
    tissues__name="PBMC"
).df()

# Single-record retrieval
artifact = ln.Artifact.get(key="results/final.parquet")  # Exact match, raises if not found
artifact = ln.Artifact.filter(key="results/final.parquet").one_or_none()  # Returns None if missing

# Full-text search
results = ln.Artifact.search("gene expression PBMC")

# Stream large files (without loading fully into memory)
artifact = ln.Artifact.get(key="large_dataset.h5ad")
backed = artifact.open()  # AnnData-backed mode
subset = backed[backed.obs["cell_type"] == "B cell"]
```

### 4. Annotation and Validation

Curate datasets against schemas and ontology terms.

```python
import lamindb as ln
import bionty as bt

# Annotate artifacts with features
artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
artifact.features.add_values({
    "tissue": "PBMC",
    "condition": "treated",
    "organism": "human",
    "batch": 1
})

# Validate with a schema
# (assumes `adata` and a `schema` record defined per the lamindb curation guide)
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
    artifact = curator.save_artifact(key="validated/batch1.h5ad")
    print("Validation passed")
except ln.errors.ValidationError as e:
    print(f"Validation failed: {e}")

# Standardize cell type names using the ontology
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
```

### 5. Biological Ontologies (Bionty)

Access standardized biological vocabularies for annotation.

```python
import bionty as bt

# Available ontologies:
# bt.Gene (Ensembl), bt.Protein (UniProt), bt.CellType (CL),
# bt.Tissue (Uberon), bt.Disease (Mondo), bt.Pathway (GO),
# bt.CellLine (CLO), bt.Phenotype (HPO), bt.Organism (NCBITaxon)

# Import and search an ontology
bt.CellType.import_source()
results = bt.CellType.search("T helper")
print(results.df().head())

# Get a specific term
t_cell = bt.CellType.get(name="T cell")
print(f"Ontology ID: {t_cell.ontology_id}")

# Explore the hierarchy
children = t_cell.children.all()
parents = t_cell.parents.all()
print(f"Children: {[c.name for c in children]}")

# Validate a list of terms
validated = bt.CellType.validate(["T cell", "B cell", "Unknown_type"])
# Returns boolean array: [True, True, False]
```
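For interactive work, a lookup object gives tab-completion over registry records and avoids typos in free-text names. A minimal sketch, assuming the standard Bionty `lookup()` accessor (the pythonized attribute name `t_cell` is an assumption):

```python
import bionty as bt

# Build an auto-complete lookup over all CellType records
lookup = bt.CellType.lookup()

# Attribute access returns the matching registry record
t_cell = lookup.t_cell  # pythonized form of "T cell" (assumed naming)
print(t_cell.name, t_cell.ontology_id)
```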
### 6. Collections and Organization

Group related artifacts for batch operations.

```python
import lamindb as ln

# Create a collection
artifacts = ln.Artifact.filter(key__startswith="scrna/batch_").all()
collection = ln.Collection(artifacts, name="scRNA-seq batches Q1 2026").save()
print(f"Collection: {collection.name}, {collection.artifacts.count()} artifacts")

# Query the collection
for artifact in collection.artifacts.all():
    print(f"  {artifact.key}: {artifact.size} bytes")

# Organize with hierarchical keys
# Convention: project/experiment/datatype/file
# e.g., "immunology/exp42/scrna/counts.h5ad"
```

## Key Concepts

### Core Entity Model

| Entity | Purpose | Example |
|--------|---------|---------|
| **Artifact** | Versioned data object | `counts.h5ad`, `results.parquet` |
| **Run** | Single code execution | Notebook run, script execution |
| **Transform** | Code definition (notebook, script, pipeline) | `analysis.ipynb` |
| **Feature** | Typed metadata field | `tissue`, `condition`, `batch` |
| **Collection** | Group of related artifacts | "Experiment batches" |
| **ULabel** | Universal label for custom categorization | "high_quality", "pilot" |
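These entities link together: a `Transform` is executed as `Run`s, and each run connects input artifacts to output artifacts. A minimal sketch of walking those links (accessor names such as `runs` and `output_artifacts` mirror the lineage recipe below but are assumptions for your lamindb version):

```python
import lamindb as ln

# Find a transform (a notebook or script) by name
transform = ln.Transform.filter(name__contains="analysis").first()

# Each execution of that code is a Run
for run in transform.runs.all():
    print(run.created_at, run.created_by.name)
    # Artifacts produced by this run
    for output in run.output_artifacts.all():
        print("  ->", output.key)
```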
### Data Types Supported

| Format | Method | Use Case |
|--------|--------|----------|
| DataFrame | `Artifact.from_df()` | Tabular data, metadata tables |
| AnnData | `Artifact.from_anndata()` | Single-cell data |
| MuData | `Artifact.from_mudata()` | Multi-modal data |
| Any file | `Artifact("path")` | Images, FASTQ, custom formats |
| Zarr | Via `zarr` extra | Large array data |
| TileDB-SOMA | Via `tiledbsoma` extra | Scalable cell-level queries |

### track() / finish() Pattern

Every analysis session should be wrapped:

```python
ln.track(params={"key": "value"})  # Start: captures code, environment, user
# ... analysis ...
ln.finish()  # End: finalizes lineage links
```

## Common Workflows

### Workflow: Multi-Experiment Data Lakehouse

```python
import lamindb as ln
import anndata as ad

ln.track()

# Register multiple experiments
data_files = ["batch1.h5ad", "batch2.h5ad", "batch3.h5ad"]
tissues = ["PBMC", "bone_marrow", "PBMC"]
conditions = ["control", "treated", "treated"]

for i, (file, tissue, condition) in enumerate(zip(data_files, tissues, conditions)):
    adata = ad.read_h5ad(file)
    artifact = ln.Artifact.from_anndata(
        adata,
        key=f"scrna/batch_{i}.h5ad",
        description=f"scRNA-seq batch {i}"
    ).save()
    artifact.features.add_values({
        "tissue": tissue,
        "condition": condition,
        "batch": i
    })
    print(f"Registered batch {i}: {artifact.uid}")

# Query across all experiments
treated_pbmc = ln.Artifact.filter(
    key__startswith="scrna/",
    features__tissue="PBMC",
    features__condition="treated"
).all()
print(f"Found {len(treated_pbmc)} matching datasets")

# Load and concatenate
adatas = [a.load() for a in treated_pbmc]
combined = ad.concat(adatas)
print(f"Combined: {combined.shape}")

ln.finish()
```

### Workflow: Validated Data Curation

```python
import lamindb as ln
import bionty as bt
import anndata as ad

ln.track()

# 1. Import ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# 2. Load raw data
adata = ad.read_h5ad("raw_counts.h5ad")
print(f"Raw: {adata.shape}")

# 3. Validate and standardize cell types
validated = bt.CellType.validate(adata.obs["cell_type"].unique())
if not all(validated):
    adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# 4. Validate gene names
gene_validated = bt.Gene.validate(adata.var_names)
print(f"Valid genes: {sum(gene_validated)}/{len(gene_validated)}")

# 5. Curate and save
# (assumes a `schema` record defined per the lamindb curation guide)
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="curated/validated_counts.h5ad")
print(f"Saved curated artifact: {artifact.uid}")

ln.finish()
```

### Workflow: Nextflow Pipeline Integration

1. In each Nextflow process, import lamindb and call `ln.track()`
2. Load input artifacts with `ln.Artifact.get(key=...)`; cache to a local path
3. Run the analysis; save the output as a new artifact with `ln.Artifact(...).save()`
4. Call `ln.finish()` — lineage automatically links inputs to outputs (see the sketch below)
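A minimal per-process sketch of steps 1–4 (the keys, filenames, and the scanpy QC call are placeholders; in a real pipeline this script would be invoked from a Nextflow `script:` block):

```python
# process_qc.py: invoked by a Nextflow process (keys/filenames are placeholders)
import lamindb as ln
import scanpy as sc

ln.track()  # 1. register this execution as a Run

# 2. Load the input artifact registered by an upstream process
input_artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
adata = input_artifact.load()

# 3. Run the analysis step and save the output as a new artifact
sc.pp.filter_cells(adata, min_genes=200)
ln.Artifact.from_anndata(adata, key="scrna/batch1_qc.h5ad").save()

ln.finish()  # 4. finalize; lineage links input -> run -> output
```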
## Key Parameters

| Parameter | Applies to | Default | Options | Effect |
|-----------|-----------|---------|---------|--------|
| `key` | `Artifact()` | None | String path | Hierarchical storage key (e.g., "project/data.h5ad") |
| `description` | `Artifact()` | None | String | Human-readable description |
| `revises` | `Artifact()` | None | Artifact | Previous version to revise |
| `params` | `ln.track()` | None | Dict | Parameters for the current run |
| `organism` | `bt.Gene.import_source()` | None | "human", "mouse" | Organism for the ontology |
| `permanent` | `.delete()` | False | True/False | Permanent vs. archive deletion |
| `__startswith` | `.filter()` | — | String | Key prefix filter |
| `__gte`, `__lte` | `.filter()` | — | Value | Greater/less than or equal |
| `__contains` | `.filter()` | — | String | Substring match |

## Best Practices

1. **Always wrap analysis with `ln.track()` / `ln.finish()`**: This captures lineage automatically. Without it, artifacts have no provenance.
2. **Use hierarchical keys**: Structure as `project/experiment/datatype/file.ext` (e.g., `immunology/exp42/scrna/counts.h5ad`). This enables prefix-based queries.
3. **Anti-pattern — duplicating data instead of versioning**: Use the `revises=` parameter to create new versions, not new keys for the same dataset.
4. **Validate early**: Run schema validation before analysis. Catching bad metadata early saves debugging time downstream.
5. **Use ontologies for standardization**: Map free-text labels to ontology terms (e.g., "T helper cell" → CL:0000912). This enables cross-dataset queries.
6. **Anti-pattern — loading large files without checking size**: Use `.filter().df()` to inspect metadata first, then `.load()` or `.open()` (backed mode) for large files.
7. **Query metadata first, load data second**: Filter with `.filter()` to find relevant artifacts, then load only what you need.

## Common Recipes

### Recipe: Bulk Dataset Registration

```python
import lamindb as ln
from pathlib import Path

ln.track()

data_dir = Path("raw_data/")
for fcs_file in data_dir.glob("*.fcs"):
    artifact = ln.Artifact(str(fcs_file), key=f"flow_cytometry/{fcs_file.name}").save()
    artifact.features.add_values({"assay": "flow_cytometry", "source": "batch_import"})
    print(f"Registered: {fcs_file.name} -> {artifact.uid}")

ln.finish()
```

### Recipe: View and Export Lineage

```python
import lamindb as ln

artifact = ln.Artifact.get(key="results/final_analysis.h5ad")

# View the lineage graph (renders in notebooks)
artifact.view_lineage()

# Programmatic lineage access
run = artifact.run
print(f"Created by: {run.transform.name}")
print(f"User: {run.created_by.name}")
print(f"Date: {run.created_at}")
print(f"Input artifacts: {[a.key for a in run.input_artifacts.all()]}")
```

### Recipe: Ontology Hierarchy Exploration

```python
import bionty as bt

bt.CellType.import_source()
t_cell = bt.CellType.get(name="T cell")

# Explore the immediate hierarchy
print(f"Parents: {[p.name for p in t_cell.parents.all()]}")
print(f"Children: {[c.name for c in t_cell.children.all()]}")

# Walk two levels down the hierarchy (children and grandchildren)
for child in t_cell.children.all():
    grandchildren = child.children.all()
    print(f"  {child.name}: {[gc.name for gc in grandchildren]}")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `InstanceNotSetupError` | Instance not initialized | Run `lamin init --storage ./data --name my-project` |
| `ln.track()` fails | No transform context | Run inside a notebook/script, not a REPL; or pass `transform` explicitly |
| Artifact `key` conflict | Key already exists (not a version) | Use `revises=` for versioning, or choose a different key |
| `ValidationError` | Data doesn't match the schema | Run `curator.validate()` to see the specific failures; standardize terms |
| Slow queries on large instances | No index on the filtered field | Use `.df()` for an overview first; add database indexes for frequently filtered fields |
| Ontology import fails | Network issue or wrong organism | Check the internet connection; specify `organism="human"` explicitly |
| `FileNotFoundError` on `.cache()` | Cloud artifact not synced | Check storage connectivity; use `artifact.load()` for in-memory access instead |

## Related Skills

- **anndata-annotated-data** — AnnData format used as the primary data container in LaminDB for single-cell data
- **scanpy-scrna-seq** — single-cell analysis pipeline; LaminDB manages the data that scanpy analyzes
- **scvi-tools-single-cell** — deep learning models for single-cell data; integrates with LaminDB for data/model tracking

## References

- [LaminDB documentation](https://docs.lamin.ai) — official user guide and API reference
- [LaminDB tutorial](https://docs.lamin.ai/tutorial) — step-by-step introduction
- [Bionty documentation](https://docs.lamin.ai/bionty) — biological ontology management
- [LaminDB GitHub](https://github.com/laminlabs/lamindb) — source code and issues