---
name: anndata
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---

# AnnData

## Overview

AnnData is a Python package for handling annotated data matrices. It stores experimental measurements (`X`) alongside observation metadata (`obs`), variable metadata (`var`), and multi-dimensional annotations (`obsm`, `varm`, `obsp`, `varp`, `uns`). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

## When to Use This Skill

Use this skill when:
- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

## Installation

```bash
pip install anndata

# With optional dependencies
pip install anndata[dev,test,doc]
```

## Quick Start

### Creating an AnnData object

```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)
```

### Reading data

```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read with backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')

# 10X HDF5 files are read via scanpy, not anndata itself
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```

### Writing data

```python
# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```

### Basic operations

```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```

## Core Capabilities

### 1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

**See**: `references/data_structure.md` for comprehensive information on:
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
- Creating AnnData objects from various sources
- Accessing and manipulating data components
- Memory-efficient practices
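Common commands (a minimal sketch; the layer name `counts` and the `uns` key are illustrative, and `X_pca` is only the key scanpy uses by convention):

```python
import numpy as np

# Alternative matrices with the same dimensions live in layers
adata.layers['counts'] = adata.X.copy()

# Per-observation arrays (e.g. embeddings) live in obsm
adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 50)

# Unstructured metadata lives in uns
adata.uns['processing'] = {'normalized': False}

# raw preserves a snapshot of the current X and var
adata.raw = adata
```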
### 2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

**See**: `references/io_operations.md` for details on:
- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion
- Performance optimization

Common commands:

```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format
adata = ad.read_mtx('matrix.mtx').T
```

### 3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

**See**: `references/concatenation.md` for comprehensive coverage of:
- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only)
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets

Common commands:

```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation over backed AnnData objects
# (AnnCollection takes AnnData objects, not file paths)
from anndata.experimental import AnnCollection

adatas = [ad.read_h5ad(f, backed='r') for f in ['data1.h5ad', 'data2.h5ad']]
collection = AnnCollection(adatas, join_obs='outer', label='dataset')
```

### 4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

**See**: `references/manipulation.md` for detailed guidance on:
- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories)
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering (see the sketch after this section)

Common commands:

```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]         # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```
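The quality-control filtering mentioned above combines boolean masks over `obs` columns. A minimal sketch, assuming per-cell QC columns named `n_genes` and `pct_mito` have already been computed:

```python
# Combine per-cell QC conditions into a single boolean mask
qc_pass = (adata.obs['n_genes'] > 200) & (adata.obs['pct_mito'] < 0.1)

# Masked subsetting returns a view; copy() makes it independent
adata_qc = adata[qc_pass].copy()
print(f"Kept {adata_qc.n_obs} of {adata.n_obs} cells")
```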
### 5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

**See**: `references/best_practices.md` for guidelines on:
- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions

Key recommendations:

```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```

## Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

### Scanpy (Single-cell analysis)

```python
import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```

### Muon (Multimodal data)

```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```

### PyTorch integration

```python
from anndata.experimental import AnnLoader

# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X
    # Train model
```

## Common Workflows

### Single-cell RNA-seq analysis

```python
import anndata as ad
import numpy as np
import scanpy as sc

# 1. Load data (10X HDF5 files are read via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control (flatten sparse row sums to 1-D arrays)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000].copy()  # materialize the view

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```
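Because step 3 stores `adata.raw` before step 4 subsets to highly variable genes, the full gene set stays recoverable. A minimal sketch of that round-trip, assuming the workflow above has run:

```python
# The filtered object keeps only highly variable genes...
print(adata.n_vars)      # e.g. 2000 after step 4

# ...while raw still holds every measured gene
print(adata.raw.n_vars)

# Rebuild a full AnnData from the stored snapshot
adata_full = adata.raw.to_adata()
```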
### Batch integration

```python
# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```

### Working with large datasets

```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load the filtered subset into memory
adata_subset = high_quality.to_memory()

# Process the subset (process() is a placeholder for your own function)
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)
```

## Troubleshooting

### Out of memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```

### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

### Index alignment issues

Always align external data on the index:

```python
# Wrong: relies on row order matching by accident
adata.obs['new_col'] = external_data['values']

# Correct: align on cell identifiers
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

## Additional Resources

- **Official documentation**: https://anndata.readthedocs.io/
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
- **Scverse ecosystem**: https://scverse.org/
- **GitHub repository**: https://github.com/scverse/anndata