---
name: zarr-python
description: "Chunked N-D arrays with compression and cloud storage. NumPy-style indexing. Backends: local, S3, GCS, ZIP, memory. Dask/Xarray integration for parallel and labeled computation. For lineage use lamindb; for labeled arrays use xarray."
license: MIT
---

# Zarr Python: Chunked N-D Arrays

## Overview

Zarr is a Python library for storing large N-dimensional arrays with chunking, compression, and parallel I/O. It provides NumPy-compatible indexing with pluggable storage backends (local, cloud, in-memory), which has made it a widely used format for cloud-native scientific data pipelines.

## When to Use

- Storing arrays too large for memory with chunked access (out-of-core computing)
- Cloud-native data workflows with S3 or GCS storage backends
- Parallel read/write with Dask for large-scale computation
- Hierarchical data organization (groups of named arrays with metadata)
- Converting between formats (HDF5 → Zarr, NetCDF → Zarr)
- Appending time-series data incrementally without rewriting
- For **labeled, coordinate-aware arrays** (time, lat, lon), use xarray with the Zarr backend instead
- For **data management, lineage, and ontology validation**, use lamindb (which uses Zarr as a storage format)

## Prerequisites

```bash
pip install zarr

# Cloud storage support
pip install s3fs   # Amazon S3
pip install gcsfs  # Google Cloud Storage
```

Requires Python 3.11+.

## Quick Start

```python
import zarr
import numpy as np

# Create a chunked, compressed 2D array
z = zarr.create_array(
    store="data/my_array.zarr",
    shape=(10000, 10000),
    chunks=(1000, 1000),
    dtype="f4",
)

# Write with NumPy-style indexing
z[:, :] = np.random.random((10000, 10000)).astype("f4")

# Read a slice (only reads needed chunks)
subset = z[0:100, 0:100]
print(f"Shape: {subset.shape}, dtype: {subset.dtype}")
# Shape: (100, 100), dtype: float32
```

## Core API

### 1. Array Creation

```python
import zarr
import numpy as np

# Pre-filled arrays
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000), dtype="f4", store="data.zarr")
z = zarr.ones((5000, 5000), chunks=(500, 500), dtype="f4")
z = zarr.full((1000, 1000), fill_value=42, chunks=(100, 100), dtype="i4")

# From existing NumPy data
data = np.arange(10000, dtype="f4").reshape(100, 100)
z = zarr.array(data, chunks=(10, 10), store="from_numpy.zarr")
print(f"Created: shape={z.shape}, chunks={z.chunks}, dtype={z.dtype}")

# Create like another array (matches shape, chunks, dtype)
z2 = zarr.zeros_like(z)
```

```python
import zarr

# Open existing array
z = zarr.open_array("data.zarr", mode="r+")  # Read-write
z = zarr.open_array("data.zarr", mode="r")   # Read-only
z = zarr.open("data.zarr")                   # Auto-detect array vs group
```

### 2. Reading, Writing, and Indexing

```python
import zarr
import numpy as np

z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype="f4")

# Write slices
z[0, :] = np.arange(10000, dtype="f4")
z[10:20, 50:60] = np.random.random((10, 10)).astype("f4")
z[:] = 42  # Fill entire array

# Read slices (returns NumPy array)
row = z[5, :]
block = z[0:100, 0:100]
print(f"Row shape: {row.shape}, block shape: {block.shape}")

# Advanced indexing
z.vindex[[0, 5, 10], [2, 8, 15]]  # Coordinate (fancy) indexing
z.oindex[0:10, [5, 10, 15]]       # Orthogonal indexing
z.blocks[0, 0]                    # Block/chunk indexing

# Resize and append
z.resize((15000, 15000))  # zarr-python 3 expects the new shape as a tuple
# Appended data must match the array on every axis except the append axis
z.append(np.random.random((1000, 15000)).astype("f4"), axis=0)
```

### 3. Chunking and Sharding

Chunk shape is the most important performance parameter.
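A rough way to see why is to count how many chunks a read has to fetch under different chunk shapes. The sketch below is plain Python arithmetic and does not call Zarr; `chunks_touched` is a hypothetical helper introduced only for illustration.

```python
def chunks_touched(selection, chunks):
    """Count the chunks overlapped by a rectangular selection.

    selection: one (start, stop) pair per dimension, stop exclusive.
    chunks: chunk length per dimension.
    """
    total = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size       # index of the first chunk overlapped
        last = (stop - 1) // size   # index of the last chunk overlapped
        total *= last - first + 1
    return total


# Reading one full row of a (10000, 10000) array
row = [(0, 1), (0, 10000)]
print(chunks_touched(row, (10, 10000)))  # 1
print(chunks_touched(row, (10000, 10)))  # 1000
print(chunks_touched(row, (512, 512)))   # 20
```

The same row read fetches 1 chunk when chunks span full rows and 1000 chunks when they span full columns, which is why chunk shape should follow the access pattern.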
```python
import zarr

# Chunk aligned with access pattern
# Row-wise access → chunk spans columns
z_row = zarr.zeros((10000, 10000), chunks=(10, 10000), dtype="f4")

# Column-wise access → chunk spans rows
z_col = zarr.zeros((10000, 10000), chunks=(10000, 10), dtype="f4")

# Mixed access → balanced square chunks (~1MB each for float32)
z_bal = zarr.zeros((10000, 10000), chunks=(512, 512), dtype="f4")
# 512*512*4 bytes = ~1MB per chunk

# Sharding: group small chunks into larger storage objects
# Useful when millions of small chunks cause filesystem overhead
z_sharded = zarr.create_array(
    store="sharded.zarr",
    shape=(100000, 100000),
    chunks=(100, 100),     # Small chunks for fine-grained access
    shards=(1000, 1000),   # Groups 100 chunks per shard
    dtype="f4",
)
print(f"Chunks: {z_sharded.chunks}, shards reduce file count")
```

**Chunk size guidelines**:

- Target **1–10 MB per chunk** (minimum 1 MB)
- Align chunk shape with your most common access pattern
- Entire chunks load into memory during read → don't exceed available RAM
- Entire shards load into memory during write

### 4. Compression

```python
import zarr
from zarr.codecs import BloscCodec, GzipCodec

# Default compressor (Zstandard in zarr-python 3; check z.info for your version)
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f4")

# Explicit Blosc configuration
# create_array takes `compressors=`; the `codecs=` keyword belongs to zarr.create
z = zarr.create_array(
    store="compressed.zarr",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    compressors=BloscCodec(cname="zstd", clevel=5, shuffle="shuffle"),
)

# Speed-optimized (LZ4)
z_fast = zarr.create_array(
    store="fast.zarr",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    compressors=BloscCodec(cname="lz4", clevel=1),
)

# Maximum compression (Gzip level 9)
z_small = zarr.create_array(
    store="small.zarr",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    compressors=GzipCodec(level=9),
)

# No compression
z_raw = zarr.create_array(
    store="raw.zarr",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    compressors=None,
)
```

**Codec selection**: Zstd or Blosc/Zstd (balanced) → Blosc/LZ4 (fastest) → Gzip level 9 (typically smallest, slowest). Enable `shuffle="shuffle"` for numeric data: it groups bytes of equal significance across elements, which usually improves compression ratios.

### 5. Storage Backends

```python
import zarr
import numpy as np
from zarr.storage import LocalStore, MemoryStore, ZipStore

# Local filesystem (default; string paths create a LocalStore automatically)
z = zarr.open_array("data/array.zarr", mode="w", shape=(1000, 1000),
                    chunks=(100, 100), dtype="f4")

# In-memory (not persisted)
store = MemoryStore()
z_mem = zarr.open_array(store=store, mode="w", shape=(1000, 1000),
                        chunks=(100, 100), dtype="f4")

# ZIP file storage
store = ZipStore("data.zip", mode="w")
z_zip = zarr.open_array(store=store, mode="w", shape=(1000, 1000),
                        chunks=(100, 100), dtype="f4")
z_zip[:] = np.random.random((1000, 1000)).astype("f4")
store.close()  # IMPORTANT: must close ZipStore
```

```python
import zarr
import numpy as np

# Cloud storage: Amazon S3
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root="my-bucket/path/array.zarr", s3=s3)
z = zarr.open_array(store=store, mode="w", shape=(1000, 1000),
                    chunks=(500, 500), dtype="f4")
z[:] = np.random.random((1000, 1000)).astype("f4")

# Consolidated metadata applies to group hierarchies, not a lone array;
# see "Consolidated Metadata" under Key Concepts

# Google Cloud Storage
import gcsfs

gcs = gcsfs.GCSFileSystem(project="my-project")
store = gcsfs.GCSMap(root="my-bucket/path/array.zarr", gcs=gcs)
```

**Cloud best practices**: consolidate metadata, use 5–100 MB chunks, enable sharding to reduce object count, use Dask for parallel I/O.

### 6. Groups and Hierarchies

```python
import zarr

# Create hierarchical structure (like HDF5 groups)
root = zarr.group(store="hierarchy.zarr")

# Create sub-groups
temperature = root.create_group("temperature")
precipitation = root.create_group("precipitation")

# Create arrays within groups
temp_arr = temperature.create_array(
    name="t2m",
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype="f4",
)
precip_arr = precipitation.create_array(
    name="prcp",
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype="f4",
)

# Access by path
arr = root["temperature/t2m"]

print(root.tree())  # needs the optional `rich` package in zarr-python 3
# Output resembles:
# /
# ├── temperature
# │   └── t2m (365, 720, 1440) f4
# └── precipitation
#     └── prcp (365, 720, 1440) f4
```

### 7. Attributes and Metadata

```python
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f4")

# Attach metadata (must be JSON-serializable)
z.attrs["description"] = "Temperature data in Kelvin"
z.attrs["units"] = "K"
z.attrs["processing_version"] = 2.1
print(z.attrs["units"])  # K

# Group-level attributes
root = zarr.group("data.zarr")
root.attrs["project"] = "Climate Analysis"
root.attrs["institution"] = "Research Institute"
```

## Key Concepts

### Chunk Size Selection Guide

| Data Shape | Access Pattern | Recommended Chunks | Rationale |
|-----------|---------------|-------------------|-----------|
| (N, M) 2D | Row-wise | (small, M) | Each chunk spans full row |
| (N, M) 2D | Column-wise | (N, small) | Each chunk spans full column |
| (N, M) 2D | Random/mixed | (√(1MB/dtype), √(1MB/dtype)) | Balanced ~1MB per chunk |
| (T, H, W) time series | Time slice | (1, H, W) | One timestep per chunk |
| (T, H, W) time series | Spatial region | (T, small, small) | Full time for region |

**1 MB rule**: Taking 1 MB as 2^20 bytes, float32 (4 bytes) gives 262,144 elements per chunk and float64 (8 bytes) gives 131,072 elements.
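The arithmetic behind the 1 MB rule can be checked with a few lines of standard-library Python. This is a minimal sketch: `elements_per_chunk` and `square_chunk_side` are hypothetical helper names, and 1 MB is assumed to mean 2^20 bytes, which is what the element counts above imply.

```python
import math

MIB = 2**20  # assumption: "1 MB" in the rule means 1,048,576 bytes


def elements_per_chunk(itemsize, target_bytes=MIB):
    """Number of elements of the given byte size that fit in the target."""
    return target_bytes // itemsize


def square_chunk_side(itemsize, target_bytes=MIB):
    """Largest n such that an (n, n) chunk stays within the target size."""
    return math.isqrt(elements_per_chunk(itemsize, target_bytes))


for name, itemsize in [("float32", 4), ("float64", 8)]:
    n = elements_per_chunk(itemsize)
    side = square_chunk_side(itemsize)
    print(f"{name}: {n} elements, square chunk ({side}, {side})")
# float32: 262144 elements, square chunk (512, 512)
# float64: 131072 elements, square chunk (362, 362)
```

Passing a larger `target_bytes` (for example `10 * MIB`) gives the upper end of the 1–10 MB range.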

### Consolidated Metadata

For stores with many arrays (10+), consolidate metadata into a single read:

```python
import zarr

zarr.consolidate_metadata("data.zarr")
root = zarr.open_consolidated("data.zarr")  # Single metadata read
```

Critical for cloud storage (reduces N metadata requests to 1). Caveat: consolidated metadata becomes stale if arrays are added or changed without re-consolidating.

## Common Workflows

### Workflow: Cloud-Native Data Pipeline

```python
import zarr
import numpy as np
import s3fs

# Step 1: Write to S3 with cloud-optimized chunks
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root="s3://my-bucket/experiment.zarr", s3=s3)
root = zarr.group(store=store)

data_arr = root.create_array(
    name="measurements",
    shape=(10000, 10000),
    chunks=(500, 500),  # ~1 MB chunks; consider larger chunks or sharding on cloud stores
    dtype="f4",
)
data_arr[:] = np.random.random((10000, 10000)).astype("f4")
data_arr.attrs["experiment"] = "batch_42"

# Step 2: Consolidate metadata
zarr.consolidate_metadata(store)

# Step 3: Read from anywhere
store_read = s3fs.S3Map(root="s3://my-bucket/experiment.zarr", s3=s3)
root_read = zarr.open_consolidated(store_read)
subset = root_read["measurements"][0:100, 0:100]
print(f"Read subset: {subset.shape}")
```

### Workflow: Dask Parallel Computation

```python
import dask.array as da
import zarr

# Step 1: Create large Zarr array
z = zarr.open("large_data.zarr", mode="w", shape=(100000, 100000),
              chunks=(1000, 1000), dtype="f4")
# (populate with data...)

# Step 2: Load as Dask array (lazy, no data loaded yet)
dask_arr = da.from_zarr("large_data.zarr")
print(f"Dask array: {dask_arr.shape}, {dask_arr.npartitions} partitions")

# Step 3: Compute in parallel (out-of-core)
col_means = dask_arr.mean(axis=0).compute()
print(f"Column means: {col_means.shape}")

# Step 4: Write Dask result back to Zarr
large_random = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_random, "output.zarr")
```

### Workflow: Xarray-Zarr for Labeled Data

```python
import xarray as xr
import numpy as np
import pandas as pd

# Step 1: Create labeled dataset
ds = xr.Dataset(
    {
        "temperature": (["time", "lat", "lon"],
                        np.random.random((365, 180, 360)).astype("f4")),
        "precipitation": (["time", "lat", "lon"],
                          np.random.random((365, 180, 360)).astype("f4")),
    },
    coords={
        "time": pd.date_range("2024-01-01", periods=365),
        "lat": np.arange(-90, 90, 1.0),
        "lon": np.arange(-180, 180, 1.0),
    },
)

# Step 2: Save to Zarr
ds.to_zarr("climate.zarr")

# Step 3: Open with lazy loading
ds_loaded = xr.open_zarr("climate.zarr")
print(ds_loaded)

# Step 4: Label-based selection (only reads needed chunks)
subset = ds_loaded.sel(time="2024-06", lat=slice(30, 60))
print(f"June subset: {subset['temperature'].shape}")
```

### Workflow: Format Conversion

```python
import zarr
import numpy as np

# HDF5 → Zarr
import h5py

with h5py.File("data.h5", "r") as h5:
    z = zarr.array(h5["dataset_name"][:], chunks=(1000, 1000), store="from_hdf5.zarr")

# NumPy → Zarr
data = np.load("data.npy")
z = zarr.array(data, chunks="auto", store="from_numpy.zarr")

# Zarr → NetCDF (via Xarray)
import xarray as xr

ds = xr.open_zarr("data.zarr")
ds.to_netcdf("data.nc")
```

## Key Parameters

| Parameter | Module | Default | Options | Effect |
|-----------|--------|---------|---------|--------|
| `shape` | `create_array` | Required | Tuple of ints | Array dimensions |
| `chunks` | `create_array` | `"auto"` | Tuple of ints, `"auto"` | Chunk shape per dimension |
| `shards` | `create_array` | `None` | Tuple of ints | Shard shape (groups chunks) |
| `dtype` | `create_array` | Required unless `data` is given | NumPy dtype | Data type |
| `compressors` | `create_array` | `"auto"` (Zstd in zarr-python 3) | Codec object, list of codecs, or `None` | Compression codecs (the lower-level `zarr.create` takes `codecs=` instead) |
| `mode` | `open_array` | `"a"` | `"r"`, `"r+"`, `"a"`, `"w"`, `"w-"` | File access mode |
| `store` | All | In-memory where optional | Store object or path | Storage backend (string paths become `LocalStore`) |
| `cname` | `BloscCodec` | `"zstd"` | `"lz4"`, `"lz4hc"`, `"blosclz"`, `"zstd"`, `"zlib"` | Compressor algorithm |
| `clevel` | `BloscCodec` | 5 | 0–9 | Compression level |
| `shuffle` | `BloscCodec` | `"noshuffle"` | `"noshuffle"`, `"shuffle"`, `"bitshuffle"` | Byte reordering for compression |

## Best Practices

1. **Target 1–10 MB per chunk**: Smaller chunks mean more objects and more per-request overhead; larger chunks mean more memory per read. For float32: 512×512 ≈ 1 MB.
2. **Align chunks with access patterns**: The most common query dimension should be contiguous within chunks. A 65× performance difference is possible with misaligned chunks.
3. **Anti-pattern: loading the entire array with `z[:]`**: `z[:]` loads everything into memory. For large arrays, use Dask (`da.from_zarr`) or process in explicit chunks.
4. **Consolidate metadata for cloud stores**: Call `zarr.consolidate_metadata(store)` after creating all arrays, then open with `zarr.open_consolidated()`. This reduces N metadata reads to 1, which is critical for S3/GCS latency.
5. **Close ZipStore explicitly**: Unlike other stores, `ZipStore` requires `store.close()` after writing. Forgetting this corrupts the ZIP file.
6. **Use sharding for millions of small chunks**: When chunk count exceeds ~100k, filesystem overhead dominates. Sharding groups chunks into fewer storage objects.
7. **Anti-pattern: choosing chunk shape based on array shape alone**: Always consider access pattern first, then optimize chunk size. A "nice" shape (1000, 1000) may be terrible if you always read rows.
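
Practices 3 and 6 both come down to simple bookkeeping over the chunk grid. The following is a minimal sketch in plain Python, with no Zarr calls; `chunk_count` and `iter_chunk_slices` are hypothetical helpers introduced here for illustration, not part of the Zarr API.

```python
import itertools
import math


def chunk_count(shape, chunks):
    """Total number of chunks in the grid (edge chunks count as one)."""
    return math.prod(-(-n // c) for n, c in zip(shape, chunks))  # ceil division


def iter_chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk, clipped at the array edge."""
    starts_per_dim = [range(0, n, c) for n, c in zip(shape, chunks)]
    for starts in itertools.product(*starts_per_dim):
        yield tuple(
            slice(s, min(s + c, n)) for s, c, n in zip(starts, chunks, shape)
        )


# Practice 6: is the chunk count large enough to consider sharding?
print(chunk_count((100000, 100000), (100, 100)))    # 1000000
print(chunk_count((100000, 100000), (1000, 1000)))  # 10000

# Practice 3: visit the array one chunk-aligned block at a time
for block in iter_chunk_slices((5, 4), (2, 3)):
    print(block)
```

With an open Zarr array `z`, a loop such as `for sl in iter_chunk_slices(z.shape, z.chunks): block = z[sl]` would hold one chunk in memory at a time instead of the whole array.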

## Common Recipes

### Recipe: Profiling Array Performance

```python
import zarr

z = zarr.open("data.zarr")
print(z.info)
# Shows type, shape, chunks, dtype, and codecs

# In zarr-python 3, nbytes_stored is a method and nbytes is a property
stored = z.nbytes_stored()
print(f"Compressed: {stored / 1e6:.2f} MB")
print(f"Uncompressed: {z.nbytes / 1e6:.2f} MB")
print(f"Ratio: {z.nbytes / stored:.1f}x")
```

### Recipe: Time Series Append

```python
import zarr
import numpy as np

# Create extensible array (start with 0 timesteps)
z = zarr.open("timeseries.zarr", mode="a", shape=(0, 720, 1440),
              chunks=(1, 720, 1440), dtype="f4")

# Append new timesteps incrementally
for day in range(365):
    new_step = np.random.random((1, 720, 1440)).astype("f4")
    z.append(new_step, axis=0)

print(f"Final shape: {z.shape}")  # (365, 720, 1440)
```

### Recipe: Parallel Write with Dask

```python
import dask.array as da

# Generate large dataset in parallel
data = da.random.random((100000, 100000), chunks=(1000, 1000))

# Write to Zarr (parallel across chunks)
da.to_zarr(data, "parallel_output.zarr")

# Verify
z = da.from_zarr("parallel_output.zarr")
print(f"Written: {z.shape}, {z.npartitions} partitions")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Slow read performance | Chunk shape misaligned with access pattern | Profile access pattern; realign chunks (row-access → wide chunks) |
| `MemoryError` on read | Loading entire array or chunk too large | Use Dask `da.from_zarr()` for out-of-core; reduce chunk size |
| High cloud latency | Many small metadata reads | Call `zarr.consolidate_metadata(store)` then `open_consolidated()` |
| Corrupted ZIP store | Forgot to call `store.close()` | Always close ZipStore after write; use context manager |
| Concurrent write conflicts | Multiple processes writing overlapping chunks | Ensure each writer touches non-overlapping, chunk-aligned regions (`ProcessSynchronizer` is a zarr-python 2 API) |
| Poor compression ratio | No shuffle on numeric data | Add `shuffle="shuffle"` to BloscCodec |
| Stale consolidated metadata | Arrays modified after consolidation | Re-run `zarr.consolidate_metadata()` after updates |
| `ModuleNotFoundError: s3fs` | Missing cloud storage dependency | `pip install s3fs` (S3) or `pip install gcsfs` (GCS) |

## Related Skills

- **lamindb-data-management**: uses Zarr as a storage backend; provides data management, lineage, and ontology validation on top of Zarr arrays
- **anndata-annotated-data**: AnnData objects can be backed by Zarr stores for out-of-core single-cell data
- **sympy-symbolic-math**: unrelated but co-exists in scientific-computing category

## References

- [Zarr Python documentation](https://zarr.readthedocs.io/): official user guide and API reference
- [Zarr specifications](https://zarr-specs.readthedocs.io/): file format specification (V2, V3)
- [Zarr GitHub](https://github.com/zarr-developers/zarr-python): source code, issues, changelog
- [NumCodecs](https://numcodecs.readthedocs.io/): compression codec library used by Zarr