# Backends

**`versionable`** provides several backends, each targeting a different trade-off between human-readability, interoperability with other tools, and performance with large binary data. You never have to instantiate a backend directly — **`versionable`** picks the right one automatically based on the file extension you pass to `save()` / `load()`. If you use `.json`, you get JSON. If you use `.toml`, you get TOML. The same object can be saved and loaded with different backends just by changing the filename extension, which makes it easy to migrate between formats or write tests against a lighter-weight backend than you use in production.

| Extension       | Backend       | Best for                             |
| --------------- | ------------- | ------------------------------------ |
| `.yaml`, `.yml` | `YamlBackend` | Config files, data-science workflows |
| `.json`         | `JsonBackend` | Simple data, interoperability        |
| `.toml`         | `TomlBackend` | Human-editable config files          |
| `.h5`, `.hdf5`  | `Hdf5Backend` | Large numpy arrays, lazy loading     |

All backends store the same schema metadata (`object`, `version`, `hash` inside the `__versionable__` envelope) alongside your data, so `load()` can validate the schema and apply migrations regardless of which backend wrote the file.

## Feature comparison

| Feature               | YAML | JSON | TOML | HDF5  |
| --------------------- | ---- | ---- | ---- | ----- |
| Human-readable        | Yes  | Yes  | Yes  | No    |
| `None` / `null`       | Yes  | Yes  | No   | Yes   |
| Comment-out defaults  | Yes  | No   | Yes  | No    |
| Nested objects        | Yes  | Yes  | Yes  | Yes   |
| Large numpy arrays    | Slow | Slow | Slow | Fast  |
| Lazy loading          | No   | No   | No   | Yes   |
| Hand-editable         | Good | Fair | Best | No    |
| External tool support | Wide | Wide | Good | Niche |

## YAML

YAML is a good choice when you want human-readable files with support for comments (added by hand), `null` values, and a syntax that is already familiar in data-science and DevOps workflows. Unlike TOML, YAML handles `None` natively — fields with `None` survive the round-trip without any special treatment.

```python
versionable.save(config, "config.yaml")
loaded = versionable.load(SensorConfig, "config.yaml")
```

Produces:

```yaml
name: probe-A
sampleRate_Hz: 120000
channels:
- 0
- 1
- 2
__versionable__:
  object: SensorConfig
  version: 1
  hash: 9d6951
```

Both `.yaml` and `.yml` extensions are supported. Metadata is stored in a `__versionable__` mapping at the end of the file — your data comes first, schema metadata stays out of the way.

### Missing fields

Any field absent from the file is filled in from the dataclass default on load. This means older files with fewer fields load cleanly as new fields are added to the schema (as long as those fields have defaults).

### Comment-Out Defaults

Pass `commentDefaults=True` when saving to comment out fields whose values match the class default. This is useful for config files where you want users to see all available options without all of them being "active":

```python
versionable.save(config, "config.yaml", commentDefaults=True)
```

```yaml
name: probe-A
sampleRate_Hz: 120000
# channels:
# - 0
# - 1
# - 2
__versionable__:
  object: SensorConfig
  version: 1
  hash: 9d6951
```
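Loading a file saved this way needs no special flag — commented-out lines are simply absent keys, so those fields come back from the dataclass defaults as described under "Missing fields". A minimal sketch, assuming `SensorConfig`'s defaults match the example output above:

```python
# "channels" is commented out in the file, so it is restored from the
# dataclass default ([0, 1, 2] in this example) on load.
loaded = versionable.load(SensorConfig, "config.yaml")
assert loaded.channels == [0, 1, 2]
```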
## JSON

JSON is the most common choice when the file will be read by tools outside of Python — a web service, a JavaScript front-end, or a data pipeline that expects a standard format. It handles all primitive types, lists, and nested objects, and the output is human-readable even if not particularly easy to hand-edit.

```python
versionable.save(config, "config.json")
loaded = versionable.load(SensorConfig, "config.json")
```

The output includes schema metadata alongside the data:

```json
{
  "__versionable__": {
    "object": "SensorConfig",
    "version": 1,
    "hash": "9d6951"
  },
  "name": "probe-A",
  "sampleRate_Hz": 120000,
  "channels": [0, 1, 2]
}
```

## TOML

TOML is the best choice for configuration files that users will open and edit by hand. The format is designed to be obvious at a glance, supports comments (via `commentDefaults`), and maps cleanly to nested sections. If your dataclass represents application settings that ship with the software and users are expected to tweak, prefer TOML over JSON.

```python
versionable.save(config, "config.toml")
loaded = versionable.load(SensorConfig, "config.toml")
```

Produces human-readable TOML:

```toml
name = "probe-A"
sampleRate_Hz = 120000
channels = [0, 1, 2]

[__versionable__]
object = "SensorConfig"
version = 1
hash = "9d6951"
```

Fields come first deliberately — if a user opens the file to hand-edit a value, the data is right at the top and the schema metadata stays out of the way at the bottom.

### Missing fields and None values

TOML is flexible about missing keys — any field absent from the file is silently filled in from the dataclass default on load. This means you can hand-edit a config file and freely delete any line whose value you want to reset to default, and it will just work. It also means older files with fewer fields load cleanly as new fields are added to the schema (as long as those fields have defaults).

**We recommend that every field in a class saved to TOML defines a default value.** Required fields (no default) work fine for new files, but they become a liability the moment a file is hand-edited, partially written, or migrated from an older schema version — any of which can leave the field absent, causing load to fail.

The one case to be careful about is `None`. TOML has no native `null` type, so a field holding `None` is omitted on save — the same as a missing key. On load it is restored from the dataclass default, which is fine if a default exists. But for a required field (no default), `None` at save time means the field disappears from the file and cannot be recovered on load. JSON and YAML handle this safely because `null` is a first-class value that survives the round-trip. If your schema has required fields that may genuinely hold `None`, prefer YAML or JSON.

Nested `Versionable` objects become native TOML tables. For example, given:

```python
@dataclass
class RetryPolicy(Versionable, version=1, hash="f907a9"):
    retries: int = 3
    backoff_s: float = 1.0

@dataclass
class WorkerConfig(Versionable, version=1, hash="8bdfa7"):
    name: str = "worker"
    retry: RetryPolicy = field(default_factory=RetryPolicy)
```

The saved TOML looks like:

```toml
name = "worker"

[__versionable__]
object = "WorkerConfig"
version = 1
hash = "8bdfa7"

[retry]
retries = 3
backoff_s = 1.0

[retry.__versionable__]
object = "RetryPolicy"
version = 1
hash = "f907a9"
```

Each nested `Versionable` carries its own `__versionable__` sub-table — the same shape as the root envelope.
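Loading the nested file is the same one-liner as the flat case — `load()` rebuilds the nested `RetryPolicy` from its sub-table. A minimal sketch (the filename is illustrative):

```python
versionable.save(WorkerConfig(), "worker.toml")

loaded = versionable.load(WorkerConfig, "worker.toml")
assert isinstance(loaded.retry, RetryPolicy)  # nested table becomes a nested instance
assert loaded.retry.retries == 3
```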
### Comment-Out Defaults

Pass `commentDefaults=True` to comment out fields whose values match the class default. This is useful for config files where you want users to see all available options without all of them being "active":

```python
@dataclass
class SensorPreset(Versionable, version=1, hash="6f2809"):
    name: str = "sensor"
    sampleRate_Hz: int = 48000
    enabledChannels: list[int] = field(default_factory=lambda: [0, 1])

preset = SensorPreset(name="probe-A")  # only override name
versionable.save(preset, "preset.toml", commentDefaults=True)
```

```toml
name = "probe-A"
# sampleRate_Hz = 48000
# enabledChannels = [0, 1]

[__versionable__]
object = "SensorPreset"
version = 1
hash = "6f2809"
```

Fields at their default render as `#`-prefixed lines; users uncomment any line to override that default.

Note: hand-added comments in a TOML file are wiped on the next `save()` — the file is regenerated from the parsed Python dict, which doesn't carry comments. Round-trip preservation of user-added comments is planned for a follow-up release.

## HDF5

HDF5 is the right choice when your dataclasses contain large numpy arrays — recordings, images, simulation outputs, or any dataset where reading the whole file into memory upfront would be slow or wasteful. Unlike JSON and TOML, HDF5 stores arrays as compressed binary datasets, so a 100 MB array saves and loads in a fraction of the time it would take as text.

The HDF5 backend depends on `h5py`, which in turn requires the HDF5 C library — a non-trivial native dependency that adds significant installation overhead. It is therefore kept as an optional extra so that projects using only JSON or TOML don't pay that cost.

**Installation:** On most platforms (macOS, Windows, Linux x86_64), a pre-built wheel is available and pip installs it directly:

```bash
pip install "versionable[hdf5] @ git+https://github.com/hendrickmelo/versionable.git"
```

On Linux ARM or systems with an older glibc (e.g. RHEL 7), no pre-built wheel is available and pip will fall back to building from source. Install the HDF5 system library first:

```bash
sudo apt install libhdf5-dev   # Debian/Ubuntu
sudo yum install hdf5-devel    # RHEL/CentOS
```

Then run the pip install above. Users on conda-based environments can skip this — conda manages the HDF5 C library as a first-class package.

```python
import numpy as np
import numpy.typing as npt
from dataclasses import dataclass

import versionable
from versionable import Versionable

@dataclass
class Recording(Versionable, version=1, hash="..."):
    name: str
    sampleRate_Hz: int
    data: npt.NDArray[np.float64]

rec = Recording(name="capture-1", sampleRate_Hz=240000, data=np.random.rand(1_000_000))
versionable.save(rec, "recording.h5")
```

Every field maps to a native HDF5 construct:

| Python type                                            | HDF5 representation                            |
| ------------------------------------------------------ | ---------------------------------------------- |
| `int`, `float`, `bool`, `str`                          | Scalar attribute                               |
| `np.ndarray`                                           | Dataset (compressed)                           |
| `list[int]`, `list[float]`, `list[str]`, `list[bool]`  | 1-D dataset                                    |
| `list[np.ndarray]`                                     | Group of integer-keyed datasets                |
| `dict[str, np.ndarray]`                                | Group of named datasets                        |
| Nested `Versionable`                                   | Subgroup with `__versionable__` metadata group |
| `list[Versionable]`                                    | Group of integer-keyed subgroups               |
| `None`                                                 | `h5py.Empty` attribute                         |
| `Enum`                                                 | Attribute (stores `.value`)                    |
| Converted types (datetime, Path, etc.)                 | Attribute (converter output)                   |

Metadata (`object`, `version`, `hash`) is stored as attributes on a `__versionable__` child group at the root and inside each nested Versionable subgroup. This distinguishes Versionable groups from plain collection groups. The `format` attribute is reserved in this group for future versionable file-format versioning.

Files are readable with h5dump, HDFView, MATLAB, or any HDF5-compatible tool. Reconstructing exact Python types (e.g., distinguishing `list[float]` from `np.ndarray`) requires the class's type annotations.
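To make the layout concrete, here is a sketch of inspecting the `Recording` file above with plain `h5py`. The names follow the mapping table (scalars as attributes, arrays as datasets), so treat it as illustrative rather than a guaranteed on-disk layout:

```python
import h5py

with h5py.File("recording.h5", "r") as f:
    print(dict(f.attrs))                      # scalar fields: name, sampleRate_Hz
    print(dict(f["__versionable__"].attrs))   # object, version, hash
    print(f["data"].shape, f["data"].dtype)   # the compressed array dataset
```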
### Compression

By default, array datasets are compressed with **gzip (level 4)** for maximum compatibility across tools (MATLAB, HDFView, h5py without plugins). You can change the algorithm and level per-save by passing a `compression` kwarg:

```python
from versionable.hdf5 import Hdf5Compression, BLOSC_DEFAULT, GZIP_DEFAULT, ZSTD_DEFAULT, UNCOMPRESSED

# Use a preset
versionable.save(rec, "recording.h5", compression=GZIP_DEFAULT)

# Or build a custom configuration
comp = Hdf5Compression(algorithm="zstd", level=9)
versionable.save(rec, "recording.h5", compression=comp)
```

Compression is a storage concern — it does not affect the schema hash and has no impact on `load()`. Any compressed file can be read back regardless of what compression was used to write it, as long as the required filter is available.

Compression is set per-dataset at creation time. When resuming a session, appending to an existing dataset uses the original dataset's compression filter, not the session's `compression` parameter. The session's compression only applies to newly created datasets.

#### Available presets

| Preset          | Speed | Size | When to use                                                                     |
| --------------- | ----- | ---- | ------------------------------------------------------------------------------- |
| `GZIP_DEFAULT`  | 🐢    | 🗜️   | Default — universal compatibility                                               |
| `ZSTD_DEFAULT`  | 🚀    | 🗜️   | Good ratio and speed (requires hdf5plugin on reader)                            |
| `ZSTD_FAST`     | ⚡⚡   | 📦   | Write speed matters more than file size                                         |
| `ZSTD_BEST`     | 🐢    | 🗜️🗜️  | Archival — smallest files, slower writes                                        |
| `BLOSC_DEFAULT` | ⚡⚡   | 🗜️   | Large arrays — parallel blosc2 with zstd inside                                 |
| `LZF`           | ⚡    | 📦   | Fastest round-trip when ratio matters less than compatibility with other tools  |
| `UNCOMPRESSED`  | 🐰    | 📦📦  | Debugging, or data that doesn't compress well                                   |

#### Hdf5Compression fields

- **`algorithm`** — `"zstd"` | `"gzip"` | `"lzf"` | `"blosc"` | `None`. Default: `"gzip"`. Set to `None` for uncompressed.
- **`level`** — `int | None`. Default: `4`. Algorithm-specific level (zstd: 1–22, gzip: 0–9, blosc: 0–9).
- **`shuffle`** — `bool`. Default: `True`. Byte-shuffle filter (improves compression ratio for numeric data).
- **`bloscCompressor`** — `"zstd"` | `"blosclz"` | `"lz4"` | `"lz4hc"` | `"zlib"`. Default: `"zstd"`. Sub-compressor used when `algorithm="blosc"`.

The zstd and blosc algorithms are provided by the [hdf5plugin](https://hdf5plugin.readthedocs.io/en/latest/usage.html) package, which is included in the `[hdf5]` extra. See the hdf5plugin docs for full details on filter parameters and tuning options. The gzip and lzf algorithms are built into h5py and work without hdf5plugin.

The `BLOSC_DEFAULT` preset uses [blosc2](https://www.blosc.org/pages/blosc-in-depth/) — a meta-compressor that adds parallel blocking, byte-shuffle, and cache-aligned chunking on top of the chosen sub-compressor. Buffer alignment and block sizes are handled automatically.

#### Compatibility note

The default `GZIP_DEFAULT` produces files readable by every HDF5 implementation. The `ZSTD_*` presets (and `BLOSC_DEFAULT`) produce files that require `hdf5plugin` on the reading side as well.
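A reader that opens such a file with plain `h5py` can satisfy that requirement simply by importing `hdf5plugin` — the import itself registers the extra filters. A minimal sketch, assuming a zstd-compressed `recording.h5`:

```python
import hdf5plugin  # noqa: F401 — importing registers the zstd/blosc filters with h5py
import h5py

with h5py.File("recording.h5", "r") as f:
    data = f["data"][:]  # decompressed transparently once the filter is registered
```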
Use the `ZSTD_*` presets when all readers have the plugin installed and you need better speed or ratio:

```python
versionable.save(rec, "recording.h5", compression=ZSTD_DEFAULT)
```

### Lazy Loading

By default, array fields are not read from disk until first access. This means `load()` returns almost instantly even for large files — the array is fetched only when your code actually uses it:

```python
loaded = versionable.load(Recording, "recording.h5")
loaded.name  # Loaded immediately (scalar)
loaded.data  # Read from disk on first access, then cached
```

Lazy loading also works per-element for collection fields:

- **`list[np.ndarray]`** — returns a `LazyArrayList` where each element loads on indexing or iteration
- **`dict[str, np.ndarray]`** — returns a `LazyArrayDict` where each value loads on key access

```python
loaded = versionable.load(Experiment, "experiment.h5")
loaded.traces[0]        # Loads only the first trace
loaded.channels["ch0"]  # Loads only channel "ch0"
```

Lazy loading is particularly useful when you have many recordings on disk and only need to inspect metadata (name, sample rate, channel count) before deciding which ones to process.

### Preload

If you know you'll need an array right away, you can opt into eager loading to avoid the latency hit at first access time — useful when you're about to iterate over the data in a tight loop:

```python
# Preload specific fields
loaded = versionable.load(Recording, "recording.h5", preload=["data"])

# Preload all arrays
loaded = versionable.load(Recording, "recording.h5", preload="*")
```

### Metadata Only

Load only scalar fields and skip arrays entirely. Accessing an array field raises `ArrayNotLoadedError`. This is the fastest possible load — ideal for scanning a directory of files to build an index or filter by metadata before loading the full data:

```python
loaded = versionable.load(Recording, "recording.h5", metadataOnly=True)
loaded.name  # Works
loaded.data  # Raises ArrayNotLoadedError
```

### Save-As-You-Go Sessions

For scenarios where data arrives incrementally (DAQ streaming, simulation loops, long experiments), `versionable.hdf5.open()` provides a file-backed session that persists mutations as they happen:

```python
from dataclasses import dataclass, field

import numpy as np
from numpy.typing import NDArray

import versionable
import versionable.hdf5
from versionable import Versionable

@dataclass
class Experiment(Versionable, version=1, hash="..."):
    name: str = ""
    sampleRate_Hz: float = 0.0
    traces: list[np.ndarray] = field(default_factory=list)
    timestamps: list[float] = field(default_factory=list)
    waveform: NDArray[np.float64] = field(default_factory=lambda: np.empty(0))

# You can pass a class (empty proxy) or an existing instance:
exp = Experiment(
    name="baseline",
    sampleRate_Hz=48000.0,
    traces=[],
    timestamps=[],
    waveform=np.empty((0, 1024)),
)

with versionable.hdf5.open(exp, "run001.h5") as exp:
    # All fields already persisted — just append
    for chunk in daq.stream():
        exp.traces.append(chunk.data)      # new dataset written to disk
        exp.timestamps.append(chunk.time)  # resizable dataset grows
        exp.waveform.append(chunk.raw)     # resizable dataset grows

# Load normally — no special API needed
exp = versionable.load(Experiment, "run001.h5")
```

All ndarray fields in a session are backed by resizable HDF5 datasets and wrapped with `DatasetArray`. Every ndarray supports `.append()`, element writes (write-through to disk), `.resize()`, and numpy interop — no annotation required.
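A short sketch of what that wrapping enables, reusing the `Experiment` instance from above (the filename and the appended array are illustrative):

```python
with versionable.hdf5.open(exp, "run002.h5") as exp:
    exp.waveform.append(np.zeros((1, 1024)))  # grows the resizable dataset on disk
    exp.waveform[0] = np.ones(1024)           # element write goes straight to disk
    mean = float(np.mean(exp.waveform))       # numpy interop reads the stored data
```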
#### Session Modes

| Mode                 | Behavior                                               |
| -------------------- | ------------------------------------------------------ |
| `"create"` (default) | New file; error if file exists                         |
| `"overwrite"`        | Delete existing file if present, create new            |
| `"resume"`           | Open existing file, restore state, continue appending  |
| `"read"`             | Open existing file read-only, no mutations allowed     |

```python
# Resume after a crash or between sessions
with versionable.hdf5.open(Experiment, "run001.h5", mode="resume") as exp:
    print(len(exp.traces))       # existing data is available
    exp.traces.append(new_data)  # appending continues from where it left off

# Read-only access — no mutations allowed
with versionable.hdf5.open(Experiment, "run001.h5", mode="read") as exp:
    print(np.mean(exp.waveform))  # numpy reads work
    # exp.name = "new"            # raises BackendError
    # exp.waveform[0] = 0         # raises BackendError
```

#### `Hdf5FieldInfo` — Optional Layout Hints

All ndarray fields are resizable by default. Use `Hdf5FieldInfo` only when you need to override the chunk size or append axis:

```python
from typing import Annotated

from versionable import Hdf5FieldInfo

# Explicit axis (default: inferred from zero-size dimension, or 0)
channels: Annotated[np.ndarray, Hdf5FieldInfo(axis=1)]

# Custom chunk size (default: ~256 KB heuristic)
highRes: Annotated[np.ndarray, Hdf5FieldInfo(chunkRows=128)]
```

`Hdf5FieldInfo` is pure annotation metadata — it's ignored by `save()`/`load()` and non-HDF5 backends. The field hashes identically to a plain `np.ndarray`.

#### Dtype Inference

The on-disk dtype is inferred from the field's type annotation:

```python
data: NDArray[np.float32]  # stored as float32 on disk, even if assigned float64
```

Bare `np.ndarray` fields use the assigned array's dtype.

#### Tracked Collections

- **`list[np.ndarray]`** — each `.append()` creates a new dataset in a group
- **`list[float]`** / **`list[str]`** — `.append()` resizes a 1-D dataset
- **`dict[str, np.ndarray]`** — `__setitem__` creates/replaces datasets in a group
- `insert`, `pop`, `remove`, `sort`, `reverse` raise `NotImplementedError` — build in memory and assign the whole list instead

#### `flush()` for Durability

These operations write through to disk automatically — no `flush()` needed:

- **`DatasetArray.__setitem__`** — `obj.data[50] = 42.0`
- **`DatasetArray.append()`** / **`resize()`**
- **`TrackedList.append()`** / **`extend()`** / **`__setitem__`**
- **`TrackedDict.__setitem__`** / **`__delitem__`** / **`update()`**
- **Scalar field assignment** — `obj.name = "new"`

`session.flush()` flushes HDF5 internal buffers to the OS, ensuring data reaches disk even if the process crashes immediately after. Call it in long-running loops where you need a durability checkpoint:

```python
session = versionable.hdf5.open(MyClass, "out.h5")
with session as obj:
    for batch in data_source:
        obj.data.append(batch)
        session.flush()  # ensure data survives a crash
```

#### Limitations

Sessions do not support migrations. The file's version and hash must exactly match the class. If your schema has changed, use `versionable.load()` (which supports migrations) to load the old file, then re-save with a new session.
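A minimal sketch of that upgrade path (filenames and the appended chunk are illustrative):

```python
# Load the old-schema file — versionable.load() applies any registered migrations.
exp = versionable.load(Experiment, "run_old.h5")

# Re-save under the current schema by opening a fresh session, then keep appending.
with versionable.hdf5.open(exp, "run_migrated.h5") as exp:
    exp.traces.append(new_chunk)
```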