# Claude Development Notes for scida

## Project Overview

**scida** is a Python library for scalable analysis of large scientific datasets, primarily targeting astrophysics simulations (particle-based cosmological simulations) and observational data. It provides a unified interface to load, analyze, and process big data using **dask** for parallel/lazy computation.

**Key Features:**
- Unified high-level interface via `scida.load()`
- Parallel, task-based data processing with dask arrays
- Physical unit support via pint
- Extensible architecture with mixins and custom dataset classes
- Support for HDF5, zarr, and FITS file formats

## Project Setup

This project uses **uv** for dependency management. All Python commands should be prefixed with `uv run`.

### Installation
```bash
uv sync                    # Install all dependencies
uv sync --all-groups       # Include dev dependencies
```

### Running Commands
```bash
uv run python -m pytest tests/test_init.py -v   # Run specific test
uv run python -c "import scida; print('test')"  # Run Python code
uv run mypy src/                                 # Type checking
uv run black src/                                # Code formatting
uv run isort src/                                # Import sorting
```

### Common Development Commands
| Task | Command |
|------|---------|
| Run all tests | `uv run pytest` |
| Run tests with coverage | `uv run coverage run -m pytest` |
| Run specific test file | `uv run pytest tests/test_interface.py -v` |
| Build docs locally | `make servedocs` (serves at localhost:8000) |
| Deploy docs | `make publicdocs` |
| Version bump | `make version v=patch` (or minor/major) |

### Git Worktree Workflow for Feature Development

**Always develop features on separate branches using git worktrees.** This keeps the main working directory clean and allows parallel work on multiple features.

```bash
# Create a new feature branch and worktree
git branch feature/my-feature main
git worktree add branches/my-feature feature/my-feature

# Work in the worktree
cd branches/my-feature
# ... make changes, commit, etc.

# When done, clean up
cd /home/cbyrohl/repos/scida
git worktree remove branches/my-feature
git branch -d feature/my-feature  # after merging
```

**Directory structure:**
```
/home/cbyrohl/repos/scida/
├── branches/                    # All worktrees live here
│   ├── fix-validate-path-types/ # Example worktree
│   └── feature-xyz/             # Another worktree
├── src/                         # Main repo (stay on main)
└── ...
```

**Key rules:**
- Never commit feature work directly to `main` in the main repo
- Each refactoring task in `refactoring_tasks/` specifies its branch name
- Run tests in the worktree before creating PRs
- The `branches/` directory is gitignored

## Architecture Overview

### Directory Structure
```
src/scida/
├── __init__.py          # Package entry point, imports key classes
├── convenience.py       # load() function - main user entry point
├── interface.py         # BaseDataset, Dataset, Selector base classes
├── fields.py            # FieldContainer - data storage abstraction
├── series.py            # DatasetSeries - collection of datasets
├── registries.py        # Type registries for auto-detection
├── discovertypes.py     # Type detection logic
├── config.py            # Configuration management
├── io/                  # I/O backends (HDF5, zarr, FITS)
│   ├── _base.py         # ChunkedHDF5Loader, load functions
│   └── fits.py          # FITS file support
├── interfaces/mixins/   # Composable functionality
│   ├── units.py         # UnitMixin - pint unit support
│   ├── cosmology.py     # CosmologyMixin - cosmological parameters
│   └── spatial.py       # SpatialCartesian3DMixin
├── customs/             # Simulation-specific implementations
│   ├── gadgetstyle/     # Base Gadget-style simulations
│   ├── arepo/           # Arepo/Illustris/TNG simulations
│   ├── swift/           # SWIFT simulations
│   ├── gizmo/           # GIZMO/FIRE simulations
│   └── rockstar/        # Rockstar halo catalogs
└── configfiles/         # YAML config files (units, simulations)
```

### Core Classes

#### `BaseDataset` / `Dataset` (interface.py)
- Base class for all datasets
- Uses `MixinMeta` metaclass for dynamic mixin composition
- Auto-registers subclasses in `dataset_type_registry`
- Key methods: `__init__()`, `validate_path()`, `save()`, `info()`

#### `FieldContainer` (fields.py)
- MutableMapping that stores dask arrays and supports lazy field evaluation
- Supports nested containers (e.g., `data["PartType0"]["Coordinates"]`)
- Field recipes: functions that compute derived fields on-demand
- Key methods: `register_field()`, `keys()`, `get_dataframe()`
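
The recipe behaviour listed above can be illustrated with a small stand-alone sketch. `LazyFieldContainer` is a hypothetical, stdlib-only stand-in and not scida's `FieldContainer`; it only shows, under that assumption, how a mapping can store recipes and evaluate each one on first access.

```python
from collections.abc import MutableMapping


class LazyFieldContainer(MutableMapping):
    """Hypothetical stand-in for a container with on-demand field recipes."""

    def __init__(self):
        self._fields = {}   # materialized fields
        self._recipes = {}  # name -> function that computes the field

    def register_field(self, name):
        """Return a decorator that stores a recipe under `name` without running it."""
        def decorator(func):
            self._recipes[name] = func
            return func
        return decorator

    def __getitem__(self, key):
        if key not in self._fields:
            if key not in self._recipes:
                raise KeyError(key)
            # Evaluate the recipe once and cache the result.
            self._fields[key] = self._recipes[key](self)
        return self._fields[key]

    def __setitem__(self, key, value):
        self._fields[key] = value

    def __delitem__(self, key):
        if key not in self._fields and key not in self._recipes:
            raise KeyError(key)
        self._fields.pop(key, None)
        self._recipes.pop(key, None)

    def __iter__(self):
        # List materialized fields first, then recipes not yet evaluated.
        yield from self._fields
        yield from (k for k in self._recipes if k not in self._fields)

    def __len__(self):
        return len(set(self._fields) | set(self._recipes))


container = LazyFieldContainer()
container["Masses"] = [1.0, 2.0, 3.0]
calls = []  # records when the recipe actually runs


@container.register_field("TotalMass")
def total_mass(c):
    calls.append("TotalMass")
    return sum(c["Masses"])
```

In this sketch, `keys()` reports `TotalMass` before it is computed, and the recipe runs only when `container["TotalMass"]` is first read.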

#### `DatasetSeries` (series.py)
- Container for multiple related datasets (e.g., simulation snapshots)
- Lazy initialization via `delay_init` decorator
- Metadata caching for efficient series navigation

#### Mixins (interfaces/mixins/)
- **UnitMixin**: Adds pint unit support, reads unit config files
- **CosmologyMixin**: Cosmological parameter handling (redshift, scale factor)
- **SpatialCartesian3DMixin**: Spatial operations in Cartesian coordinates

### Type Detection Flow

1. `scida.load(path)` is called
2. `_determine_type()` iterates through registered types
3. Each type's `validate_path()` classmethod is called
4. Most specific matching type is selected (via MRO analysis)
5. `_determine_mixins()` adds appropriate mixins
6. Dataset instance is created with dynamic class composition
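
Steps 2 to 4 can be sketched with plain classes. This is a simplified illustration, not scida's `_determine_type()`: the class names, the filename checks, and the enum values are assumptions made for the example. "Most specific" is modelled here as "no other matching candidate is a subclass of it".

```python
from enum import Enum


class CandidateStatus(Enum):
    # Numeric values are illustrative; only the three states come from these notes.
    NO = 0
    MAYBE = 1
    YES = 2


class GadgetLike:
    """Hypothetical generic type: accepts any .hdf5 path."""

    @classmethod
    def validate_path(cls, path):
        return CandidateStatus.MAYBE if path.endswith(".hdf5") else CandidateStatus.NO


class ArepoLike(GadgetLike):
    """Hypothetical specialised type: additionally requires 'arepo' in the name."""

    @classmethod
    def validate_path(cls, path):
        if super().validate_path(path) is CandidateStatus.NO:
            return CandidateStatus.NO
        return CandidateStatus.MAYBE if "arepo" in path else CandidateStatus.NO


registry = {"GadgetLike": GadgetLike, "ArepoLike": ArepoLike}


def determine_type(path):
    """Return the most specific matching class, or None if there is no unique match."""
    candidates = [
        c for c in registry.values()
        if c.validate_path(path) is not CandidateStatus.NO
    ]
    # Drop every candidate that is a base class of another candidate.
    specific = [
        c for c in candidates
        if not any(o is not c and issubclass(o, c) for o in candidates)
    ]
    return specific[0] if len(specific) == 1 else None
```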

### Registry Pattern
```python
# In registries.py
from typing import Dict, Type

dataset_type_registry: Dict[str, Type] = {}
dataseries_type_registry: Dict[str, Type] = {}
mixin_type_registry: Dict[str, Type] = {}

# Auto-registration via __init_subclass__
class MyDataset(Dataset):  # Automatically registered
    @classmethod
    def validate_path(cls, path, **kwargs):
        # Return a CandidateStatus: YES, MAYBE, or NO
        ...
```
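
For readers unfamiliar with `__init_subclass__`, the following stand-alone sketch shows the auto-registration mechanism in isolation. The classes here are minimal stand-ins, not scida's; keying the registry by class name is an assumption for the example.

```python
from typing import Dict, Type

dataset_type_registry: Dict[str, Type] = {}


class BaseDataset:
    """Minimal stand-in for a base class; not scida's BaseDataset."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass, at class-definition time.
        dataset_type_registry[cls.__name__] = cls


class Dataset(BaseDataset):
    pass


class MyDataset(Dataset):
    pass
```

Defining a subclass is enough to register it; no explicit registration call is needed.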

## Creating Custom Dataset Classes

### Pattern for New Dataset Types
```python
from scida.discovertypes import CandidateStatus
from scida.interface import Dataset

class MyCustomDataset(Dataset):
    """Custom dataset for specific simulation type."""

    @classmethod
    def validate_path(cls, path, *args, **kwargs) -> CandidateStatus:
        """Check if path matches this dataset type."""
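        # Check file structure, metadata, etc.
        # Hypothetical check for illustration; replace with a real test of `path`.
        is_valid = str(path).endswith(".hdf5")
        if is_valid: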
            return CandidateStatus.MAYBE
        return CandidateStatus.NO

    def __init__(self, path, **kwargs):
        super().__init__(path, **kwargs)
        # Custom initialization
        # Access metadata via self._metadata_raw
        # Access data via self.data (FieldContainer)
```
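
As one example of a "file structure" check that could back `validate_path()`, the sketch below tests for the HDF5 format signature. `looks_like_hdf5` is a hypothetical helper, not part of scida. It only inspects offset 0, whereas the HDF5 format also permits the signature at later offsets (512, 1024, ...), so treat it as a cheap first filter.

```python
from pathlib import Path

# The 8-byte signature that starts an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Hypothetical helper: True if `path` is a file starting with the HDF5 signature."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

A `validate_path()` implementation could return `CandidateStatus.NO` early when this check fails and only then inspect metadata.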

### Registering Derived Fields
```python
# `snap` is a dataset previously returned by scida.load()
@snap.data.register_field("PartType0", name="Temperature")
def temperature(container, snap=None, ureg=None):
    """Compute temperature from internal energy."""
    u = container["InternalEnergy"]
    # ... computation ...
    return temperature_array
```

## Testing

### Test Structure
```
tests/
├── conftest.py              # Fixtures, test configuration
├── helpers.py               # Dummy file generators (DummyGadgetFile, etc.)
├── testdata_properties.py   # Test data requirement decorators
├── test_interface.py        # Core interface tests
├── test_units.py            # Unit handling tests
├── customs/                 # Simulation-specific tests
│   ├── test_arepo.py
│   ├── test_swift.py
│   └── ...
└── simulations/             # Integration tests with real data
```

### Test Data Handling
- Tests use `@require_testdata` and `@require_testdata_path` decorators
- Test data path set via `SCIDA_TESTDATA_PATH` environment variable
- Dummy files created via `DummyGadgetSnapshotFile`, `DummyTNGFile` classes
- Cache directory isolated per test via `cachedir` fixture

### Running Tests
```bash
# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run specific test file
uv run pytest tests/test_interface.py

# Run tests matching pattern
uv run pytest -k "arepo"

# Run with coverage
uv run pytest --cov=scida

# Run via nox (multi-Python testing)
nox -s tests
```

### Nox Sessions
- `tests`: Run tests across Python 3.9-3.12
- `tests_numpy_versions`: Test with different NumPy versions
- `coverage`: Generate coverage report

## Configuration

### User Configuration
- Location: `~/.config/scida/config.yaml`
- Copied from `scida/configfiles/config.yaml` on first use
- Override with `SCIDA_CONFIG_PATH` environment variable
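
The lookup order above can be sketched as follows. `resolve_config_path` is a hypothetical helper, not scida's implementation, and it does not model copying the default file on first use.

```python
import os
from pathlib import Path

DEFAULT_CONFIG_PATH = Path("~/.config/scida/config.yaml")


def resolve_config_path(environ=None):
    """Hypothetical helper: SCIDA_CONFIG_PATH wins, otherwise the default location."""
    environ = os.environ if environ is None else environ
    override = environ.get("SCIDA_CONFIG_PATH")
    path = Path(override) if override else DEFAULT_CONFIG_PATH
    return path.expanduser()
```

Passing `environ` explicitly keeps the function easy to test without touching the real environment.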

### Environment Variables
| Variable | Purpose |
|----------|---------|
| `SCIDA_CONFIG_PATH` | Override config file location |
| `SCIDA_CACHE_PATH` | Override cache directory |
| `SCIDA_TESTDATA_PATH` | Test data directory |

### Unit Configuration
- General units: `scida/configfiles/units/general.yaml`
- Simulation-specific: Pass `unitfile` parameter to `load()`
- Unit definitions use pint format

## Documentation

### Building Docs
```bash
make servedocs     # Serve locally with hot reload
make localdocs     # Build static site
make publicdocs    # Deploy to GitHub Pages
```

### Documentation Stack
- **MkDocs** with Material theme
- **mkdocstrings** for API docs (NumPy docstring style)
- **mkdocs-jupyter** for notebook integration

## Code Conventions

### Style Guidelines
- **Formatter**: black
- **Import sorting**: isort
- **Linting**: flake8
- **Type hints**: Used throughout, check with mypy

### Docstring Format (NumPy style)
```python
def function(param1: str, param2: int = 0) -> bool:
    """
    Short description.

    Parameters
    ----------
    param1 : str
        Description of param1.
    param2 : int, optional
        Description of param2.

    Returns
    -------
    bool
        Description of return value.
    """
```

### Logging
```python
import logging
log = logging.getLogger(__name__)
log.debug("Debug message")
log.info("Info message")
```

## Key Patterns

### Lazy Evaluation
- Data loaded as dask arrays (lazy by default)
- Call `.compute()` to materialize results
- Field recipes evaluated on first access

### Virtual Caching
- HDF5 datasets loaded as virtual datasets when possible
- Reduces memory footprint for multi-file datasets
- Controlled via `virtualcache` parameter

### Mixin Composition
```python
# Mixins dynamically composed at load time
instance = cls(path, mixins=[UnitMixin, CosmologyMixin], **kwargs)

# Results in dynamically created class:
# DatasetWithUnitMixinAndCosmologyMixin
```
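
A minimal sketch of how such a class can be built at runtime with `type()`. The classes and the `compose` helper are stand-ins written for this example; placing the mixins before the base class in the MRO is an assumption, not a statement about scida's internals.

```python
class Dataset:
    """Stand-in base class for the example."""

    def __init__(self, path):
        self.path = path


class UnitMixin:
    def describe_units(self):
        return "units enabled"


class CosmologyMixin:
    def describe_cosmology(self):
        return "cosmology enabled"


def compose(base, mixins):
    """Create a subclass of `base` that also inherits from each mixin."""
    if not mixins:
        return base
    name = base.__name__ + "With" + "And".join(m.__name__ for m in mixins)
    # Mixins are listed first so their methods take precedence over the base.
    return type(name, (*mixins, base), {})


cls = compose(Dataset, [UnitMixin, CosmologyMixin])
instance = cls("path/to/data")
```

The resulting `instance` is still a `Dataset`, so `isinstance` checks against the base class keep working.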

### CandidateStatus for Type Detection
```python
from scida.discovertypes import CandidateStatus
from scida.interface import Dataset

class MyDataset(Dataset):
    @classmethod
    def validate_path(cls, path, **kwargs):
        if definitely_this_type:
            return CandidateStatus.YES
        if possibly_this_type:
            return CandidateStatus.MAYBE
        return CandidateStatus.NO
```

## Common Gotchas

1. **Always use `uv run`** for Python commands
2. **Dask arrays are lazy** - use `.compute()` to get values
3. **Units via pint** - access magnitude with `.magnitude`
4. **Test data required** - many tests need `SCIDA_TESTDATA_PATH` set
5. **Metadata in `_metadata_raw`** - raw HDF5 attributes stored here
6. **Field recipes vs fields** - recipes are functions, fields are arrays

## Useful Development Patterns

### Inspecting Dataset Structure
```python
import scida

ds = scida.load("path/to/data")
ds.info()  # Print structure overview
ds.data.keys()  # List top-level containers
ds.data["PartType0"].keys()  # List fields
```

### Debugging Type Detection
```python
import logging

import scida

logging.basicConfig(level=logging.DEBUG)
ds = scida.load("path/to/data")
# Check logs for type detection info
```

### Working with Units
```python
import scida

ds = scida.load("path", units="cgs")  # Load with CGS units
arr = ds.data["PartType0"]["Coordinates"]
arr.units  # Check units
arr.magnitude  # Get raw values
arr.to("kpc")  # Convert units
```