---
name: architecture-design
description: |
  Use ONLY when creating NEW registrable components in ML projects that require Factory/Registry patterns.

  ✅ USE when:
  - Creating a new Dataset class (needs @register_dataset)
  - Creating a new Model class (needs @register_model)
  - Creating a new module directory with an __init__.py factory
  - Initializing a new ML project structure from scratch
  - Adding new component types (Augmentation, CollateFunction, Metrics)

  ❌ DO NOT USE when:
  - Modifying existing functions or methods
  - Fixing bugs in existing code
  - Adding helper functions or utilities
  - Refactoring without adding new registrable components
  - Making simple code changes to a single file
  - Modifying configuration files
  - Reading or understanding existing code

  Key indicator: Does the task require a @register_* decorator or the Factory pattern? If not, skip this skill.
version: 1.2.0
---

# Architecture Design - ML Project Template

This skill defines the standard code architecture for machine learning projects based on the template structure. When modifying or extending code, follow these patterns to maintain consistency.

## Overview

The project follows a modular, extensible architecture with a clear separation of concerns. Each module (data, model, trainer, analysis) is independently organized using factory and registry patterns for maximum flexibility.

## Core Design Patterns

### Factory Pattern

Each module uses a factory to create instances dynamically:

```python
# Example from data_module/dataset/__init__.py
from typing import Dict

DATASET_FACTORY: Dict = {}

def DatasetFactory(data_name: str):
    dataset = DATASET_FACTORY.get(data_name, None)
    if dataset is None:
        print(f"{data_name} dataset is not implemented; using the simple dataset")
        dataset = DATASET_FACTORY.get('simple')
    return dataset
```

For detailed guidance, refer to `references/factory_pattern.md`.

### Registry Pattern

Components register themselves via decorators:

```python
# Example from data_module/dataset/simple_dataset.py
@register_dataset("simple")
class SimpleDataset(Dataset):
    def __init__(self, data):
        self.data = data
```

For detailed guidance, refer to `references/registry_pattern.md`.

### Auto-Import Pattern

Modules automatically discover and import their submodules:

```python
# Example from data_module/dataset/__init__.py
import os

models_dir = os.path.dirname(__file__)
import_modules(models_dir, "src.data_module.dataset")
```

For detailed guidance, refer to `references/auto_import.md`.
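The `@register_dataset` decorator used above is supplied by the registry itself. A minimal sketch, assuming it lives in `src/data_module/dataset/__init__.py` alongside `DATASET_FACTORY` (the template's actual implementation may differ):

```python
from typing import Callable

# Sketch only: assumes DATASET_FACTORY is the dict defined above.
def register_dataset(name: str) -> Callable[[type], type]:
    """Class decorator that records a Dataset class under `name`."""
    def decorator(cls: type) -> type:
        DATASET_FACTORY[name] = cls
        return cls  # return the class unchanged so it can still be used directly
    return decorator
```

With this in place, `DatasetFactory("simple")` returns the `SimpleDataset` class, which the caller then instantiates.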
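Likewise, `import_modules` is referenced but not shown. A plausible sketch built on `pkgutil` (the helper's actual location and signature in the template are assumptions):

```python
import importlib
import pkgutil

def import_modules(directory: str, package: str) -> None:
    """Import every module in `directory` so that @register_* decorators run."""
    for _, module_name, _ in pkgutil.iter_modules([directory]):
        importlib.import_module(f"{package}.{module_name}")
```

Importing `src.data_module.dataset` then triggers every `@register_dataset` decorator in the directory, populating `DATASET_FACTORY` as a side effect.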
## Directory Structure

```
project/
├── run/
│   ├── pipeline/              # Main workflow scripts
│   │   ├── training/          # Training pipelines
│   │   ├── prepare_data/      # Data preparation pipelines
│   │   └── analysis/          # Analysis pipelines
│   └── conf/                  # Hydra configuration files
│       ├── training/          # Training configs
│       ├── dataset/           # Dataset configs
│       ├── model/             # Model configs
│       ├── prepare_data/      # Data prep configs
│       └── analysis/          # Analysis configs
│
├── src/
│   ├── data_module/           # Data processing module
│   │   ├── dataset/           # Dataset implementations
│   │   ├── augmentation/      # Data augmentation
│   │   ├── collate_fn/        # Collate functions
│   │   ├── compute_metrics/   # Metrics computation
│   │   ├── prepare_data/      # Data preparation logic
│   │   ├── data_func/         # Data utility functions
│   │   └── utils.py           # Module-specific utilities
│   │
│   ├── model_module/          # Model implementations
│   │   ├── brain_decoder/     # Brain decoder models
│   │   └── model/             # Alternative model location
│   │
│   ├── trainer_module/        # Training logic
│   ├── analysis_module/       # Analysis and evaluation
│   ├── llm/                   # LLM-related code
│   └── utils/                 # Shared utilities
│
├── data/
│   ├── raw/                   # Original, immutable data
│   ├── processed/             # Cleaned, transformed data
│   └── external/              # Third-party data
│
├── outputs/
│   ├── logs/                  # Training and evaluation logs
│   ├── checkpoints/           # Model checkpoints
│   ├── tables/                # Result tables
│   └── figures/               # Plots and visualizations
│
├── pyproject.toml             # Project configuration
├── uv.lock                    # Dependency lock file
├── TODO.md                    # Task tracking
├── README.md                  # Project documentation
└── .gitignore                 # Git ignore rules
```

For the detailed directory structure with file descriptions, refer to `references/structure.md`.

## Module Organization

### Creating a New Dataset

When adding a new dataset:

1. Create a file in `src/data_module/dataset/`
2. Use the `@register_dataset("name")` decorator
3. Inherit from `torch.utils.data.Dataset`
4. Implement `__init__`, `__len__`, `__getitem__`

```python
from typing import Dict

import torch
from torch.utils.data import Dataset

from src.data_module.dataset import register_dataset

@register_dataset("custom")
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i: int) -> Dict[str, torch.Tensor]:
        return self.data[i]
```

### Creating a New Model

**CRITICAL: Models use the config-driven pattern.**

When adding a new model:

1. Create a file in `src/model_module/model/` or the appropriate module subdirectory
2. Use the `@register_model('ModelName')` decorator
3. `__init__` accepts **ONLY** the `cfg` parameter - all hyperparameters come from the config
4. `forward()` returns a dict: `{"loss": loss, "labels": labels, "logits": logits}`
5. Handle training vs. inference modes using `self.training`

```python
import torch.nn as nn

from src.model_module.brain_decoder import register_model

@register_model('MyModel')
class MyModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.task = cfg.dataset.task  # ALL hyperparameters come from cfg
        self.hidden_dim = cfg.model.hidden_dim
        self.output_dim = cfg.dataset.target_size[cfg.dataset.task]
        self.head = nn.Linear(self.hidden_dim, self.output_dim)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x, labels=None, **kwargs):
        logits = self.head(x)
        loss = None
        if self.training and labels is not None:
            # Training mode: compute the loss against the provided labels
            loss = self.criterion(logits, labels)
        return {"loss": loss, "labels": labels, "logits": logits}
```

### Adding Data Augmentation

When adding augmentation:

1. Create a file in `src/data_module/augmentation/`
2. Implement the transformation function
3. Register it with the factory if needed (see the sketch below)
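As an illustration of these steps, here is a sketch of a Gaussian-noise augmentation. The `register_augmentation` decorator and its module path are assumptions modeled on the dataset registry above, not confirmed APIs of this template:

```python
import torch

# Hypothetical decorator, assumed to mirror register_dataset in
# src/data_module/augmentation/__init__.py.
from src.data_module.augmentation import register_augmentation

@register_augmentation("gaussian_noise")
class GaussianNoise:
    """Adds zero-mean Gaussian noise to a tensor sample."""

    def __init__(self, std: float = 0.1):
        self.std = std

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.randn_like(x) * self.std
```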
## Code Style Guidelines

For comprehensive style guidelines, refer to `references/code_style.md`.

**Key principles:**

- Always use type hints for function signatures
- Follow the import order: standard library → third-party → local
- Module `__init__.py` files contain the factory/registry logic
- Model classes must be config-driven

## Configuration Management

The project uses Hydra for configuration management:

- Config files in `run/conf/` are organized by module
- Each stage (training, analysis) has its own config structure
- Use YAML files for all configuration

## When Working on This Project

### Before Modifying Code

1. Read the relevant module's factory/registry pattern
2. Check existing implementations for consistency
3. Follow the established directory structure
4. Use registration decorators for new components

### Adding New Features

1. Determine which module the feature belongs to
2. Check whether similar functionality already exists
3. Follow the factory/registry pattern if creating new component types
4. Add configuration files if needed
5. Update documentation

### Code Review Checklist

- [ ] Uses the factory/registry pattern appropriately
- [ ] Follows the module directory structure
- [ ] Has proper type annotations
- [ ] Imports are correctly ordered
- [ ] Registration decorator is used
- [ ] Configuration files are added if needed

## Additional Resources

### Reference Files

For detailed information, consult:

- **`references/structure.md`** - Detailed directory structure with file descriptions
- **`references/factory_pattern.md`** - In-depth explanation of the factory pattern
- **`references/registry_pattern.md`** - In-depth explanation of the registry pattern
- **`references/auto_import.md`** - In-depth explanation of the auto-import pattern
- **`references/code_style.md`** - Comprehensive code style guidelines

### Example Files

Working examples in `examples/`:

- **`examples/custom_dataset.py`** - Custom dataset implementation
- **`examples/custom_model.py`** - Custom model implementation
- **`examples/augmentation_example.py`** - Data augmentation example
- **`examples/config_example.yaml`** - Configuration file example
- **`examples/pipeline_example.sh`** - Pipeline script example
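## Putting It Together

A minimal sketch of how a pipeline script might wire these pieces together with Hydra. The config path, config name, and field names (`cfg.dataset.name`, `cfg.model.name`) are assumptions, as is a `ModelFactory` analogous to `DatasetFactory`:

```python
import hydra
from omegaconf import DictConfig

from src.data_module.dataset import DatasetFactory
from src.model_module.brain_decoder import ModelFactory  # assumed analogue of DatasetFactory

@hydra.main(version_base=None, config_path="../conf", config_name="training/default")
def main(cfg: DictConfig) -> None:
    # Factories return the registered classes named in the config.
    dataset_cls = DatasetFactory(cfg.dataset.name)
    model_cls = ModelFactory(cfg.model.name)

    model = model_cls(cfg)  # models are config-driven: cfg is the only argument
    # Dataset construction arguments are project-specific
    # (the CustomDataset example above takes `data`).

if __name__ == "__main__":
    main()
```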