--- name: runtime-skills description: Universal Runtime best practices for PyTorch inference, Transformers models, and FastAPI serving. Covers device management, model loading, memory optimization, and performance tuning. allowed-tools: Read, Grep, Glob user-invocable: false --- # Universal Runtime Skills Best practices and code review checklists for the Universal Runtime - LlamaFarm's local ML inference server. ## Overview The Universal Runtime provides OpenAI-compatible endpoints for HuggingFace models: - Text generation (Causal LMs: GPT, Llama, Mistral, Qwen) - Text embeddings (BERT, sentence-transformers, ModernBERT) - Classification, NER, and reranking - OCR and document understanding - Anomaly detection **Directory**: `runtimes/universal/` **Python**: 3.11+ **Key Dependencies**: PyTorch, Transformers, FastAPI, llama-cpp-python ## Links to Shared Skills This skill extends the shared Python practices. Always apply these first: | Topic | File | Priority | |-------|------|----------| | Patterns | [python-skills/patterns.md](../python-skills/patterns.md) | Medium | | Async | [python-skills/async.md](../python-skills/async.md) | High | | Typing | [python-skills/typing.md](../python-skills/typing.md) | Medium | | Testing | [python-skills/testing.md](../python-skills/testing.md) | Medium | | Errors | [python-skills/error-handling.md](../python-skills/error-handling.md) | High | | Security | [python-skills/security.md](../python-skills/security.md) | Critical | ## Runtime-Specific Checklists | Topic | File | Key Points | |-------|------|------------| | PyTorch | [pytorch.md](pytorch.md) | Device management, dtype, memory cleanup | | Transformers | [transformers.md](transformers.md) | Model loading, tokenization, inference | | FastAPI | [fastapi.md](fastapi.md) | API design, streaming, lifespan | | Performance | [performance.md](performance.md) | Batching, caching, optimizations | ## Architecture ``` runtimes/universal/ ├── server.py # FastAPI app, model caching, endpoints ├── core/ │ └── logging.py # UniversalRuntimeLogger (structlog) ├── models/ │ ├── base.py # BaseModel ABC with device management │ ├── language_model.py # Transformers text generation │ ├── gguf_language_model.py # llama-cpp-python for GGUF │ ├── encoder_model.py # Embeddings, classification, NER, reranking │ └── ... # OCR, anomaly, document models ├── routers/ │ └── chat_completions/ # Chat completions with streaming ├── utils/ │ ├── device.py # Device detection (CUDA/MPS/CPU) │ ├── model_cache.py # TTL-based model caching │ ├── model_format.py # GGUF vs transformers detection │ └── context_calculator.py # GGUF context size computation └── tests/ ``` ## Key Patterns ### 1. Model Loading with Double-Checked Locking ```python _model_load_lock = asyncio.Lock() async def load_encoder(model_id: str, task: str = "embedding"): cache_key = f"encoder:{task}:{model_id}" if cache_key not in _models: async with _model_load_lock: # Double-check after acquiring lock if cache_key not in _models: model = EncoderModel(model_id, device, task=task) await model.load() _models[cache_key] = model return _models.get(cache_key) ``` ### 2. Device-Aware Tensor Operations ```python class BaseModel(ABC): def get_dtype(self, force_float32: bool = False): if force_float32: return torch.float32 if self.device in ("cuda", "mps"): return torch.float16 return torch.float32 def to_device(self, tensor: torch.Tensor, dtype=None): # Don't change dtype for integer tensors if tensor.dtype in (torch.int32, torch.int64, torch.long): return tensor.to(device=self.device) dtype = dtype or self.get_dtype() return tensor.to(device=self.device, dtype=dtype) ``` ### 3. TTL-Based Model Caching ```python _models: ModelCache[BaseModel] = ModelCache(ttl=300) # 5 min TTL async def _cleanup_idle_models(): while True: await asyncio.sleep(CLEANUP_CHECK_INTERVAL) for cache_key, model in _models.pop_expired(): await model.unload() ``` ### 4. Async Generation with Thread Pools ```python # GGUF models use blocking llama-cpp, run in executor self._executor = ThreadPoolExecutor(max_workers=1) async def generate(self, messages, max_tokens=512, ...): loop = asyncio.get_running_loop() return await loop.run_in_executor(self._executor, self._generate_sync) ``` ## Review Priority When reviewing Universal Runtime code: 1. **Critical** - Security - Path traversal prevention in file endpoints - Input sanitization for model IDs 2. **High** - Memory & Device - Proper CUDA/MPS cache clearing on unload - torch.no_grad() for inference - Correct dtype for device 3. **Medium** - Performance - Model caching patterns - Batch processing where applicable - Streaming implementation 4. **Low** - Code Style - Consistent with patterns.md - Proper type hints