---
name: embedding-engine
description: Embedding backends (InsightFace/PyTorch+ONNXRuntime vs TensorRT). Use when optimizing embedding throughput or debugging drift/fallbacks.
---

# Embedding Engine Skill

Use this skill to optimize embedding performance and debug embedding drift/fallback behavior.

## When to Use

- Embedding pipeline running slowly
- Need to switch between PyTorch and TensorRT
- Debugging embedding drift between backends
- Building/caching TensorRT engines
- Verifying ONNXRuntime/CoreML provider selection (macOS)

## Sub-agents

| Sub-agent | Purpose |
|-----------|---------|
| **PyTorchEmbeddingSubagent** | Reference ArcFace (training/validation) |
| **TensorRTEmbeddingSubagent** | GPU-optimized TRT inference |
| **ONNXEmbeddingSubagent** | Future ONNXRuntime C++ service (planned) |

## Current Backends

- **`pytorch` (default):** ArcFace via the `insightface` Python package (used by `tools/episode_run.py`)
- **`tensorrt` (optional):** TensorRT engine build + inference via `FEATURES/arcface_tensorrt/`

## Key Skills

### Embed faces with the configured backend

Run embedding with the configured backend (same interface as the pipeline).

```python
from tools.episode_run import get_embedding_backend

embedder = get_embedding_backend(
    backend_type="pytorch",  # or "tensorrt"
    device="cpu",
    tensorrt_config="config/pipeline/arcface_tensorrt.yaml",
    allow_cpu_fallback=True,
)
embedder.ensure_ready()
embeddings = embedder.encode(face_crops)  # (N, 512) L2-normalized
```

### Build a TensorRT engine from ONNX

```bash
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```

### Compare TensorRT vs PyTorch embeddings (parity + speedup)

```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

This uses `FEATURES/arcface_tensorrt/src/embedding_compare.py` and reports cosine similarity + L2 distance stats.
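
To make the reported metrics concrete, here is a minimal sketch of how cosine similarity and L2 distance between two sets of embeddings could be summarized. It is not the code in `embedding_compare.py`; the helper names `l2_normalize` and `parity_stats` are hypothetical, and it uses plain Python lists rather than arrays to stay dependency-free.

```python
import math
from statistics import mean


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def parity_stats(reference, candidate):
    """Compare two equally sized lists of embeddings row by row.

    Returns mean/min cosine similarity and mean/max L2 distance,
    both computed on L2-normalized vectors.
    """
    if len(reference) != len(candidate) or not reference:
        raise ValueError("need two non-empty embedding sets of equal length")
    cosines, distances = [], []
    for ref, cand in zip(reference, candidate):
        a, b = l2_normalize(ref), l2_normalize(cand)
        cosines.append(sum(x * y for x, y in zip(a, b)))
        distances.append(math.dist(a, b))
    return {
        "cosine_mean": mean(cosines),
        "cosine_min": min(cosines),
        "l2_mean": mean(distances),
        "l2_max": max(distances),
    }


# Toy 3-dimensional example; the embeddings in this skill are 512-dimensional.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidate = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
stats = parity_stats(reference, candidate)
print(stats)
```

For unit vectors the two metrics are linked by `l2**2 == 2 - 2 * cosine`, so a cosine similarity close to 1 implies a small L2 distance and either one can serve as a drift indicator.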

## Config Reference

**File:** `config/pipeline/embedding.yaml`

| Key | Default | Description |
|-----|---------|-------------|
| `embedding.backend` | `pytorch` | Backend: `pytorch` or `tensorrt` |
| `embedding.tensorrt_config` | `config/pipeline/arcface_tensorrt.yaml` | TensorRT config path |
| `validation.max_drift_cosine` | 0.001 | Drift tolerance (behavior depends on runtime) |

**File:** `config/pipeline/arcface_tensorrt.yaml`

| Key | Default | Description |
|-----|---------|-------------|
| `arcface_tensorrt.enabled` | false | Sandbox feature flag (engine must exist) |
| `tensorrt.precision` | fp16 | Engine precision |
| `tensorrt.max_batch_size` | 32 | Max batch for engine build |
| `tensorrt.workspace_size_mb` | 1024 | TRT workspace |
| `tensorrt.engine_s3_bucket` | null | Optional engine bucket |

## Engine Storage

TensorRT engines are specific to the GPU architecture they were built for. They are stored in S3:

```
s3://screenalytics-models/engines/
├── arcface_r100-fp16-sm75.plan   # Turing (T4, RTX 20xx)
├── arcface_r100-fp16-sm80.plan   # Ampere (A100)
├── arcface_r100-fp16-sm86.plan   # Ampere (RTX 30xx)
└── arcface_r100-fp16-sm89.plan   # Ada (RTX 40xx)
```

**Naming convention:** `{model_name}-{precision}-sm{arch}.plan`

## Common Issues

### "Engine not found" / TensorRT backend won't load

**Cause:** No engine built for the current GPU, or a config mismatch

**Fix:** Build locally:
```bash
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```

### Embedding drift too high

**Cause:** FP16 precision loss or TRT optimization changes

**Check:** Run parity compare:
```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

**Fix:** Use FP32 precision:
```yaml
tensorrt:
  precision: fp32  # default is fp16
```

### TensorRT slower than expected / falling back

**Cause:** Not batching, engine built with suboptimal shapes/precision, or the backend fell back

**Check:** Ensure `config/pipeline/embedding.yaml` has `embedding.backend: tensorrt` and re-run with `--mode benchmark`.
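
The config half of this check can be scripted. The sketch below is illustrative only: `check_backend_config` is a hypothetical helper, and it assumes the dotted key `embedding.backend` maps to a nested `embedding: {backend: ...}` mapping in the YAML file.

```python
def check_backend_config(config, expected="tensorrt"):
    """Return a list of problems; an empty list means the check passed.

    `config` is the already-parsed contents of config/pipeline/embedding.yaml
    (assumed nesting: {"embedding": {"backend": ...}}).
    """
    problems = []
    embedding = config.get("embedding") or {}
    backend = embedding.get("backend", "pytorch")  # documented default is pytorch
    if backend != expected:
        problems.append(f"embedding.backend is {backend!r}, expected {expected!r}")
    return problems


# In practice the dict would come from the YAML file, e.g. with PyYAML:
#   import yaml
#   with open("config/pipeline/embedding.yaml") as fh:
#       config = yaml.safe_load(fh)
config = {"embedding": {"backend": "pytorch"}}
print(check_backend_config(config))
```

An empty result only confirms the requested backend; it does not prove the runtime did not fall back, so the benchmark re-run is still needed.
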

**Fix:** Increase the batch size and make sure the GPU backend is in use:
```yaml
tensorrt:
  opt_batch_size: 32
  max_batch_size: 64
```

### Out of GPU memory

**Cause:** Engine workspace too large

**Check:** `nvidia-smi` during inference

**Fix:** Reduce workspace:
```yaml
tensorrt:
  workspace_size_mb: 512  # default is 1024
```

## Benchmark Reference

| Backend | Batch | Throughput | Latency (per batch) | VRAM |
|---------|-------|------------|---------------------|------|
| PyTorch | 32 | ~50 fps | ~640 ms | 2 GB |
| TensorRT FP16 | 32 | ~250 fps | ~128 ms | 1 GB |
| TensorRT FP32 | 32 | ~180 fps | ~178 ms | 1.5 GB |

## Diagnostic Output

```json
{
  "backend": "tensorrt",
  "engine_path": "~/.cache/screenalytics/engines/arcface_r100_v1-sm86.trt",
  "precision": "fp16",
  "batch_size": 32,
  "embedding_dim": 512,
  "throughput_fps": 245.3,
  "latency_ms": 130.5,
  "vram_mb": 1024,
  "validation": {
    "drift_vs_pytorch": 0.9995,
    "regression_test": "passed"
  }
}
```

## Key Files

| File | Purpose |
|------|---------|
| `tools/episode_run.py` | Pipeline embedding backend selection (`get_embedding_backend`) |
| `FEATURES/arcface_tensorrt/src/tensorrt_builder.py` | Engine build/cache + optional S3 |
| `FEATURES/arcface_tensorrt/src/tensorrt_inference.py` | TensorRT inference wrapper |
| `FEATURES/arcface_tensorrt/src/embedding_compare.py` | Parity + speedup compare utilities |
| `config/pipeline/embedding.yaml` | Backend selection + validation knobs |
| `config/pipeline/arcface_tensorrt.yaml` | TensorRT builder/runtime config |
| `FEATURES/arcface_tensorrt/tests/test_tensorrt_embedding.py` | Unit tests (synthetic) |
| `tests/ml/test_arcface_embeddings.py` | ML-gated embedding invariants |

## Related Skills

- [pipeline-insights](../pipeline-insights/SKILL.md) - General pipeline debugging
- [face-alignment](../face-alignment/SKILL.md) - Alignment before embedding