# ADR-072: WiFlow Pose Estimation Architecture - **Status**: Proposed - **Date**: 2026-04-02 - **Deciders**: ruv - **Relates to**: ADR-071 (ruvllm Training Pipeline), ADR-070 (Self-Supervised Pretraining), ADR-024 (Contrastive CSI Embedding / AETHER), ADR-069 (Cognitum Seed CSI Pipeline) ## Context The WiFi-DensePose project needs a neural architecture that can convert raw CSI amplitude data into 17-keypoint COCO pose estimates. The existing `train-ruvllm.js` pipeline uses a simple 2-layer FC encoder (8 -> 64 -> 128) that produces contrastive embeddings for presence detection but cannot output spatial keypoint coordinates. We evaluated published WiFi-based pose estimation architectures: | Architecture | Params | Input | Key Innovation | Publication | |-------------|--------|-------|---------------|-------------| | **WiFlow** | 4.82M | 540x20 | TCN + AsymConv + Axial Attention | arXiv:2602.08661 | | WiPose | 11.2M | 3x3x30x20 | 3D CNN + heatmap regression | CVPR 2021 | | MetaFi++ | 8.6M | 114x30x20 | Transformer + meta-learning | NeurIPS 2023 | | Person-in-WiFi 3D | 15.3M | Multi-antenna | Deformable attention + 3D | CVPR 2024 | WiFlow is the lightest published SOTA architecture, designed specifically for commercial WiFi hardware. Its key advantage is operating on CSI amplitude only (no phase), which is critical for ESP32-S3 where phase calibration is unreliable. ### Why WiFlow 1. **Lightest SOTA**: 4.82M parameters at original scale; our adaptation targets ~2.5M 2. **Amplitude-only**: Discards phase, which is noisy on consumer hardware 3. **Published architecture**: Fully specified in arXiv:2602.08661, reproducible 4. **Temporal modeling**: TCN with dilated causal convolutions captures motion dynamics 5. **Efficient attention**: Axial attention reduces O(H^2W^2) to O(H^2W + HW^2) 6. **Proven on commercial WiFi**: Validated on commodity Intel 5300 and Atheros hardware ## Decision Implement the WiFlow architecture in pure JavaScript (ruvllm native) with the following adaptations for our ESP32 single TX/RX deployment. ### Architecture Overview ``` CSI Amplitude [128, 20] | Stage 1: TCN (Dilated Causal Conv) dilation = (1, 2, 4, 8), kernel = 7 128 -> 256 -> 192 -> 128 channels | Stage 2: Asymmetric Conv Encoder 1xk conv (k=3), stride (1,2) [1, 128, 20] -> [256, 8, 20] | Stage 3: Axial Self-Attention Width (temporal): 8 heads Height (feature): 8 heads | Decoder: Adaptive Avg Pool + Linear [256, 8, 20] -> pool -> [2048] -> [17, 2] | 17 COCO Keypoints [x, y] in [0, 1] ``` ### Our Adaptation vs Original WiFlow | Aspect | WiFlow Original | Our Adaptation | Reason | |--------|----------------|----------------|--------| | Input channels | 540 (18 links x 30 SC) | 128 (1 TX x 1 RX x 128 SC) | Single ESP32 link | | Time steps | 20 | 20 | Same | | TCN channels | 540 -> 256 -> 128 -> 64 | 128 -> 256 -> 192 -> 128 | Proportional reduction | | Spatial blocks | 4 (stride 2) | 4 (stride 2) | Same | | Attention heads | 8 | 8 | Same | | Parameters | 4.82M | ~1.8M | Fewer input channels | | Input type | Amplitude only | Amplitude only | Same | | Output | 17 x 2 | 17 x 2 | Same | ### Parameter Budget Breakdown | Stage | Parameters | % of Total | |-------|-----------|------------| | TCN (4 blocks, k=7, d=1,2,4,8) | ~969K | 54% | | Asymmetric Conv (4 blocks, 1x3, stride 2) | ~174K | 10% | | Axial Attention (width + height, 8 heads) | ~592K | 33% | | Pose Decoder (pool + linear -> 17x2) | ~70K | 4% | | **Total** | **~1.8M** | **100%** | ### Loss Function ``` L = L_H + 0.2 * L_B L_H = SmoothL1(predicted, target, beta=0.1) L_B = (1/14) * sum_b (bone_length_b - prior_b)^2 ``` 14 bone connections enforce anatomical constraints: - Nose-eye (x2): 0.06 - Eye-ear (x2): 0.06 - Shoulder-elbow (x2): 0.15 - Elbow-wrist (x2): 0.13 - Shoulder-hip (x2): 0.26 - Hip-knee (x2): 0.25 - Knee-ankle (x2): 0.25 - Shoulder width: 0.20 All lengths normalized to person height. ### Training Strategy (Camera-Free Pipeline) Since we have no ground-truth pose labels from cameras, training proceeds in three phases: #### Phase 1: Contrastive Pretraining - Temporal triplets: adjacent windows are positive pairs, distant windows are negative - Cross-node triplets: same-time windows from different ESP32 nodes are positive - Uses ruvllm `ContrastiveTrainer` with triplet + InfoNCE loss - Learns a representation where similar CSI states cluster together #### Phase 2: Pose Proxy Training - Generate coarse pose proxies from vitals data: - Person detected (presence > 0.3): place standing skeleton at center - High motion: perturb limb positions proportional to motion energy - Breathing: add micro-oscillation to torso keypoints - Train with SmoothL1 + bone constraint loss - Confidence-weighted updates (higher presence = stronger gradient) #### Phase 3: Self-Refinement (Future) - Multi-node consistency: same person seen from different nodes should produce consistent pose after geometric transform - Temporal smoothness: adjacent frames should produce similar poses - Bone constraint tightening: gradually reduce tolerance ### Integration with Existing Pipeline ``` train-ruvllm.js (ADR-071) train-wiflow.js (ADR-072) | | | 8-dim features | 128-dim raw CSI amplitude | -> 128-dim embedding | -> 17x2 keypoint coordinates | -> presence/activity/vitals | -> bone-constrained pose | | +-- ContrastiveTrainer -----+------+ +-- TrainingPipeline -------+------+ +-- LoRA per-node ----------+------+ +-- TurboQuant quantize ----+------+ +-- SafeTensors export -----+------+ ``` Both pipelines share the ruvllm infrastructure; WiFlow adds the deeper architecture for direct pose regression while the simple encoder handles embedding tasks. ### Performance Targets | Metric | Target | Notes | |--------|--------|-------| | PCK@20 | > 80% | On lab data with 2+ nodes | | Forward latency | < 50ms | Pi Zero 2W at INT8 | | Model size (INT8) | < 2 MB | TurboQuant | | Bone violation rate | < 10% | 50% tolerance | | Temporal jitter | < 3cm | Exponential smoothing | ### Risk Assessment | Risk | Severity | Mitigation | |------|----------|------------| | Single TX/RX has less spatial info than 18 links | High | 2-node multi-static compensates; cross-node fusion from ADR-029 | | Camera-free labels are coarse | Medium | Bone constraints enforce anatomy; contrastive pretrain provides structure | | Pure JS too slow for real-time | Medium | INT8 quantization; axial attention is O(H^2W+HW^2) not O(H^2W^2) | | Overfitting with ~5K frames | Medium | Temporal augmentation + noise + cross-node interpolation | | Phase not available (amplitude-only) | Low | WiFlow was designed amplitude-only; not a limitation | ## Consequences ### Positive - Proven SOTA architecture adapted to our hardware constraints - Pure JavaScript implementation runs everywhere ruvllm runs (Node.js, browser WASM) - Bone constraints enforce physically plausible outputs even with noisy inputs - Shares training infrastructure with existing ruvllm pipeline - Modular: each stage (TCN, AsymConv, Axial, Decoder) is independently testable ### Negative - ~1.8M parameters is 193x larger than simple CsiEncoder (9,344 params) - Forward pass is slower (~50ms vs <1ms for simple encoder) - Camera-free training will produce lower accuracy than supervised WiFlow - No ground-truth PCK evaluation possible without camera labels - Axial attention is O(N^2) within each axis, limiting scalability ### Neutral - FLOPs dominated by TCN (~48%) due to dilated convolutions - INT8 quantization brings model to ~1.7MB, viable for edge deployment - Architecture is fixed (no NAS); future work could explore lighter variants ## Implementation ### Files Created | File | Purpose | |------|---------| | `scripts/wiflow-model.js` | WiFlow architecture (all stages, loss, metrics) | | `scripts/train-wiflow.js` | Training pipeline (contrastive + pose proxy + LoRA + quant) | | `scripts/benchmark-wiflow.js` | Benchmarking (latency, params, FLOPs, memory, quality) | | `docs/adr/ADR-072-wiflow-architecture.md` | This document | ### Usage ```bash # Train on collected data node scripts/train-wiflow.js --data data/recordings/pretrain-*.csi.jsonl # Train with more epochs and custom output node scripts/train-wiflow.js --data data/recordings/*.csi.jsonl --epochs 50 --output models/wiflow-v2 # Contrastive pretraining only (no labels needed) node scripts/train-wiflow.js --data data/recordings/*.csi.jsonl --contrastive-only # Benchmark node scripts/benchmark-wiflow.js # Benchmark with trained model node scripts/benchmark-wiflow.js --model models/wiflow-v1 ``` ### Dependencies - ruvllm (vendored at `vendor/ruvector/npm/packages/ruvllm/src/`) - `ContrastiveTrainer`, `tripletLoss`, `infoNCELoss`, `computeGradient` - `TrainingPipeline` - `LoraAdapter`, `LoraManager` - `EwcManager` - `ModelExporter`, `SafeTensorsWriter` - No external ML frameworks (no PyTorch, no TensorFlow, no ONNX Runtime) ## References - WiFlow: arXiv:2602.08661 - COCO Keypoints: https://cocodataset.org/#keypoints-2020 - Axial Attention: Wang et al., "Axial-DeepLab", ECCV 2020 - TCN: Bai et al., "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling", 2018