# ADR-079: Camera Ground-Truth Training Pipeline

- **Status**: Accepted
- **Date**: 2026-04-06
- **Deciders**: ruv
- **Relates to**: ADR-072 (WiFlow Architecture), ADR-070 (Self-Supervised Pretraining), ADR-071 (ruvllm Training Pipeline), ADR-024 (AETHER Contrastive), ADR-064 (Multimodal Ambient Intelligence), ADR-075 (MinCut Person Separation)

## Context

WiFlow (ADR-072) currently trains without ground-truth pose labels, using proxy poses generated from presence/motion heuristics. This produces a PCK@20 of only 2.5% — far below the 30-50% achievable with supervised training. The fundamental bottleneck is the absence of spatial keypoint labels.

Academic WiFi pose estimation systems (Wi-Pose, Person-in-WiFi 3D, MetaFi++) all train with synchronized camera ground truth and achieve PCK@20 of 40-85%. They discard the camera at deployment — the camera is a training-time teacher, not a runtime dependency.

ADR-064 already identified this: *"Record CSI + mmWave while performing signs with a camera as ground truth, then deploy camera-free."* This ADR specifies the implementation.

### Current Training Pipeline Gap

```
Current:
  CSI amplitude → WiFlow → 17 keypoints   (proxy-supervised, PCK@20 = 2.5%)
                     ↑
        Heuristic proxies:
        - Standing skeleton when presence > 0.3
        - Limb perturbation from motion energy
        - No spatial accuracy
```

### Target Pipeline

```
Training:
  CSI amplitude ──→ WiFlow ──→ 17 keypoints   (camera-supervised, PCK@20 target: 35%+)
                                  ↑
  Laptop camera ──→ MediaPipe ──→ 17 COCO keypoints (ground truth)
                                  (time-synchronized, 30 fps)

Deploy:
  CSI amplitude ──→ WiFlow ──→ 17 keypoints   (camera-free, trained model only)
```

## Decision

Build a camera ground-truth collection and training pipeline using the laptop webcam as a teacher signal. The camera is used **only during training data collection** and is not required at deployment.

### Architecture Overview

```
┌──────────────────────────────────────────────────────────────────┐
│                      Data Collection Phase                       │
│                                                                  │
│  ESP32-S3 nodes ──UDP──→ Sensing Server ──→ CSI frames (.jsonl)  │
│                           ↑ time sync                            │
│  Laptop Camera ──→ MediaPipe Pose ──→ Keypoints (.jsonl)         │
│        ↑                                                         │
│  collect-ground-truth.py                                         │
│  (single orchestrator)                                           │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                          Training Phase                          │
│                                                                  │
│  Paired dataset: { csi_window[128,20], keypoints[17,2], conf }   │
│                              ↓                                   │
│          train-wiflow-supervised.js                              │
│            Phase 1: Contrastive pretrain (ADR-072, reuse)        │
│            Phase 2: Supervised keypoint regression (NEW)         │
│            Phase 3: Fine-tune with bone constraints + confidence │
│                              ↓                                   │
│  WiFlow model (1.8M params) → SafeTensors export                 │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│                     Deployment (camera-free)                     │
│                                                                  │
│  ESP32-S3 CSI → Sensing Server → WiFlow inference → 17 keypoints │
│  (No camera. Trained model runs on CSI input only.)              │
└──────────────────────────────────────────────────────────────────┘
```

### Component 1: `scripts/collect-ground-truth.py`

Single Python script that orchestrates synchronized capture from the laptop camera and the ESP32 CSI stream.
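The capture flow below relies on one helper that remaps MediaPipe's 33 pose landmarks onto the 17 COCO joints (the full index mapping is tabulated after the capture flow). A minimal sketch of that helper: the name matches the pseudocode, but the body is illustrative only, assuming MediaPipe's normalized-landmark output (`x`, `y`, `visibility` in [0, 1]).

```python
# Illustrative sketch only. Indices follow the MediaPipe → COCO table below.
MP_TO_COCO = [0, 2, 5, 7, 8, 11, 12, 13, 14, 15, 16, 23, 24, 25, 26, 27, 28]

def map_mediapipe_to_coco(pose_landmarks):
    """Return ([17, 2] normalized keypoints, [17] per-joint visibility)."""
    keypoints, visibility = [], []
    for mp_idx in MP_TO_COCO:
        lm = pose_landmarks.landmark[mp_idx]
        keypoints.append([lm.x, lm.y])      # normalized [0, 1] image coordinates
        visibility.append(lm.visibility)    # used later for confidence / loss weighting
    return keypoints, visibility
```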
**Dependencies:** `mediapipe`, `opencv-python`, `requests` (all pip-installable, no GPU)

**Capture flow:**

```python
# Pseudocode: write_jsonl and draw_skeleton are small helpers elided here
import time

import cv2
import mediapipe as mp
import requests

mp_pose = mp.solutions.pose.Pose(model_complexity=1)   # MediaPipe Pose (33 landmarks)
camera = cv2.VideoCapture(0)                           # Laptop webcam
sensing_api = "http://localhost:3000"                  # Sensing server

# Start CSI recording via the existing API
requests.post(f"{sensing_api}/api/v1/recording/start")

while recording:                                       # until the operator stops the session
    ok, frame = camera.read()
    if not ok:
        break
    t = time.time_ns()                                 # Nanosecond timestamp

    # MediaPipe Pose: 33 landmarks → map to 17 COCO keypoints (table below)
    result = mp_pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        continue                                       # no person detected in this frame
    keypoints_17, vis_17 = map_mediapipe_to_coco(result.pose_landmarks)
    confidence = sum(vis_17) / len(vis_17)             # mean visibility of the 17 mapped joints

    # Write to ground-truth JSONL (one line per frame)
    write_jsonl({
        "ts_ns": t,
        "keypoints": keypoints_17,                     # [[x,y], ...] normalized [0,1]
        "confidence": confidence,                      # 0-1, used for loss weighting
        "n_visible": sum(1 for v in vis_17 if v > 0.5),
    })

    # Optional: show live preview with skeleton overlay
    if preview:
        draw_skeleton(frame, keypoints_17)
        cv2.imshow("Ground Truth", frame)
        cv2.waitKey(1)

# Stop CSI recording
requests.post(f"{sensing_api}/api/v1/recording/stop")
```

**MediaPipe → COCO keypoint mapping:**

| COCO Index | Joint | MediaPipe Index |
|------------|-------|-----------------|
| 0 | Nose | 0 |
| 1 | Left Eye | 2 |
| 2 | Right Eye | 5 |
| 3 | Left Ear | 7 |
| 4 | Right Ear | 8 |
| 5 | Left Shoulder | 11 |
| 6 | Right Shoulder | 12 |
| 7 | Left Elbow | 13 |
| 8 | Right Elbow | 14 |
| 9 | Left Wrist | 15 |
| 10 | Right Wrist | 16 |
| 11 | Left Hip | 23 |
| 12 | Right Hip | 24 |
| 13 | Left Knee | 25 |
| 14 | Right Knee | 26 |
| 15 | Left Ankle | 27 |
| 16 | Right Ankle | 28 |

### Component 2: Time Alignment (`scripts/align-ground-truth.js`)

CSI frames arrive at ~100 Hz with server-side timestamps. Camera keypoints arrive at ~30 fps with client-side timestamps. Alignment is needed because:

1. Camera and sensing server clocks differ (typically < 50 ms on LAN)
2. CSI is aggregated into 20-frame windows for WiFlow input
3. Ground-truth keypoints must be averaged over the same window

**Alignment algorithm:**

```
For each CSI window W_i (20 frames, ~200ms at 100Hz):
    t_start = W_i.first_frame.timestamp
    t_end   = W_i.last_frame.timestamp

    # Find all camera keypoints within this time window
    matching_keypoints = [k for k in camera_data if t_start <= k.ts <= t_end]

    if len(matching_keypoints) >= 3:        # At least 3 camera frames per window
        # Average keypoints, weighted by confidence
        avg_keypoints  = weighted_mean(matching_keypoints, weights=confidences)
        avg_confidence = mean(confidences)

        paired_dataset.append({
            csi_window: W_i.amplitudes,     # [128, 20] float32
            keypoints: avg_keypoints,       # [17, 2] float32
            confidence: avg_confidence,     # scalar
            n_camera_frames: len(matching_keypoints),
        })
```

**Clock sync strategy:**

- NTP is sufficient (< 20 ms error on LAN)
- The 200 ms CSI window is 10x larger than typical clock drift
- For tighter sync: use a handclap/jump as a sync marker — a visible spike in both CSI motion energy and camera skeleton velocity. Auto-detect and align.

**Output:** `data/recordings/paired-{timestamp}.jsonl` — one line per paired sample:

```json
{"csi": [128x20 flat], "kp": [[0.45,0.12], ...], "conf": 0.92, "ts": 1775300000000}
```
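The production alignment script is JavaScript, but the pairing logic is compact enough to sketch. The following Python version is illustrative only; it assumes the CSI frames and camera records have already been loaded from their JSONL files, share nanosecond `ts_ns` timestamps, and are grouped into non-overlapping 20-frame windows.

```python
import numpy as np

def pair_windows(csi_frames, camera_records, window=20, min_cam_frames=3):
    """Pair non-overlapping CSI windows with confidence-weighted camera keypoints.
    csi_frames: list of {"ts_ns", "amplitudes" [128]} (key names assumed here).
    camera_records: list of {"ts_ns", "keypoints" [17,2], "confidence"}.
    Illustrative sketch only."""
    pairs = []
    for start in range(0, len(csi_frames) - window + 1, window):
        frames = csi_frames[start:start + window]
        t_start, t_end = frames[0]["ts_ns"], frames[-1]["ts_ns"]

        # Camera frames that fall inside this CSI window
        cams = [c for c in camera_records if t_start <= c["ts_ns"] <= t_end]
        if len(cams) < min_cam_frames:
            continue                                     # skip under-observed windows

        kps = np.array([c["keypoints"] for c in cams])   # [n, 17, 2]
        conf = np.array([c["confidence"] for c in cams]) # [n]
        avg_kp = (kps * conf[:, None, None]).sum(0) / conf.sum()  # confidence-weighted mean

        pairs.append({
            "csi": np.array([f["amplitudes"] for f in frames]).T.tolist(),  # [128, 20]
            "kp": avg_kp.tolist(),                       # [17, 2]
            "conf": float(conf.mean()),
            "ts": t_start,
            "n_camera_frames": len(cams),
        })
    return pairs
```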
### Component 3: Supervised Training (`scripts/train-wiflow-supervised.js`)

Extends the existing `train-ruvllm.js` pipeline with a supervised phase.

**Phase 1: Contrastive Pretrain (reuse ADR-072)**

- Same as existing: temporal + cross-node triplets
- Learns CSI representation without labels
- 50 epochs, ~5 min on laptop

**Phase 2: Supervised Keypoint Regression (NEW)**

- Load paired dataset from Component 2
- Loss: confidence-weighted SmoothL1 on keypoints

```
L_supervised = (1/N) * sum_i [ conf_i * SmoothL1(pred_i, gt_i, beta=0.05) ]
```

- Only train on samples where `conf > 0.5` (discard frames where MediaPipe lost tracking)
- Learning rate: 1e-4 with cosine decay
- 200 epochs, ~15 min on laptop CPU (1.8M params, no GPU needed)

**Phase 3: Refinement with Bone Constraints**

- Fine-tune with combined loss:

```
L = L_supervised + 0.3 * L_bone + 0.1 * L_temporal

L_bone     = (1/14) * sum_b (bone_len_b - prior_b)^2   # ADR-072 bone priors
L_temporal = SmoothL1(kp_t, kp_{t-1})                  # Temporal smoothness
```

- 50 epochs at lower LR (1e-5)
- Tighten the bone constraint weight from 0.3 → 0.5 over epochs

**Phase 4: Quantization + Export**

- Reuse ruvllm TurboQuant: float32 → int8 (4x smaller, ~881 KB)
- Export via SafeTensors for cross-platform deployment
- Validate that the quantized model's PCK@20 stays within 2% of full precision

### Component 4: Evaluation Script (`scripts/eval-wiflow.js`)

Measure actual PCK@20 using held-out paired data (20% split).

```
PCK@k = (1/N) * sum_i [ (||pred_i - gt_i|| < (k/100) * torso_length) ? 1 : 0 ]
```

**Metrics reported:**

| Metric | Description | Target |
|--------|-------------|--------|
| PCK@20 | % of keypoints within 20% torso length | > 35% |
| PCK@50 | % within 50% torso length | > 60% |
| MPJPE | Mean per-joint position error (pixels) | < 40px |
| Per-joint PCK | Breakdown by joint (wrists are hardest) | Report all 17 |
| Inference latency | Single window prediction time | < 50ms |

### Optimization Strategy

#### O1: Curriculum Learning

Train easy poses first, hard poses later:

| Stage | Epochs | Data Filter | Rationale |
|-------|--------|-------------|-----------|
| 1 | 50 | `conf > 0.9`, standing only | Establish stable skeleton baseline |
| 2 | 50 | `conf > 0.7`, low motion | Add sitting, subtle movements |
| 3 | 50 | `conf > 0.5`, all poses | Full dataset including occlusions |
| 4 | 50 | All data, with augmentation | Robustness via noise injection |

#### O2: Data Augmentation (CSI domain)

Augment CSI windows to increase the effective dataset size without collecting more data:

| Augmentation | Implementation | Expected Gain |
|--------------|----------------|---------------|
| Time shift | Roll CSI window by ±2 frames | +30% data |
| Amplitude noise | Gaussian noise, sigma=0.02 | Robustness |
| Subcarrier dropout | Zero 10% of subcarriers randomly | Robustness |
| Temporal flip | Reverse window + reverse keypoint velocity | +100% data |
| Multi-node mix | Swap node CSI, keep same-time keypoints | Cross-node generalization |

#### O3: Knowledge Distillation from MediaPipe

Instead of raw keypoint regression, distill MediaPipe's confidence and heatmap information (sketched below):

```
L_distill = KL_div(softmax(wifi_heatmap / T), softmax(camera_heatmap / T))
```

- Temperature T=4 for soft targets (transfers inter-joint relationships)
- WiFlow predicts a 17-channel heatmap [17, H, W] instead of direct [17, 2]
- Argmax for final keypoint extraction
- **Trade-off:** Adds ~200K params for the heatmap decoder, but improves spatial precision
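A minimal numpy sketch of the distillation term above, illustrative only: it treats the camera heatmap as the teacher, flattens each joint's H×W heatmap into one distribution, and applies the temperature-softened KL divergence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(wifi_heatmap, camera_heatmap, T=4.0):
    """KL(teacher || student) on temperature-softened heatmaps.
    Both inputs are [17, H, W] raw scores; illustrative sketch only."""
    j, h, w = wifi_heatmap.shape
    student = softmax(wifi_heatmap.reshape(j, h * w) / T)     # soft WiFi prediction
    teacher = softmax(camera_heatmap.reshape(j, h * w) / T)   # soft camera target
    kl = (teacher * (np.log(teacher + 1e-8) - np.log(student + 1e-8))).sum(axis=1)
    return float(kl.mean())                                   # averaged over the 17 joints
```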
#### O4: Active Learning Loop

Identify which poses the model is worst at and collect more data for those:

```
1. Train initial model on the first collection session
2. Run inference on new CSI data, compute prediction entropy
3. Flag high-entropy windows (model is uncertain)
4. During the next collection, the preview overlay highlights these moments:
   "Hold this pose — model needs more examples"
5. Re-train with the augmented dataset
```

Expected: accuracy saturates after 2-3 active-learning iterations.

#### O6: Subcarrier Selection (ruvector-solver)

Variance-based top-K subcarrier selection, equivalent to ruvector-solver's sparse interpolation (114→56). Removes noisy/static subcarriers before training:

```
For each subcarrier d in [0, dim):
    variance[d] = mean over samples of temporal_variance(csi[d, :])
Select top-K by variance (K = dim * 0.5)
```

**Validated:** 128 → 56 subcarriers (56% input reduction), proportional model size reduction.

#### O7: Attention-Weighted Subcarriers (ruvector-attention)

Compute per-subcarrier attention weights based on temporal energy correlation with ground-truth keypoint motion. High-energy subcarriers that covary with skeleton movement get amplified:

```
For each subcarrier d:
    energy[d] = sum of squared first-differences over time
weight[d] = softmax(energy, temperature=0.1)
Apply: csi[d, :] *= weight[d] * dim    (mean weight = 1)
```

**Validated:** Top-5 attention subcarriers identified automatically per dataset.

#### O8: Stoer-Wagner MinCut Person Separation (ruvector-mincut / ADR-075)

JS implementation of the Stoer-Wagner algorithm for person separation in CSI, equivalent to `DynamicPersonMatcher` in `wifi-densepose-train/src/metrics.rs`. Builds a subcarrier correlation graph and finds the minimum cut to identify person-specific subcarrier clusters:

```
1. Build a dim×dim Pearson correlation matrix across subcarriers
2. Run Stoer-Wagner min-cut on the correlation graph
3. Partition subcarriers into person-specific groups
4. Train per-partition models for multi-person scenarios
```

**Validated:** Stoer-Wagner executes on a 56-dim graph and identifies partition boundaries.

#### O9: Multi-SPSA Gradient Estimation

Average over K=3 random perturbation directions per gradient step. This reduces the gradient-estimate standard deviation by sqrt(K) ≈ 1.73x compared to single-direction SPSA, at 3x forward-pass cost (a net win for convergence quality):

```
For k in 1..K:
    delta_k = random ±1 per parameter
    grad_k  = (loss(w + eps*delta_k) - loss(w - eps*delta_k)) / (2*eps*delta_k)
grad = mean(grad_1, ..., grad_K)
```
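A minimal numpy sketch of the K-direction SPSA estimator, illustrative only; it assumes a scalar loss function over a flat parameter vector and Rademacher (±1) perturbations.

```python
import numpy as np

def multi_spsa_gradient(loss_fn, w, eps=1e-3, k=3, rng=None):
    """Average K two-sided SPSA gradient estimates.
    loss_fn: callable mapping a flat parameter vector to a scalar loss.
    Illustrative sketch only."""
    rng = rng or np.random.default_rng()
    grads = []
    for _ in range(k):
        delta = rng.choice([-1.0, 1.0], size=w.shape)     # random ±1 per parameter
        loss_plus = loss_fn(w + eps * delta)
        loss_minus = loss_fn(w - eps * delta)
        grads.append((loss_plus - loss_minus) / (2.0 * eps * delta))
    return np.mean(grads, axis=0)

# Usage: one gradient-descent step on a toy quadratic loss
# w = np.zeros(10)
# w -= 0.1 * multi_spsa_gradient(lambda p: float((p - 1.0) @ (p - 1.0)), w)
```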
#### O10: Mac M4 Pro Training via Tailscale

Training runs on a Mac Mini M4 Pro (16-core GPU, ARM NEON SIMD) via Tailscale SSH, using ruvllm's native Node.js SIMD ops:

| | Windows (CPU) | Mac M4 Pro |
|---|---|---|
| Node.js | v24.12.0 (x86) | v25.9.0 (ARM) |
| SIMD | SSE4/AVX2 | NEON |
| Cores | Consumer laptop | 12P + 4E cores |
| Training | Slow (minutes/epoch) | Fast (seconds/epoch) |

#### O5: Cross-Environment Transfer

Train on one room, deploy in another:

| Strategy | Implementation |
|----------|----------------|
| Room-invariant features | Normalize CSI by running mean/variance |
| LoRA adapters | Train a rank-4 LoRA per room (ADR-071) — 7.3 KB each |
| Few-shot calibration | 2 min of camera data in the new room → fine-tune the LoRA only |
| AETHER embeddings | Use contrastive room-independent features (ADR-024) as input |

The LoRA approach is the most practical: ship a base model, then collect 2 min of calibration data per new room using the laptop camera.

### Data Collection Protocol

Recommended collection sessions per room:

| Session | Duration | Activity | People | Total CSI Frames |
|---------|----------|----------|--------|------------------|
| 1. Baseline | 5 min | Empty + 1 person entry/exit | 0-1 | 30,000 |
| 2. Standing poses | 5 min | Stand, arms up/down/sides, turn | 1 | 30,000 |
| 3. Sitting | 5 min | Sit, type, lean, stand up/sit down | 1 | 30,000 |
| 4. Walking | 5 min | Walk paths across room | 1 | 30,000 |
| 5. Mixed | 5 min | Varied activities, transitions | 1 | 30,000 |
| 6. Multi-person | 5 min | 2 people, varied activities | 2 | 30,000 |
| **Total** | **30 min** | | | **180,000** |

At 20-frame windows: **9,000 paired training samples** per 30-min session. With augmentation (O2): **~27,000 effective samples**.

Camera placement: position the laptop so the camera has a clear view of the sensing area. The camera FOV should cover the same space the ESP32 nodes cover.

### File Structure

```
scripts/
  collect-ground-truth.py      # Camera capture + MediaPipe + CSI sync
  align-ground-truth.js        # Time-align CSI windows with camera keypoints
  train-wiflow-supervised.js   # Supervised training pipeline
  eval-wiflow.js               # PCK evaluation on held-out data

data/
  ground-truth/                # Raw camera keypoint captures
    gt-{timestamp}.jsonl
  paired/                      # Aligned CSI + keypoint pairs
    paired-{timestamp}.jsonl

models/
  wiflow-supervised/           # Trained model outputs
    wiflow-v1.safetensors
    wiflow-v1-int8.safetensors
    training-log.json
    eval-report.json
```

### Privacy Considerations

- Camera frames are processed **locally** by MediaPipe — no cloud upload
- Raw video is **never saved** — only extracted keypoint coordinates are stored
- The `.jsonl` ground-truth files contain only `[x,y]` joint coordinates, not images
- The trained model runs on CSI only — no camera data leaves the laptop
- Users can delete `data/ground-truth/` after training; the model is self-contained

## Consequences

### Positive

- **10-20x accuracy improvement**: PCK@20 from 2.5% → 35%+ with real supervision
- **Reuses existing infrastructure**: sensing server recording API, ruvllm training, SafeTensors
- **No new hardware**: laptop webcam + existing ESP32 nodes
- **Privacy preserved at deployment**: camera only needed during the 30-min training session
- **Incremental**: can improve with more collection sessions + active learning
- **Distributable**: trained model weights can be shared on HuggingFace (ADR-070)

### Negative

- **Camera placement matters**: must see the same area the ESP32 nodes sense
- **Single-room models**: need LoRA calibration per room (2 min + camera)
- **MediaPipe limitations**: occlusion, side views, and multiple people reduce keypoint quality
- **Time sync**: NTP drift can misalign frames (mitigated by 200 ms windows)

### Risks

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| MediaPipe keypoints too noisy | Low | Medium | Filter by confidence; MediaPipe is robust indoors |
| Clock drift > 100ms | Low | High | Add handclap sync-marker detection |
| Single camera can't see all poses | Medium | Medium | Position camera centrally; collect from 2 angles |
| Model overfits to one room | High | Medium | LoRA adapters + AETHER normalization (O5) |
| Insufficient data (< 5K pairs) | Low | High | Augmentation (O2) + active learning (O4) |

## Implementation Plan

| Phase | Task | Effort | Status |
|-------|------|--------|--------|
| P1 | `collect-ground-truth.py` — camera + MediaPipe capture | 2 hrs | **Done** |
| P2 | `align-ground-truth.js` — time alignment + pairing | 1 hr | **Done** |
| P3 | `train-wiflow-supervised.js` — supervised training | 3 hrs | **Done** |
| P4 | `eval-wiflow.js` — PCK evaluation | 1 hr | **Done** |
| P5 | ruvector optimizations (O6-O9) | 2 hrs | **Done** |
| P6 | Mac M4 Pro training via Tailscale (O10) | 1 hr | **Done** |
| P7 | Data collection session (30 min recording) | 1 hr | Pending |
| P8 | Training + evaluation on real paired data | 30 min | Pending |
| P9 | LoRA cross-room calibration (O5) | 2 hrs | Pending |

## Validated Hardware

| Component | Spec | Validated |
|-----------|------|-----------|
| Mac Mini camera | 1920x1080, 30fps | Yes — 14/17 keypoints, conf 0.94-1.0 |
| MediaPipe PoseLandmarker | v0.10.33 Tasks API, lite model | Yes — via Tailscale SSH |
| Mac M4 Pro GPU | 16-core, Metal 4, NEON SIMD | Yes — Node.js v25.9.0 |
| Tailscale SSH | LAN-accessible Mac, passwordless | Yes |
| ESP32-S3 CSI | 128 subcarriers, 100Hz | Yes — existing recordings |
| Sensing server recording API | `/api/v1/recording/start\|stop` | Yes — existing |

## Baseline Benchmark

Proxy-pose baseline (no camera supervision, standing-skeleton heuristic):

```
PCK@10:   11.8%
PCK@20:   35.3%
PCK@50:   94.1%
MPJPE:    0.067
Latency:  0.03ms/sample
```

Per-joint PCK@20: upper body (nose, shoulders, wrists) sits at 0% — the proxy has no spatial accuracy for these joints. Camera supervision targets these joints specifically.

## References

- WiFlow: arXiv:2602.08661 — WiFi-based pose estimation with TCN + axial attention
- Wi-Pose (CVPR 2021) — 3D CNN WiFi pose with camera supervision
- Person-in-WiFi 3D (CVPR 2024) — Deformable attention with camera labels
- MediaPipe Pose — Google's real-time 33-landmark body pose estimator
- MetaFi++ (NeurIPS 2023) — Meta-learning cross-modal WiFi sensing