# Multi-cam Audio Sync

Status: **Partial** (VS-27). The **sync tool** is shipped: `tools/sync-multicam.mjs` (ffmpeg I/O) + `tools/multicam.mjs` (pure DSP + manifest math, unit-tested to 100%). Angle-switching is resolved by the pure `resolveAngleCuts`; wiring it through the skill + editor handoff and emitting a true FCPXML multicam asset are **deferred** (see [Deferred](#6-deferred--follow-ups)).

This is the concrete, audio-sync half of the broader [`multicam.md`](multicam.md) design. A multicam group is a labeled subset of the [source pool](multiple-sources.md) that is **time-aligned** so a cut can switch angles over a shared timeline.

## 1. Purpose

Take a set of clips that cover **one event from several cameras/recorders** and **time-align them by their audio**, emitting a group manifest (`multicam.json`) with a per-member **offset + confidence**. The best audio is often a **separate recording** (a field recorder / external mic), and the cameras' frame rates frequently **don't match**; both are first-class here.

## 2. Technique (validated by deep research)

The VS-19/VS-27 ask was to "do deep research… use 3rd-party tools if needed." That research (fan-out web search + adversarial verification) **confirmed** the design's audio-cross-correlation approach and pinned down the specifics below. Key sources are cited in [§7](#7-research-findings--citations).

- **FFT cross-correlation of a conditioned mono signal.** Downmix each clip to mono and resample to a common rate (ffmpeg), then cross-correlate in the frequency domain via the convolution theorem: `IFFT(FFT(a)·conj(FFT(b)))`, `O(N log N)`. The argmax lag, divided by the sample rate, is the offset in **seconds**. This is the same primitive ffmpeg `axcorrelate` and `scipy.signal.correlate(method="fft")` use; doing it ourselves yields a single global offset + confidence instead of `axcorrelate`'s per-window time series.
- **Pure-JS core, ffmpeg only for I/O.** The correlation math is small and language-agnostic, so it lives in pure, unit-tested JS (`tools/multicam.mjs`). ffmpeg is used only to decode/downmix/resample the audio (and, if drift correction is ever added, to retime). No third-party sync binary is required.
- **Conditioning + method.** Mono downmix + resample is universal. The correlation **feature** is tunable: the default **log-energy-style envelope** (rectify + box-smooth, then mean-remove) is robust to per-mic gain/frequency-response differences at low SNR; `--feature raw` uses the waveform directly for maximum precision on clean audio; `--feature phat` runs **GCC-PHAT**, in which the cross-power spectrum is phase-whitened (each bin divided by its magnitude), giving a much sharper, more noise-immune peak for very low SNR (the textbook Knapp & Carter method).
- **Sub-sample precision.** The integer peak is refined by **parabolic interpolation** of the three correlation samples straddling it, so offsets are accurate below one sample at the analysis rate (matters for tight lip-sync). On by default; `--no-interpolate` falls back to integer-sample offsets.

## 3. Requirements

- **R-MCS1 Grouping.** The user names ≥2 clips as a multicam group (`--group-id`). `propose-groups` also **suggests** groups from a source pool (`sources.json`) by containing folder, overlapping recording windows (file creation timestamps + duration), or shared filename pattern; the pure heuristics live in [`multicam-groups.mjs`](../tools/multicam-groups.mjs). The skill shows the proposals for confirmation, then runs sync per group.
- **R-MCS2 Audio cross-correlation.** Align members by FFT cross-correlation against a reference, storing a per-member **offset (seconds)** + **confidence** (normalized correlation peak in `[0,1]`, or peak distinctness `1 − second/peak` for GCC-PHAT) + a **peak-to-second-peak ratio**.
  The offset is refined to **sub-sample precision** (parabolic peak interpolation; `--no-interpolate` to disable), and the correlation can be amplitude-based (`envelope`/`raw`) or phase-whitened (`phat`, GCC-PHAT), selected via `--feature`.
- **R-MCS3 Confidence gate → manual fallback.** Disposition by confidence: `auto` ≥ `--accept` (0.80), `review` in between, **`unsynced`** < `--reject` (0.50). An `unsynced` member is reported with a re-run hint; `--manual =` supplies the offset by hand (the silent / non-overlapping-audio case). User-supplied offsets are labeled `manual`.
- **R-MCS4 Audio-only member = reference + master audio (R-MC3).** A member with no video stream (probed via ffprobe) is treated as the **sync reference** AND the **master audio**. With several audio-only members, the longest is the reference; with none, the longest member overall anchors the group at offset 0.
- **R-MCS5 Seconds, never frames (R-MC4).** All alignment is in seconds via the audio sample clock, so **mismatched / non-integer frame rates** (29.97 vs 30, 59.94 vs 60) need no special handling. Each member keeps its own fps; the group records a **project fps** (default: the highest member fps) to conform to on output.
- **R-MCS6 Drift detection + correction (R-MC, "hard problem").** For long takes (longer than `--drift-min`, default 600 s) the offset is measured on a window near the start and near the end of the clip (each matched only against the reference region it is expected to land in, i.e. the global offset ± a window, so repetitive audio doesn't lock onto a spurious far match), and a line `offset(t) = slope·t + intercept` is fit. The **drift rate (ppm)** is recorded, a member past `DRIFT_WARN_PPM` (100 ppm) is flagged `driftWarning`, and a **retime correction** is emitted: `rateCorrection = 1 + slope` (the factor to run the member on the reference clock; `driftCorrection`/`atempoChain` give the ffmpeg `atempo` chain) plus `correctedOffsetSeconds` (the start-anchored offset to pair with the retime).
  **Applying** the retime on export/compositing lands with the editor-handoff wiring (VS-29).
- **R-MCS7 Group manifest.** Emit `multicam.json` (`{ groups: [...] }`): per group an `id`, `projectFps`, `referenceId`, `masterAudioId`, and `members` (`id`, `path`, `kind`, `fps`, `durationSeconds`, `offsetSeconds`, `confidence`, `peakRatio`, `sync`, `driftPpm`, `driftWarning`). See [§5](#5-manifest-schema).
- **R-MCS8 Angle resolution + flat-timeline handoff.** Given angle **switch points** over the shared timeline, `resolveAngleCuts` produces segments `{ memberId, timelineIn/Out, sourceIn/Out }` (with drift it maps `sourceIn = (timelineIn − correctedOffset) / rate`), and `expandMulticamGroup` wraps them into an [editor-handoff](editor-handoff.md) cut spec: silent video angle-segments over a continuous **master-audio track** (`audioTrack`). The export muxes the master audio under the switching angles (`rebuild.sh`), **applies the drift retime** (a drifting segment is `setpts`-stretched so its source span fills its slot), and writes FCPXML with the audio on a connected lane.
- **R-MCS9 True FCPXML multicam asset.** `export-multicam-fcpxml` emits a real FCP **multicam clip** from a group: a ``/`` with one `` per member (each angle's clip at its sync offset, shifted so the earliest is 0), referencing the **original** member media, plus one `` per angle switch selecting the active video angle + the master audio via ``. The user imports it into FCP and re-cuts angles live in the angle viewer. (FCPXML multicam is intricate; validate by importing into FCP; see the manual test plan.)

## 4. CLI

```
sync-multicam [options]
  --group-id        group id (default: "group")
  --project-fps     output fps (default: highest member fps)
  --sample-rate     mono analysis rate (default: 8000)
  --feature         correlation feature (default: envelope; phat = GCC-PHAT phase-whitened, noise-robust)
  --max-offset      max plausible start offset to search (default: 300)
  --accept <0..1>   auto-accept confidence (default: 0.8)
  --reject <0..1>   manual-fallback confidence (default: 0.5)
  --drift-min       estimate drift on clips longer than this (default: 600)
  --window          drift-probe window length (default: 30)
  --no-interpolate  disable sub-sample (parabolic) peak refinement
  --manual =        force a member's offset (silent/non-overlapping audio)
  --out             output path (default: ./multicam.json)
```

Member ids are the disambiguated filename slugs from [`sources.mjs`](../tools/sources.mjs) (`assignSourceIds`), so they match the multi-source manifest.

To suggest groups from a whole pool first (R-MCS1):

```
propose-groups [--strategy ] [--gap ] [--json]
```

It prints each proposed group (members + the reason) and a ready-to-run `sync-multicam` command; `auto` prefers overlapping recording windows when the files carry creation timestamps, else folder, else filename pattern.

To emit a true FCP multicam clip from a synced group (R-MCS9):

```
export-multicam-fcpxml --width --height [--group ] \
  [--switch =]… [--name ] [--total ] [--out ]
```

Each `--switch` sets an angle from that second onward; with none, one span runs on the first video angle. The flat-timeline path (`expandMulticamGroup` → `export-project`) remains the default; this is the advanced, FCP-only multicam asset.

## 5. Manifest schema

```jsonc
{
  "groups": [
    {
      "id": "ceremony",
      "projectFps": 30,
      "referenceId": "recorder",     // sync anchor (offset 0)
      "masterAudioId": "recorder",   // audio for the cut (audio-only member if any)
      "members": [
        {
          "id": "recorder", "path": "/…/recorder.wav", "kind": "audio", "fps": null,
          "durationSeconds": 1800, "offsetSeconds": 0, "confidence": 1, "peakRatio": null,
          "sync": "reference", "driftPpm": 0, "driftWarning": false
        },
        {
          "id": "cam-a", "path": "/…/cam-a.mov", "kind": "video", "fps": 29.97,
          "durationSeconds": 1795,
          "offsetSeconds": 2.5,            // cam-a started 2.5 s after the reference
          "confidence": 0.92, "peakRatio": 16.1, "sync": "auto",
          "driftPpm": 12, "driftWarning": false,
          "rateCorrection": 1.000012,      // retime factor (1 = none) for the ref clock
          "correctedOffsetSeconds": 2.49   // start-anchored offset to use WITH the retime
        }
      ]
    }
  ]
}
```

**Offset convention.** `offsetSeconds` is where the member's first sample sits on the shared (reference) timeline: `group_time = member_local_time + offsetSeconds`. Positive ⇒ the member started **later** than the reference.

## 6. Status

Shipped end to end: the **sync tool** + manifest (VS-27); **sub-sample precision** + **GCC-PHAT** (`--feature phat`) (VS-32); **automatic group proposal** (`propose-groups`, VS-31); **drift correction** computed + emitted (`rateCorrection` / `correctedOffsetSeconds`, VS-30) **and applied on export** (VS-33); **angle switching → synced flat-timeline export** with a continuous master-audio track + FCPXML connected audio (VS-29); and a **true FCPXML `` multicam asset** (`export-multicam-fcpxml`, VS-33).

**Caveat:** the multicam FCPXML asset is generated to the documented FCPXML 1.10 schema but has not been round-trip-validated against a real Final Cut Pro import in this environment; that is a manual step (see the manual test plan). If FCP rejects any element, that is a bug to file.

## 7. Research findings + citations

The deep research compared `axcorrelate`, FFT cross-correlation (scipy/numpy), GCC-PHAT, and dedicated tools (PluralEyes, audalign, BBC `audio-offset-finder`), adversarially verifying each claim. Highlights that shaped this design:

- **FFT cross-correlation is the field-standard primitive** (`O(N log N)` via the convolution theorem) and a **pure-JS implementation on downsampled mono is viable**: the algorithm is identical to `axcorrelate` / `scipy`; ffmpeg is still needed for decode/downmix/resample and any retime. ([SciPy docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate.html), [Apple US Patent 8,621,355](https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/8621355), [GCC-PHAT](https://github.com/MinAungThu/GCC-PHAT))
- **Confidence gating is how every tool handles silent / non-overlapping audio**: none auto-solve it; a low normalized-peak / z-score falls back to manual. Concrete gates: normalized peak **>0.80 accept / <0.50 unreliable** (sync-offset-tool); BBC z-score **>10 / <5** needs manual check. We use the normalized-peak 0.80/0.50 gate. ([gmipf/sync-offset-tool](https://github.com/gmipf/sync-offset-tool), [BBC audio-offset-finder](https://pypi.org/project/audio-offset-finder/))
- **Seconds-based alignment sidesteps variable/non-integer fps**: surveyed tools do no frame-rate logic; the lag is a sample index ÷ sample rate. ([audalign](https://github.com/benfmiller/audalign/))
- **Drift over long takes needs more than one offset**: a linear `slope·t + intercept` fit with the **midpoint offset** makes the residual symmetric; a single midpoint offset suffices for angle-switching cuts, while tight lip-sync past ~30 min needs a retime. PluralEyes does end-to-end speed-matching ("drift corrected" output).
  ([Apple patent](https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/8621355), [PluralEyes/DesignTrek](http://www.designtrek.com/quickly-sync-audio-and-video-and-correct-drift-with-pluraleyes))
- **Envelope vs raw waveform**: an amplitude/log-energy envelope is more robust at low SNR (secondary-camera mics); this was the one split-vote claim (a PHAT-whitened raw signal is also low-SNR robust), so the feature is tunable (`--feature`) and defaults to envelope.
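
The correlation findings above can be illustrated with a small sketch: envelope conditioning (rectify, box-smooth, mean-remove), a normalized cross-correlation over a bounded lag range, and parabolic peak refinement. This is not the code in `tools/multicam.mjs`; the names `envelope` and `estimateOffset` and the 16-sample smoothing window are hypothetical. For brevity the correlation is a direct time-domain sum rather than the FFT form, which yields the same values at lower cost.

```javascript
// Illustrative sketch only: names and defaults are assumptions, not the tools/multicam.mjs API.

// Envelope conditioning: rectify, box-smooth over `smooth` samples, then mean-remove.
function envelope(samples, smooth = 16) {
  const n = samples.length;
  const out = new Float64Array(n);
  let acc = 0;
  for (let i = 0; i < n; i++) {
    acc += Math.abs(samples[i]);
    if (i >= smooth) acc -= Math.abs(samples[i - smooth]);
    out[i] = acc / Math.min(i + 1, smooth);
  }
  let mean = 0;
  for (let i = 0; i < n; i++) mean += out[i];
  mean /= n || 1;
  for (let i = 0; i < n; i++) out[i] -= mean;
  return out;
}

// Normalized cross-correlation of `mem` against `ref` for lags in [-maxLag, maxLag] samples.
// A positive lag means the member started later than the reference.
// Direct O(N * maxLag) sum for clarity; the FFT form computes the same values faster.
function estimateOffset(ref, mem, sampleRate, maxLag) {
  let eRef = 0;
  let eMem = 0;
  for (let i = 0; i < ref.length; i++) eRef += ref[i] * ref[i];
  for (let i = 0; i < mem.length; i++) eMem += mem[i] * mem[i];
  const norm = Math.sqrt(eRef * eMem) || 1;

  const corr = new Float64Array(2 * maxLag + 1);
  let best = 0;
  for (let lag = -maxLag; lag <= maxLag; lag++) {
    const start = Math.max(0, -lag);
    const end = Math.min(mem.length, ref.length - lag);
    let sum = 0;
    for (let i = start; i < end; i++) sum += ref[i + lag] * mem[i];
    const k = lag + maxLag;
    corr[k] = sum / norm;
    if (corr[k] > corr[best]) best = k;
  }

  // Parabolic interpolation through the peak and its two neighbours.
  let frac = 0;
  if (best > 0 && best < corr.length - 1) {
    const y0 = corr[best - 1];
    const y1 = corr[best];
    const y2 = corr[best + 1];
    const denom = y0 - 2 * y1 + y2;
    if (denom < 0) frac = (0.5 * (y0 - y2)) / denom;
  }

  return {
    offsetSeconds: (best - maxLag + frac) / sampleRate,
    confidence: Math.max(0, Math.min(1, corr[best])),
  };
}
```

Calling `estimateOffset(envelope(ref), envelope(mem), 8000, 8000 * 300)` would mirror the documented defaults (8000 Hz analysis rate, 300 s search), although at that size the direct sum is far too slow and the FFT form is the practical choice.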
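
The confidence gate from the findings above (and R-MCS3) reduces to a three-way threshold. A minimal sketch, assuming the documented 0.80 / 0.50 defaults; the function name `classifySync` and its options object are hypothetical, not the shipped API:

```javascript
// Illustrative sketch only: `classifySync` is a hypothetical name.
// Thresholds mirror the documented --accept (0.80) and --reject (0.50) defaults.
function classifySync(confidence, { accept = 0.8, reject = 0.5, manual = false } = {}) {
  if (manual) return "manual";              // user-supplied offset, never gated
  if (confidence >= accept) return "auto";  // trusted without review
  if (confidence < reject) return "unsynced"; // needs a manual offset
  return "review";                          // in between: show to the user
}
```

A confidence exactly at `accept` counts as `auto` and one exactly at `reject` counts as `review`, matching the `≥` / `<` wording in R-MCS3.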
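
The drift finding above, together with R-MCS6, amounts to fitting a line through offset probes and deriving the retime fields. The sketch below is a hedged illustration, not the shipped implementation: `fitDrift` and `timelineToSource` are hypothetical names, and `t` is assumed to be member-local time so that `group_time = t + offset(t)`, which is the reading under which `sourceIn = (timelineIn − correctedOffset) / rate` holds.

```javascript
// Illustrative sketch only: function names are hypothetical, and `t` is ASSUMED to be
// member-local time in seconds, so that group_time = t + offset(t).
const DRIFT_WARN_PPM = 100; // warning threshold quoted in the document

// Least-squares line offset(t) = slope * t + intercept through probe measurements.
function fitDrift(probes) {
  const n = probes.length;
  if (n < 2) throw new Error("need at least two probes");
  let meanT = 0;
  let meanO = 0;
  for (const p of probes) {
    meanT += p.t / n;
    meanO += p.offset / n;
  }
  let num = 0;
  let den = 0;
  for (const p of probes) {
    num += (p.t - meanT) * (p.offset - meanO);
    den += (p.t - meanT) ** 2;
  }
  if (den === 0) throw new Error("probes must be at distinct times");
  const slope = num / den;
  const driftPpm = slope * 1e6;
  return {
    driftPpm,
    driftWarning: Math.abs(driftPpm) > DRIFT_WARN_PPM,
    rateCorrection: 1 + slope,                      // factor to run the member on the reference clock
    correctedOffsetSeconds: meanO - slope * meanT,  // intercept: the start-anchored offset
  };
}

// Map a shared-timeline time to member-local source time under the fitted drift.
function timelineToSource(fit, timelineSeconds) {
  return (timelineSeconds - fit.correctedOffsetSeconds) / fit.rateCorrection;
}
```

With probes at 15 s and 1780 s measuring offsets of 2.49018 s and 2.51136 s, the fit reproduces the `cam-a` values in the manifest example: about 12 ppm of drift, `rateCorrection` near 1.000012, and a start-anchored offset of 2.49 s.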