---
name: livekit-omnimodal
description: LiveKit omni-modal continuous coaching with stick-breaking color selection
version: 1.0.0
---

# LiveKit Omni-Modal Coaching

## Overview

Real-time multi-modal coaching via LiveKit with:

- **Continuous listening**: Always-on voice input from participants
- **Continuous coaching**: Persistent guidance via "The Queen" voice persona
- **Stick-breaking modality selection**: Poisson-Dirichlet weights determine which modality gets attention
- **Dynamic sufficiency gating**: The ε-machine blocks any action whose required skills are not verified
- **Symbolic expression output**: All observations become s-expressions for categorical processing

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    OMNI-MODAL LIVEKIT COACHING SYSTEM                       │
└─────────────────────────────────────────────────────────────────────────────┘

                            ┌─────────────────────┐
                            │    LiveKit Room     │
                            │    (WebRTC SFU)     │
                            └──────────┬──────────┘
                                       │
             ┌─────────────────────────┼─────────────────────────┐
             ▼                         ▼                         ▼
    ┌─────────────────┐     ┌─────────────────────┐     ┌─────────────────┐
    │  Audio Stream   │     │    Video Stream     │     │   Data Track    │
    │  (continuous)   │     │    (screenshare)    │     │   (CRDT sync)   │
    └────────┬────────┘     └──────────┬──────────┘     └────────┬────────┘
             │                         │                         │
             ▼                         ▼                         ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                      STICK-BREAKING MODALITY SELECTOR                       │
│                                                                             │
│   ├────────────────┤←────────────┤←────────────────────────────────────────┤│
│       w₁ = 0.45       w₂ = 0.30                   w₃ = 0.25                 │
│        (audio)         (video)                      (data)                  │
│        SELECTED        fallback                    fallback                 │
│                                                                             │
│  Max fraction color: #E12A4E (audio segment wins)                           │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          DYNAMIC SUFFICIENCY GATE                           │
│                                                                             │
│  Task: "process audio for coaching feedback"                                │
│  Causal State: (domain=audio, operation=transcribe, tools=(whisper,))       │
│                                                                             │
│  Required Skills:                                                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                       │
│  │ say-narration│  │  signal-msg  │  │  whitehole   │                       │
│  │     (-1)     │  │     (0)      │  │     (+1)     │                       │
│  └──────────────┘  └──────────────┘  └──────────────┘                       │
│                                                                             │
│  ε-Machine: PROCEED (coverage=1.0, missing=0)                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                          THE QUEEN'S VOICE OUTPUT                           │
│                                                                             │
│  Voice: Serena (Premium) - English UK - "Bertha Swirles" persona            │
│  Trit: Computed from stick-breaking max-fraction color                      │
│                                                                             │
│  Output: S-expression for categorical processing:                           │
│                                                                             │
│  (coaching-event                                                            │
│    :timestamp 1735689600                                                    │
│    :modality :audio                                                         │
│    :weight 0.45                                                             │
│    :color "#E12A4E"                                                         │
│    :trit +1                                                                 │
│    :observation "participant mentioned confusion about types"               │
│    :guidance "Consider explaining the relationship...")                     │
└─────────────────────────────────────────────────────────────────────────────┘
```

## GF(3) Conservation

```
say-narration (-1) ⊗ livekit-omnimodal (0) ⊗ whitehole-audio (+1) = 0 ✓
```

| Role | Skill | Function |
|------|-------|----------|
| **MINUS** (-1) | say-narration | Queen voice output (constraints) |
| **ERGODIC** (0) | livekit-omnimodal | **THIS SKILL** - coordinates modalities |
| **PLUS** (+1) | whitehole-audio | Audio loopback routing (generation) |

## The Queen's Voice

"The Queen" is **Serena (Premium)**, a British English voice representing Bertha Swirles (quantum physicist). She provides continuous coaching guidance with gravitas and clarity.

```bash
# The Queen speaks
say -v "Serena (Premium)" "I observe you're struggling with the type system. Consider that types are propositions and programs are proofs."
```

## Stick-Breaking Modality Selection

Each incoming modality stream is assigned a segment weight drawn from a Poisson-Dirichlet stick-breaking process; the modality holding the largest segment receives attention:

```julia
using Gay.WorldStickBreaking

# Each modality gets a stick segment
modalities = [:audio, :video, :screenshare, :data, :chat]
pd = world_stick_breaking(alpha=1.0, n_segments=length(modalities), seed=session_seed)

# Select dominant modality
selected = world_max_fraction_color(pd)
# => (color="#E12A4E", weight=0.45, index=1, trit=1)

# Audio wins with 45% of attention weight
dominant_modality = modalities[selected.index]  # :audio
```

## Dynamic Sufficiency Integration

Before any coaching action, verify skill coverage via the ε-machine:

```python
from sufficiency import EpsilonMachine, Action, CoverageResult

# Create action representing coaching intent
action = Action(
    operation="coach",
    domain="audio",
    language="natural",
    tool="whisper"
)

# Check sufficiency.
# loaded_skills, coaching_event, emit_sexp and load_skill are expected to be
# provided by the surrounding session; they are not defined in this snippet.
epsilon_machine = EpsilonMachine()
state = epsilon_machine.infer_state(action)
coverage: CoverageResult = epsilon_machine.check_coverage(action, loaded_skills)

if coverage.is_sufficient:
    # Proceed with coaching
    emit_sexp(coaching_event)
else:
    # Load missing skills first
    for skill in coverage.missing:
        load_skill(skill)
```

## S-Expression Output Format

All observations and coaching events are emitted as s-expressions for categorical processing:

```lisp
;; Coaching event structure
(coaching-event
  :id "CE-2026-01-01-001"
  :timestamp 1735689600
  :session-id "room-xyz"

  ;; Modality selection (from stick-breaking)
  :modality :audio
  :weight 0.45
  :color "#E12A4E"
  :trit +1

  ;; Dynamic sufficiency result
  :causal-state (audio transcribe (whisper))
  :coverage 1.0
  :sufficient t

  ;; Observation from modality
  :observation "participant expressed confusion about monads"
  :observation-embedding #

  ;; Queen's guidance
  :guidance "A monad is simply a monoid in the category of endofunctors."
  :voice "Serena (Premium)"
  :confidence 0.92)

;; GF(3) conservation record
(gf3-triplet
  :minus (say-narration -1)
  :ergodic (livekit-omnimodal 0)
  :plus (whitehole-audio +1)
  :sum 0
  :conserved t)
```

## Required Skills (Dependency Analysis)

### Currently Have ✓

| Skill | Trit | Status |
|-------|------|--------|
| say-narration | -1 | ✓ Installed |
| whitehole-audio | +1 | ✓ Installed |
| dynamic-sufficiency | -1 | ✓ Installed |
| gay-mcp | +1 | ✓ Installed |
| signal-messaging | 0 | ✓ Installed |

### Skills to Acquire ✗

| Skill | Trit | Purpose | Priority |
|-------|------|---------|----------|
| **whisper-transcribe** | 0 | Real-time audio→text | HIGH |
| **livekit-spectral** | +1 | WebRTC + spectral gap walks | HIGH |
| **vision-llm** | 0 | Screenshare understanding | MEDIUM |
| **crdt-livekit** | -1 | Data track synchronization | MEDIUM |
| **prosody-analyzer** | +1 | Voice emotion/tone analysis | LOW |

### Skill Gap S-Expression

```lisp
(skill-gap-analysis
  :task "livekit-omnimodal-coaching"
  :have (say-narration whitehole-audio dynamic-sufficiency gay-mcp signal-messaging)
  :need (whisper-transcribe livekit-spectral vision-llm crdt-livekit)
  :coverage 0.55
  :sufficient nil
  :action :load-skills
  :priority-order (whisper-transcribe livekit-spectral vision-llm crdt-livekit prosody-analyzer))
```

## Usage

### Start Coaching Session

```python
import asyncio
import os

from livekit import rtc
from livekit_omnimodal import OmnimodalCoach

# Connection settings; the environment variable names are assumptions
LIVEKIT_URL = os.environ["LIVEKIT_URL"]
token = os.environ["LIVEKIT_TOKEN"]

async def main():
    # Connect to LiveKit room
    room = rtc.Room()
    await room.connect(LIVEKIT_URL, token)

    # Initialize coach with Queen voice
    coach = OmnimodalCoach(
        room=room,
        voice="Serena (Premium)",
        stick_alpha=1.0,  # Poisson-Dirichlet concentration
        sufficiency_threshold=0.8
    )

    # Start continuous listening + coaching
    await coach.start()

    # Coach emits s-expressions for each observation
    async for sexp in coach.events():
        print(sexp)
        # (coaching-event :modality :audio :guidance "...")

asyncio.run(main())
```

### Emit S-Expression

```python
# Assumption: CoachingEvent is exported by this skill's own module
from livekit_omnimodal import CoachingEvent


def _quote(text: str) -> str:
    """Escape backslashes and double quotes for an s-expression string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def emit_coaching_sexp(event: CoachingEvent) -> str:
    """Convert coaching event to s-expression."""
    return f"""(coaching-event
  :id {_quote(event.id)}
  :timestamp {event.timestamp}
  :modality :{event.modality}
  :weight {event.weight:.3f}
  :color {_quote(event.color)}
  :trit {event.trit:+d}
  :observation {_quote(event.observation)}
  :guidance {_quote(event.guidance)}
  :voice {_quote(event.voice)})"""
```

## Local Sand / Italian Woman Mystery

The "Italian woman always talking about local sand" is **Emma (Enhanced)** or **Federica (Enhanced)** from the say-narration skill. These are Italian voices reading English text, used for agent announcements.

**Resolution**: The Queen, Serena (Premium), replaces the Italian voices for coaching output.

To stop Italian voices:

```bash
# Kill any running say processes (exact name match only)
pkill -x say

# Check which skills trigger announcements
grep -r "say -v" ~/.claude/skills/*/SKILL.md
```

The phrase "local sand" may be a mishearing of "locale" + "sans" (French) from multi-locale voice announcements.

## Files

| File | Purpose |
|------|---------|
| `livekit_omnimodal.py` | Main coaching implementation |
| `stick_modality.py` | Modality selection via stick-breaking |
| `sexp_emitter.py` | S-expression output |
| `queen_voice.py` | Voice persona configuration |

## Related Skills

- `say-narration` - Voice output personas
- `whitehole-audio` - Audio routing
- `dynamic-sufficiency` - ε-machine gating
- `gay-mcp` - Color generation
- `iroh-p2p` - P2P data sync
- `signal-messaging` - Fallback messaging
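
## Stick-Breaking Sketch (Python)

The internals of `Gay.WorldStickBreaking` are not shown in this skill, so what follows is only a minimal sketch of how the stick-breaking weights from the Stick-Breaking Modality Selection section could be drawn and the dominant modality picked. It assumes the standard GEM construction (each break fraction drawn from Beta(1, α), with the last segment taking whatever stick remains). The names `stick_breaking_weights`, `select_modality`, and `Selection` are hypothetical, only the Python standard library is used, and the color and trit fields are omitted.

```python
import random
from typing import List, NamedTuple, Sequence


class Selection(NamedTuple):
    """Hypothetical result type: the winning modality and its stick segment."""
    modality: str
    weight: float
    index: int  # 0-based, unlike the 1-based Julia example


def stick_breaking_weights(n_segments: int, alpha: float, seed: int) -> List[float]:
    """Break a unit stick into n_segments pieces using Beta(1, alpha) fractions."""
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    rng = random.Random(seed)  # seeded so a session reproduces its weights
    remaining = 1.0
    weights = []
    for _ in range(n_segments - 1):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # truncation: last segment takes the remainder
    return weights


def select_modality(modalities: Sequence[str], alpha: float = 1.0, seed: int = 0) -> Selection:
    """Pick the modality that holds the largest stick segment."""
    weights = stick_breaking_weights(len(modalities), alpha, seed)
    index = max(range(len(weights)), key=weights.__getitem__)
    return Selection(modalities[index], weights[index], index)


if __name__ == "__main__":
    streams = ["audio", "video", "screenshare", "data", "chat"]
    print(select_modality(streams, alpha=1.0, seed=42))
```

With this construction a smaller `alpha` tends to concentrate weight in the first few segments, while a larger `alpha` spreads it more evenly, which is one way to read the `stick_alpha` parameter passed to `OmnimodalCoach`.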