---
name: video-use
description: Edit any video by conversation. Transcribe, cut, color grade, generate overlay animations, burn subtitles — for talking heads, montages, tutorials, travel, interviews. No presets, no menus. Ask questions, confirm the plan, execute, iterate, persist. Production-correctness rules are hard; everything else is artistic freedom.
---

# Video Use

## Principles

1. **LLM reasons from raw transcript + on-demand visuals.** The only derived artifact that earns its keep is a packed phrase-level transcript (`takes_packed.md`). Everything else — filler tagging, retake detection, shot classification, emphasis scoring — you derive at decision time.
2. **Audio is primary, visuals follow.** Cut candidates come from speech boundaries and silence gaps. Drill into visuals only at decision points.
3. **Ask → confirm → execute → iterate → persist.** Never touch the cut until the user has confirmed the strategy in plain English.
4. **Generalize.** Do not assume what kind of video this is. Look at the material, ask the user, then edit.
5. **Artistic freedom is the default.** Every specific value, preset, font, color, duration, pitch structure, and technique in this document is a *worked example* from one proven video — not a mandate. Read them to understand what's possible and why each worked. Then make your own taste calls based on what the material actually is and what the user actually wants. **The only things you MUST do are in the Hard Rules section below.** Everything else is yours.
6. **Invent freely.** If the material calls for a technique not described here — split-screen, picture-in-picture, lower-third identity cards, reaction cuts, freeze frames, crossfades, match cuts, L-cuts, J-cuts, speed ramps over a breath, whatever — build it. The helpers are ffmpeg and PIL; they can do anything the format supports. Do not wait for permission.
7. **Verify your own output before showing it to the user.** If you wouldn't ship it, don't present it.
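Principle 2 can be made concrete with a short sketch. This is a minimal, hypothetical illustration — the word-dict shape (`{"text", "start", "end"}` in seconds) and the `min_gap` threshold are assumptions for the example, not the actual Scribe schema or a mandated value:

```python
# Hypothetical sketch of principle 2: derive cut candidates purely from
# word-level timestamps. The word-dict shape is an assumption, not the
# real Scribe transcript schema.

def cut_candidates(words, min_gap=0.35):
    """Return (gap_start, gap_end) pairs where the silence between two
    consecutive words exceeds min_gap seconds — the natural places to
    consider a cut before drilling into visuals."""
    candidates = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] >= min_gap:
            candidates.append((prev["end"], nxt["start"]))
    return candidates

words = [
    {"text": "so",    "start": 0.00, "end": 0.18},
    {"text": "today", "start": 0.22, "end": 0.55},
    {"text": "we",    "start": 1.40, "end": 1.52},  # 0.85s pause before this word
]
print(cut_candidates(words))  # [(0.55, 1.4)]
```

Note that every candidate edge would still need snapping and padding per the hard rules below; this only finds the gaps.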
## Hard Rules (production correctness — non-negotiable)

These are the things where deviation produces silent failures or broken output. They are not taste, they are correctness. Memorize them.

1. **Subtitles are applied LAST in the filter chain**, after every overlay. Otherwise overlays hide captions. Silent failure.
2. **Per-segment extract → lossless `-c copy` concat**, not a single-pass filtergraph. Otherwise you double-encode every segment when overlays are added.
3. **30ms audio fades at every segment boundary** (`afade=t=in:st=0:d=0.03,afade=t=out:st={dur-0.03}:d=0.03`). Otherwise audible pops at every cut.
4. **Overlays use `setpts=PTS-STARTPTS+T/TB`** to shift the overlay's frame 0 to its window start. Otherwise you see the middle of the animation during the overlay window.
5. **Master SRT uses output-timeline offsets**: `output_time = word.start - segment_start + segment_offset`. Otherwise captions misalign after segment concat.
6. **Never cut inside a word.** Snap every cut edge to a word boundary from the Scribe transcript.
7. **Pad every cut edge.** Working window: 30–200ms. Scribe timestamps drift 50–100ms — padding absorbs the drift. Tighter for fast-paced, looser for cinematic.
8. **Word-level verbatim ASR only.** Never SRT/phrase mode (loses sub-second gap data). Never normalized fillers (loses editorial signal).
9. **Cache transcripts per source.** Never re-transcribe unless the source file itself changed.
10. **Parallel sub-agents for multiple animations.** Never sequential. Spawn N at once via the `Agent` tool; total wall time ≈ slowest one.
11. **Strategy confirmation before execution.** Never touch the cut until the user has approved the plain-English plan.
12. **All session outputs in `/edit/`.** Never write inside the `video-use/` project directory.

Everything else in this document is a worked example. Deviate whenever the material calls for it.

## Directory layout

The skill lives in `video-use/`. User footage lives wherever they put it.
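The offset formula in hard rule 5 can be sketched as a small function. This is an illustrative sketch only — the EDL segment shape (`(source_start, source_end)` tuples in seconds) is an assumption for the example, not the real `edl.json` schema:

```python
# Hypothetical sketch of hard rule 5: remap a word's source timestamp onto
# the output timeline after kept segments are concatenated. The segment
# tuple shape is an assumption, not the real edl.json schema.

def output_time(word_start, segments):
    """Return word_start remapped onto the concatenated output timeline,
    or None if the word was cut (falls outside every kept segment)."""
    segment_offset = 0.0  # where the current segment begins in the output
    for seg_start, seg_end in segments:
        if seg_start <= word_start < seg_end:
            return word_start - seg_start + segment_offset
        segment_offset += seg_end - seg_start
    return None

segments = [(10.0, 15.0), (40.0, 48.0)]  # two kept spans of the source
print(output_time(41.5, segments))  # 6.5  (41.5 - 40.0 + 5.0)
print(output_time(30.0, segments))  # None (word was cut)
```

Words that return None were removed by the cut and must not appear in `master.srt`; emitting them with source-timeline times is exactly the misalignment the rule guards against.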
All session outputs go into `/edit/`.

```
/
└── edit/
    ├── project.md          ← memory; appended every session
    ├── takes_packed.md     ← phrase-level transcripts, the LLM's primary reading view
    ├── edl.json            ← cut decisions
    ├── transcripts/.json   ← cached raw Scribe JSON
    ├── animations/slot_/   ← per-animation source + render + reasoning
    ├── clips_graded/       ← per-segment extracts with grade + fades
    ├── master.srt          ← output-timeline subtitles
    ├── downloads/          ← yt-dlp outputs
    ├── verify/             ← debug frames / timeline PNGs
    ├── preview.mp4
    └── final.mp4
```

## Setup

First-time install lives in `install.md` (clone, deps, ffmpeg, skill registration, API key). Don't re-run it every session; on cold start just verify:

- `ELEVENLABS_API_KEY` resolves — either in the environment or in `.env` at the video-use repo root. If missing, ask the user to paste one and write it to `.env` (never to the user's ``).
- `ffmpeg` + `ffprobe` on PATH.
- Python deps installed (`uv sync` or `pip install -e .` inside the repo).
- `yt-dlp`, `manim`, and Remotion installed only on first use.
- This skill vendors `skills/manim-video/`. Read its SKILL.md when building a Manim slot.

Helpers (`helpers/transcribe.py`, `helpers/render.py`, etc.) live alongside this SKILL.md. Resolve their paths relative to the directory containing this file — the skill is typically symlinked at `~/.claude/skills/video-use/` or `~/.codex/skills/video-use/`.

## Helpers

- **`transcribe.py