---
name: podcast-edit
description: Edit podcast audio — trim pre/post-show chat, remove filler words, cut silences, and enhance audio quality. Use when the user asks to edit a podcast, clean up audio, remove fillers, trim a recording, or improve voice quality.
user_invocable: true
---

# Podcast Edit Skill

Process raw podcast/meeting recordings into polished podcast episodes.

## Capabilities

1. **Smart trimming** — Find where the actual podcast starts/ends by transcribing and detecting intros/outros
2. **Filler word removal** — Remove verbal tics: 嗯, 呃, 啊, 哦, 对对对, um, uh, etc.
3. **Silence trimming** — Cut long dead air (>2s) down to natural pauses (~0.6s)
4. **Audio enhancement** — Noise reduction, EQ, multi-speaker volume balancing, loudness normalization to podcast standard (−16 LUFS)

## Prerequisites

- `ffmpeg` and `ffprobe` installed
- `OPENAI_API_KEY` in environment (for Whisper API transcription)
- Python 3 with stdlib only (no extra deps for the helper script)

## Workflow

### Step 1: Inspect the audio file

```bash
ffprobe -v quiet -print_format json -show_format -show_streams "INPUT_FILE"
```

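Note the duration, sample rate, channel count, codec, and bitrate. The duration drives the chunking in Steps 2 and 3, and the sample rate decides the lowpass cutoff in Step 4.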
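As a minimal sketch, the fields above can be pulled out of the ffprobe JSON in Python. The function name `summarize_probe` and the sample values are illustrative assumptions; the key names follow ffprobe's JSON output, where `duration`, `bit_rate`, and `sample_rate` arrive as strings.

```python
def summarize_probe(probe: dict) -> dict:
    """Pick out the fields Step 1 asks for from parsed ffprobe JSON."""
    # Use the first audio stream; video or data streams are ignored.
    audio = next(s for s in probe["streams"] if s.get("codec_type") == "audio")
    fmt = probe["format"]
    return {
        "duration_s": float(fmt["duration"]),
        "sample_rate_hz": int(audio["sample_rate"]),
        "channels": audio["channels"],
        "codec": audio["codec_name"],
        "bitrate_bps": int(fmt["bit_rate"]),
    }


# Illustrative sample, trimmed to the keys used above (values are made up).
sample_probe = {
    "streams": [
        {"codec_type": "audio", "codec_name": "mp3",
         "sample_rate": "44100", "channels": 2}
    ],
    "format": {"duration": "3725.400000", "bit_rate": "128000"},
}

summary = summarize_probe(sample_probe)
print(summary)
```

In practice the dict would come from `json.loads` applied to the output of the `ffprobe` command above.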

### Step 2: Find podcast start/end (if user says to trim front/back)

Split into 5-minute chunks and transcribe via OpenAI Whisper API with segment-level timestamps:

```bash
# Extract chunk
ffmpeg -y -i "INPUT_FILE" -ss OFFSET -t 300 -ar 16000 -ac 1 /tmp/chunk_OFFSET.mp3

# Transcribe
curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@/tmp/chunk_OFFSET.mp3" \
  -F model="whisper-1" \
  -F response_format="verbose_json" \
  -F language="LANG" \
  -F 'timestamp_granularities[]=segment' > /tmp/transcript_OFFSET.json
```

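Segment timestamps are relative to the start of each chunk, so add the chunk's `OFFSET` to get positions in the original file. Scan the transcripts for: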
- **Start markers**: "welcome", "hello everyone", "大家好", "欢迎", intro music, first substantive topic sentence
- **End markers**: "see you next time", "bye", "下期见", "感谢收听", followed by post-show chat

Do an initial trim with `-ss START -to END` and `-c copy` (no re-encode) to create a working file.
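The marker scan can be sketched in Python as follows. This is a rough first pass under stated assumptions: the transcripts are the `verbose_json` responses keyed by chunk offset, each with a `segments` list carrying `start`, `end`, and `text`. The function name and marker tuples are illustrative, and the result is only a candidate pair to confirm by reading the transcript around it.

```python
START_MARKERS = ("welcome", "hello everyone", "大家好", "欢迎")
END_MARKERS = ("see you next time", "bye", "下期见", "感谢收听")


def find_markers(transcripts):
    """transcripts: {chunk_offset_seconds: verbose_json dict}.

    Returns (start, end) candidates in seconds from the start of the
    original file; either may be None if no marker was found.
    """
    start = end = None
    for offset in sorted(transcripts):
        for seg in transcripts[offset].get("segments", []):
            text = seg["text"].lower()
            # First start marker wins: the show begins where the intro begins.
            if start is None and any(m in text for m in START_MARKERS):
                start = offset + seg["start"]
            # Last end marker wins: sign-offs can be repeated.
            if any(m in text for m in END_MARKERS):
                end = offset + seg["end"]
    return start, end


# Illustrative transcripts for two chunks (offsets 0 and 300 seconds).
demo = {
    0: {"segments": [
        {"start": 5.0, "end": 9.0, "text": "is the mic on?"},
        {"start": 42.5, "end": 47.0, "text": "大家好，欢迎收听"},
    ]},
    300: {"segments": [
        {"start": 120.0, "end": 124.5, "text": "Thanks, see you next time!"},
    ]},
}
print(find_markers(demo))
```

Plain substring matching will produce false positives (for example "welcome" in the middle of the show), which is why the output should be treated as a hint rather than a final cut point.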

### Step 3: Remove filler words

Split the trimmed file into 5-minute chunks and transcribe each with **word-level timestamps**:

```bash
# Extract chunks (if DURATION is an exact multiple of 300, the final offset produces an empty chunk; skip it)
for i in $(seq 0 300 DURATION); do
  ffmpeg -y -i "TRIMMED_FILE" -ss $i -t 300 -ar 16000 -ac 1 /tmp/wchunk_${i}.mp3
done

# Transcribe each chunk (can run in parallel)
for i in $(seq 0 300 DURATION); do
  curl -s https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file="@/tmp/wchunk_${i}.mp3" \
    -F model="whisper-1" \
    -F response_format="verbose_json" \
    -F language="LANG" \
    -F 'timestamp_granularities[]=word' \
    -F 'timestamp_granularities[]=segment' > /tmp/wtranscript_${i}.json &
done
wait
```
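To show what the word-level transcripts provide, here is a minimal Python sketch that turns one chunk's `words` list into absolute filler ranges. It is not the shipped `filler_removal.py`, whose internals are not shown here; the function name and the filler set (taken from the capability list) are assumptions.

```python
FILLERS = {"嗯", "呃", "啊", "哦", "对对对", "um", "uh"}
PUNCTUATION = ",.，。!?！？"


def filler_ranges(words, chunk_offset):
    """Return (start, end) ranges, in seconds from the start of the trimmed
    file, for words that are standalone fillers."""
    ranges = []
    for w in words:
        token = w["word"].strip().strip(PUNCTUATION).lower()
        if token in FILLERS:
            # Word timestamps are chunk-relative, so shift by the offset.
            ranges.append((chunk_offset + w["start"], chunk_offset + w["end"]))
    return ranges


# Illustrative `words` list from the chunk that starts at 300 s.
demo_words = [
    {"word": " Um,", "start": 1.25, "end": 1.5},
    {"word": "hello", "start": 1.5, "end": 2.0},
    {"word": "嗯", "start": 2.0, "end": 2.5},
]
print(filler_ranges(demo_words, 300))
```

Particles such as 啊 and 哦 are sometimes meaningful sentence endings in Chinese, so an exact-match rule like this one can over-cut; treat the ranges as candidates.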

Then run the filler removal script that ships with this skill:

```bash
python3 ./filler_removal.py \
  --total-duration DURATION \
  --end-at END_TIMESTAMP \
  --cut START1:END1 --cut START2:END2 \
  --chunk-offsets 0,300,600,900,...
```

**Arguments:**
- `--total-duration`: Duration of the trimmed input file in seconds (required)
- `--end-at`: Cut everything after this timestamp (e.g., post-show chat start)
- `--cut START:END`: Cut a specific range. Can be repeated.
- `--chunk-offsets`: Comma-separated chunk offsets (default: auto 0,300,600,…)

The script writes an `atrim` + `concat` filter graph to `/tmp/ffmpeg_filter.txt`. Its final output pad is labeled `[out]`, which is what the next command maps.
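For orientation, the shape of such a filter graph can be sketched in Python. This is an illustration of the `atrim` + `concat` technique, not the output of the shipped script, which may differ in detail; the function names and the `[a0]`, `[a1]` labels are assumptions, while `[out]` matches the `-map '[out]'` used when the filter is applied.

```python
def keep_ranges(cuts, total):
    """Invert cut ranges into the ranges to keep, within [0, total]."""
    keeps, cursor = [], 0.0
    for start, end in sorted(cuts):
        if start > cursor:
            keeps.append((cursor, start))
        cursor = max(cursor, end)  # also merges overlapping cuts
    if cursor < total:
        keeps.append((cursor, total))
    return keeps


def build_filter(keeps):
    """Build an atrim+concat filter graph; assumes at least one keep range."""
    parts = [
        f"[0:a]atrim=start={s:.3f}:end={e:.3f},asetpts=PTS-STARTPTS[a{i}]"
        for i, (s, e) in enumerate(keeps)
    ]
    labels = "".join(f"[a{i}]" for i in range(len(keeps)))
    parts.append(f"{labels}concat=n={len(keeps)}:v=0:a=1[out]")
    return ";\n".join(parts)


keeps = keep_ranges([(10, 12), (5, 6)], total=20)
print(build_filter(keeps))
```

Each kept range becomes one `atrim` branch whose timestamps are reset with `asetpts`, and `concat` joins the branches back into a single audio stream.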

Apply the filter in two passes:

```bash
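# Step A: Cut fillers → intermediate WAV (lossless, so the audio is lossy-encoded only once, in Step B)
# Newer FFmpeg releases flag -filter_complex_script as deprecated; there, use: -/filter_complex /tmp/ffmpeg_filter.txt
ffmpeg -y -i "TRIMMED_FILE" \
  -filter_complex_script /tmp/ffmpeg_filter.txt \
  -map '[out]' -c:a pcm_s16le -ar 44100 /tmp/podcast_cut.wav

# Step B: Enhance audio → final MP3
# loudnorm upsamples to 192 kHz internally in its default mode, so set the output rate explicitly with -ar
ffmpeg -y -i /tmp/podcast_cut.wav \
  -af "ENHANCEMENT_CHAIN" \
  -ar 44100 -c:a libmp3lame -b:a 192k "OUTPUT_FILE"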
```

**Limitations:** Whisper word-level timestamps for Chinese can miss fillers that are blended into adjacent speech. The script catches standalone fillers reliably but may miss ~10–20% of embedded ones.
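Silence trimming (capability 3) can reuse the same `--cut` mechanism. The sketch below covers only the arithmetic: it assumes silence intervals have already been detected by other means (for example ffmpeg's `silencedetect` filter) and arrive as `(start, end)` pairs in seconds. The function name and the choice to keep half of the remaining pause on each side are assumptions, and whether the helper script accepts fractional seconds in `--cut` is not stated here.

```python
def silence_cuts(silences, max_silence=2.0, keep=0.6):
    """Return (start, end) ranges to cut so that every silence longer than
    `max_silence` seconds shrinks to about `keep` seconds."""
    pad = keep / 2  # leave half of the natural pause on each side
    cuts = []
    for start, end in silences:
        if end - start > max_silence:
            cuts.append((round(start + pad, 3), round(end - pad, 3)))
    return cuts


# Illustrative input: a 1 s pause (left alone) and a 5 s gap (shortened).
demo_cuts = silence_cuts([(10.0, 11.0), (30.0, 35.0)])
print(" ".join(f"--cut {s}:{e}" for s, e in demo_cuts))
```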

### Step 4: Audio enhancement filter chain

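**Default chain (guest-friendly: handles multi-speaker volume imbalance).** The biggest mistake in past runs was adding a noise gate (`agate`), which silenced the quieter guest entirely. Never add `agate` back to the default chain. The listing below is annotated for reading; to use it as `ENHANCEMENT_CHAIN`, strip the `#` comments and line breaks so that `-af` receives one comma-separated string.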

```
highpass=f=80,                                    # Remove room rumble
lowpass=f=12000,                                  # Remove hiss (use 7500 for 16kHz sources)
afftdn=nf=-25:nr=8:nt=w,                         # Gentle FFT noise reduction
equalizer=f=180:t=q:w=1.5:g=-2,                  # Cut mud
equalizer=f=2500:t=q:w=1.2:g=3,                  # Boost presence
equalizer=f=4500:t=q:w=1.5:g=1.5,                # Boost clarity
dynaudnorm=f=200:g=5:p=0.95:m=5:s=0,             # Rolling-window normalization — lifts the quieter speaker independently
acompressor=threshold=-20dB:ratio=2:attack=5:release=200:makeup=1,  # Gentle glue
loudnorm=I=-16:TP=-1.5:LRA=13                    # Podcast standard loudness
```

**Why `dynaudnorm` is the star:** it measures the audio in 200 ms frames (`f=200`) and smooths the gain over a 5-frame Gaussian window (`g=5`), so the gain adapts on a roughly one-second scale. When the guest is speaking, those frames are lifted largely independently of the host's louder frames. Order matters: run `dynaudnorm` BEFORE `acompressor` so the compressor sees a balanced signal.

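**Never add these to the default chain:**
- `agate` (noise gate): cuts off any speaker quieter than the threshold and kills the guest.
- Heavy compression (ratio above 3:1, or more than about 2 dB of makeup gain): flattens dynamics and makes the guest sound pumped. Note that `acompressor`'s `makeup` is a linear gain factor unless written with a `dB` suffix, so `makeup=1` means no added gain.
- Narrow LRA (<12) in `loudnorm`: crushes natural speech dynamics.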

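**Adjust lowpass based on source sample rate:** a 16 kHz source contains nothing above its 8 kHz Nyquist limit, so a 12 kHz cutoff would do nothing there.
- 16 kHz source → `lowpass=f=7500`
- 44.1 kHz+ source → `lowpass=f=12000` (or skip)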
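One way to turn the annotated chain into the single string that `-af` expects is a small Python helper like the sketch below. The function name is hypothetical. The guidance above covers only 16 kHz and 44.1 kHz+ sources; treating everything above 16 kHz like 44.1 kHz is an assumption of this sketch.

```python
def enhancement_chain(source_rate_hz: int) -> str:
    """Assemble the default chain as one comma-separated -af string."""
    # Assumption: rates between 16 kHz and 44.1 kHz are handled like 44.1 kHz.
    lowpass_hz = 7500 if source_rate_hz <= 16000 else 12000
    filters = [
        "highpass=f=80",
        f"lowpass=f={lowpass_hz}",
        "afftdn=nf=-25:nr=8:nt=w",
        "equalizer=f=180:t=q:w=1.5:g=-2",
        "equalizer=f=2500:t=q:w=1.2:g=3",
        "equalizer=f=4500:t=q:w=1.5:g=1.5",
        # dynaudnorm must come before acompressor (see the ordering note above).
        "dynaudnorm=f=200:g=5:p=0.95:m=5:s=0",
        "acompressor=threshold=-20dB:ratio=2:attack=5:release=200:makeup=1",
        "loudnorm=I=-16:TP=-1.5:LRA=13",
    ]
    return ",".join(filters)


print(enhancement_chain(16000))
```

The returned string can be substituted for `ENHANCEMENT_CHAIN` in Step B.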

**Verify loudness and guest audibility after rendering:** run `ffmpeg -i OUTPUT -af "ebur128=peak=true" -f null -` and read the summary at the end of the log. `I:` should be near −16 LUFS and `LRA:` around 4–6 LU (a tighter LRA than the `loudnorm` target is fine because `dynaudnorm` did per-window balancing first). Then listen to a passage where the guest speaks; if the guest sounds cut off, suspect that a gate or aggressive compressor crept back in.
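The loudness check can be automated with a small parser, sketched here in Python. It assumes the log text has already been captured from ffmpeg's stderr and that the summary uses the `I: … LUFS` and `LRA: … LU` layout shown in the sample. The function names, the sample log, and the ±1 LU tolerance for "near −16" are assumptions.

```python
import re


def parse_ebur128_summary(log: str) -> dict:
    """Extract integrated loudness (I) and loudness range (LRA) from the
    final ebur128 summary in an ffmpeg log."""
    idx = log.rfind("Summary:")
    if idx == -1:
        raise ValueError("no ebur128 summary found in log")
    summary = log[idx:]  # skip the per-frame lines, which also contain I: and LRA:
    i = re.search(r"\bI:\s*(-?\d+(?:\.\d+)?) LUFS", summary)
    lra = re.search(r"\bLRA:\s*(-?\d+(?:\.\d+)?) LU\b", summary)
    if i is None or lra is None:
        raise ValueError("summary is missing I or LRA")
    return {"I": float(i.group(1)), "LRA": float(lra.group(1))}


def loudness_ok(stats: dict, target=-16.0, tolerance=1.0) -> bool:
    """True if I is within `tolerance` of the target and LRA is 4 to 6 LU."""
    return abs(stats["I"] - target) <= tolerance and 4.0 <= stats["LRA"] <= 6.0


# Illustrative log excerpt (values are made up).
SAMPLE_LOG = """\
[Parsed_ebur128_0 @ 0x0] t: 0.5  M: -18.2 S:-120.7  I: -18.2 LUFS  LRA: 0.0 LU
[Parsed_ebur128_0 @ 0x0] Summary:

  Integrated loudness:
    I:         -16.2 LUFS
    Threshold: -26.4 LUFS

  Loudness range:
    LRA:         5.1 LU
    Threshold: -36.4 LUFS
    LRA low:   -19.3 LUFS
    LRA high:  -14.2 LUFS
"""

stats = parse_ebur128_summary(SAMPLE_LOG)
print(stats, loudness_ok(stats))
```

Passing this check says nothing about whether the guest is audible, so the listening pass is still needed.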

### Step 5: Verify output

```bash
ls -lh "OUTPUT_FILE"
ffprobe -v quiet -show_entries format=duration -of csv=p=0 "OUTPUT_FILE"
```

Report: duration, file size, what was removed (filler count, silence count, time saved).

## Output conventions

- Format: MP3, 192 kbps, mono (unless source is stereo with separate speakers per channel)
- Loudness: −16 LUFS (podcast standard)
- Always two-pass: cut to WAV first, then enhance to MP3

## Show notes — bilingual writing (if applicable)

If the host is producing bilingual Chinese/English show notes, **the Chinese section must be written in actual Chinese** — not Chinese grammar with English verbs and nouns sprinkled in. Code-switching like "close 了一个 deal", "build 出来的 agent", or "PR 不是 buy 来的" reads like a draft and is the #1 mistake to avoid.

### Translation rules

Translate these common startup/tech English loanwords into Chinese:

- close deal → 拿下订单 / 成交 / 签下
- build (a product) → 搭建 / 做出 / 打造
- integration → 集成
- view (video/page views) → 播放 / 浏览
- stack (tech stack) → 体系 / 技术栈
- category leader → 品类领导者
- front-end / front end (product sense) → 外壳 / 前端
- success story → 客户案例 / 成功故事
- SMB → 中小企业
- Enterprise (segment) → 大型企业 / 企业级
- aha moment → 顿悟时刻
- onboarding → 上手 / 入门
- retention → 留存
- churn → 流失
- pipeline → 销售漏斗 / 业务线

### What to KEEP in English inside Chinese text

- **Brand and product names** — company / product / person names stay as-is
- **Very common startup acronyms** — CEO, CTO, CMO, PMF, ARR, MRR, PR, AI, AI Agent, SaaS, API
- **Currency with numeric prefix** — `$20K`, `$200K`, or `200 美金` (either form is fine when paired with a number)

### Before finalizing

Re-read the Chinese section as a Chinese reader. If any sentence feels half-translated (for example, it contains "build", "close", "deal", "view", "stack", or "leader" as standalone English words), rewrite those words in Chinese. The only English that should survive a re-read is the names, acronyms, and currency amounts listed above.
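A mechanical pre-check can catch the obvious cases before the human re-read. The Python sketch below lists English words left in a Chinese passage after blanking out the allowed acronyms, any confirmed brand names passed in by the caller, and dollar amounts. The function name, the currency pattern, and the decision to flag a standalone "agent" (only "AI Agent" is on the keep list) are assumptions, and the result is a list of candidates for review, not a verdict.

```python
import re

ALLOWED_TERMS = ["AI Agent", "CEO", "CTO", "CMO", "PMF", "ARR", "MRR",
                 "PR", "AI", "SaaS", "API"]
CURRENCY = re.compile(r"\$\d[\d,.]*[KMB]?")  # e.g. $20K, $200K


def leftover_english(text, brand_names=()):
    """Return English words in `text` that are not on the keep list."""
    text = CURRENCY.sub(" ", text)
    # Blank out allowed terms, longest first so "AI Agent" wins over "AI".
    for term in sorted([*ALLOWED_TERMS, *brand_names], key=len, reverse=True):
        pattern = rf"(?<![A-Za-z]){re.escape(term)}(?![A-Za-z])"
        text = re.sub(pattern, " ", text)
    return re.findall(r"[A-Za-z][A-Za-z'-]*", text)


print(leftover_english("我们 close 了一个 $20K 的 deal，用 AI Agent 做 PR"))
print(leftover_english("Acme 的 agent 已经上线", brand_names=["Acme"]))
```

Matching is case-sensitive on purpose, so a lowercase "pr" or "api" is still flagged for a second look.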

## Name verification (CRITICAL)

Whisper frequently mangles company names, product names, and personal names. Before generating show notes or any output that includes names and links:

1. **After transcription, extract all proper nouns** — company names, product names, personal names, URLs mentioned.
2. **Ask the user to confirm/correct them** — Whisper hears similar-sounding but wrong tokens for brand names.
3. **Never guess URLs from transcribed names** — a name that sounds like "Acme" could be `acme.com`, `acmehq.com`, or something else entirely. Always ask.
4. **Use confirmed names consistently** in show notes, titles, episode metadata, and all outputs.

This is especially important when generating backlinks or social posts — a misspelled domain is a wasted link.

## Show notes structure (recommended)

Two separate sections — Chinese first, then English (or whichever languages the show targets). Do NOT interleave or put them side-by-side.

**Heading rule:** only use H2 (`##`). Avoid H3 or deeper — flatten all sub-sections to H2.

**Timestamp format:** always `MM:SS` with leading zeros (e.g., `08:25`, `00:00`, `42:10`). Never `0:00` or `1:05`.

```markdown
EP{NNN}: {Episode title}

---

## 中文

**嘉宾：** {中文姓名 English Name}, {中文职位} {公司} (URL)

## 简介
{完整中文段落}

## 时间轴
- 00:00 — {中文描述}
- 08:25 — {中文描述}

## 核心要点
- {中文要点}

## 相关链接
- {品牌名}：{URL}

---

## English

**Guest:** {English Name}, {Title} at {Company} (URL)

## Summary
{Full English paragraph}

## Timestamps
- 00:00 — {English description}
- 08:25 — {English description}

## Key Takeaways
- {English takeaway}

## Links
- {Brand}: {URL}
```

**Why two sections instead of bilingual bullets:** Chinese readers want clean Chinese prose, English readers want clean English prose. Alternating "中文 / English" on every bullet makes both halves harder to read. Write each section as if it were the only one.
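The `MM:SS` timestamp rule is easy to get wrong by hand, so here is a minimal Python formatter. The function name is hypothetical. The rule above does not say how to write times past 59:59; this sketch simply lets the minutes run past 59, which is an assumption.

```python
def mmss(seconds: float) -> str:
    """Format a position in seconds as zero-padded MM:SS."""
    minutes, secs = divmod(int(seconds), 60)  # fractional seconds are dropped
    return f"{minutes:02d}:{secs:02d}"


for position in (0, 65.9, 505, 2530):
    print(mmss(position))
```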

## Quick trim (no filler removal)

If the user just wants a simple trim (e.g., "cut the first 3s"):

```bash
ffmpeg -y -i "INPUT" -ss 3 -c copy "OUTPUT"
```

Use `-c copy` for instant lossless trim when no audio processing is needed.