---
name: transcription
description: Audio/video transcription using OpenAI Whisper. Covers installation, model selection, transcript formats (SRT, VTT, JSON), timing synchronization, and speaker diarization. Use when transcribing media or generating subtitles.
---
plugin: video-editing
updated: 2026-01-20

# Transcription with Whisper

Production-ready patterns for audio/video transcription using OpenAI Whisper.

## System Requirements

### Installation Options

**Option 1: OpenAI Whisper (Python)**
```bash
# macOS/Linux/Windows
pip install openai-whisper

# Verify
whisper --help
```

**Option 2: whisper.cpp (C++ - faster)**
```bash
# macOS
brew install whisper-cpp

# Linux - build from source
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Windows - use pre-built binaries or build with cmake
```

**Option 3: Insanely Fast Whisper (GPU accelerated)**
```bash
pip install insanely-fast-whisper
```

### Model Selection

| Model | Size | VRAM | Accuracy | Speed | Use Case |
|-------|------|------|----------|-------|----------|
| tiny | 39M | ~1GB | Low | Fastest | Quick previews |
| base | 74M | ~1GB | Medium | Fast | Draft transcripts |
| small | 244M | ~2GB | Good | Medium | General use |
| medium | 769M | ~5GB | Better | Slow | Quality transcripts |
| large-v3 | 1550M | ~10GB | Best | Slowest | Final production |

**Recommendation:** Start with `small` for speed/quality balance. Use `large-v3` for final delivery.

## Basic Transcription

### Using OpenAI Whisper

```bash
# Basic transcription (auto-detect language)
whisper audio.mp3 --model small

# Specify language and output format
whisper audio.mp3 --model medium --language en --output_format srt

# Multiple output formats
whisper audio.mp3 --model small --output_format all

# With timestamps and word-level timing
whisper audio.mp3 --model small --word_timestamps True
```

### Using whisper.cpp

```bash
# Download model first
./models/download-ggml-model.sh base.en

# Transcribe
./main -m models/ggml-base.en.bin -f audio.wav -osrt

# With timestamps
./main -m models/ggml-base.en.bin -f audio.wav -ocsv
```

## Output Formats

### SRT (SubRip Subtitle)
```
1
00:00:01,000 --> 00:00:04,500
Hello and welcome to this video.

2
00:00:05,000 --> 00:00:08,200
Today we'll discuss video editing.
```

### VTT (WebVTT)
```
WEBVTT

00:00:01.000 --> 00:00:04.500
Hello and welcome to this video.

00:00:05.000 --> 00:00:08.200
Today we'll discuss video editing.
```

### JSON (with word-level timing)
```json
{
  "text": "Hello and welcome to this video.",
  "segments": [
    {
      "id": 0,
      "start": 1.0,
      "end": 4.5,
      "text": " Hello and welcome to this video.",
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3},
        {"word": "and", "start": 1.4, "end": 1.5},
        {"word": "welcome", "start": 1.6, "end": 2.0},
        {"word": "to", "start": 2.1, "end": 2.2},
        {"word": "this", "start": 2.3, "end": 2.5},
        {"word": "video", "start": 2.6, "end": 3.0}
      ]
    }
  ]
}
```

## Audio Extraction for Transcription

Before transcribing video, extract audio in optimal format:

```bash
# Extract audio as WAV (16kHz, mono - optimal for Whisper)
ffmpeg -i video.mp4 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Extract as high-quality WAV for archival
ffmpeg -i video.mp4 -vn -c:a pcm_s16le audio.wav

# Extract as compressed MP3 (smaller, still works)
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 2 audio.mp3
```

## Timing Synchronization

### Convert Whisper JSON to FCP Timing

```python
import json

def whisper_to_fcp_timing(whisper_json_path, fps=24):
    """Convert Whisper JSON output to FCP-compatible timing."""
    with open(whisper_json_path) as f:
        data = json.load(f)

    segments = []
    for seg in data.get("segments", []):
        segments.append({
            "start_time": seg["start"],
            "end_time": seg["end"],
            "start_frame": int(seg["start"] * fps),
            "end_frame": int(seg["end"] * fps),
            "text": seg["text"].strip(),
            "words": seg.get("words", [])
        })

    return segments
```

### Frame-Accurate Timing

```bash
# Get exact frame count and duration
ffprobe -v error -count_frames -select_streams v:0 \
  -show_entries stream=nb_read_frames,duration,r_frame_rate \
  -of json video.mp4
```

## Speaker Diarization

For multi-speaker content, use pyannote.audio:

```bash
pip install pyannote.audio
```

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1")
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```

## Batch Processing

```bash
#!/bin/bash
# Transcribe all videos in directory

MODEL="small"
OUTPUT_DIR="transcripts"
mkdir -p "$OUTPUT_DIR"

for video in *.mp4 *.mov *.avi; do
  [[ -f "$video" ]] || continue

  base="${video%.*}"

  # Extract audio
  ffmpeg -i "$video" -ar 16000 -ac 1 -c:a pcm_s16le "/tmp/${base}.wav" -y

  # Transcribe
  whisper "/tmp/${base}.wav" --model "$MODEL" \
    --output_format all \
    --output_dir "$OUTPUT_DIR"

  # Cleanup temp audio
  rm "/tmp/${base}.wav"

  echo "Transcribed: $video"
done
```

## Quality Optimization

### Improve Accuracy

1. **Noise reduction before transcription:**
```bash
ffmpeg -i noisy_audio.wav -af "highpass=f=200,lowpass=f=3000,afftdn=nf=-25" clean_audio.wav
```

2. **Use language hint:**
```bash
whisper audio.mp3 --language en --model medium
```

3. **Provide initial prompt for context:**
```bash
whisper audio.mp3 --initial_prompt "Technical discussion about video editing software."
```

### Performance Tips

1. **GPU acceleration (if available):**
```bash
whisper audio.mp3 --model large-v3 --device cuda
```

2. **Process in chunks for long videos:**
```python
# Split audio into 10-minute chunks
# Transcribe each chunk
# Merge results with time offset adjustment
```

## Error Handling

```bash
# Validate audio file before transcription
validate_audio() {
  local file="$1"
  if ffprobe -v error -select_streams a:0 -show_entries stream=codec_type -of csv=p=0 "$file" 2>/dev/null | grep -q "audio"; then
    return 0
  else
    echo "Error: No audio stream found in $file"
    return 1
  fi
}

# Check Whisper installation
check_whisper() {
  if command -v whisper &> /dev/null; then
    echo "Whisper available"
    return 0
  else
    echo "Error: Whisper not installed. Run: pip install openai-whisper"
    return 1
  fi
}
```

## Related Skills

- **ffmpeg-core** - Audio extraction and preprocessing
- **final-cut-pro** - Import transcripts as titles/markers