--- name: consolidate-transcripts description: Consolidate transcripts from a channel into a single file, sorted by date (newest first), up to 800K tokens. Use when preparing transcripts for LLM context or bulk analysis. --- # Consolidate Transcripts **Why?** LLMs have context limits. This skill merges multiple transcripts into a single file with accurate token counting, so you can feed an entire channel's content to Claude or GPT without exceeding limits. ## Quick Start ```bash python scripts/consolidate_transcripts.py ``` Output: `~/Documents/YTScriber//-consolidated.md` > [!NOTE] > This feature is currently a standalone script. A `ytscriber consolidate` CLI command is planned for a future release. --- ## Workflow ### 1. Identify the Channel List available channels: ```bash ls ~/Documents/YTScriber/ ``` ### 2. Choose Token Limit | Use Case | Recommended Limit | Flag | |----------|-------------------|------| | Claude (200K context) | 150000 | `--limit 150000` | | GPT-4 Turbo (128K) | 100000 | `--limit 100000` | | Full archive (Claude Pro) | 800000 | (default) | | Quick sample | 50000 | `--limit 50000` | > [!TIP] > The default 800K limit leaves ~200K tokens for prompts and responses when using Claude's 1M context. ### 3. Run Consolidation ```bash python scripts/consolidate_transcripts.py [--limit TOKENS] [--verbose] ``` **Examples:** ```bash # Default (800K tokens) python scripts/consolidate_transcripts.py library-of-minds # Custom limit for GPT-4 python scripts/consolidate_transcripts.py aws-reinvent-2025 --limit 100000 # Verbose output showing all included files python scripts/consolidate_transcripts.py dwarkesh-patel --verbose ``` ### 4. Verify Output Check the consolidated file was created: ```bash ls -la ~/Documents/YTScriber//*-consolidated.md ``` --- ## Parameters | Option | Description | Default | |--------|-------------|---------| | `channel_name` | Folder name in data directory | Required | | `--limit, -l` | Maximum tokens to include | 800000 | | `--verbose, -v` | Show detailed file list | False | --- ## Output Format The consolidated file includes: 1. **Header** — Generation metadata, total transcripts, token/word counts 2. **Table of Contents** — Dates, titles, tokens, words per transcript 3. **Transcripts** — Full text with title, date, author, source URL --- ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `ModuleNotFoundError: tiktoken` | tiktoken not installed | `pip install tiktoken` | | `No transcripts found` | Empty transcripts folder | Run `ytscriber download` first | | `FileNotFoundError` | Channel doesn't exist | Check `ls ~/Documents/YTScriber/` for valid names | | Output file is small | Few transcripts available | Use `--verbose` to see what was included | | Token count seems wrong | Old tiktoken version | `pip install --upgrade tiktoken` | --- ## Common Mistakes 1. **Wrong channel name** — Use the folder name exactly as shown in `ls ~/Documents/YTScriber/`, not the YouTube channel name. 2. **Forgetting to download transcripts first** — Consolidation requires transcripts to exist. Run `ytscriber download` first. 3. **Using too high a limit** — If you exceed your LLM's context, you'll get truncation errors. Use the limit guide above. 4. **Expecting real-time updates** — Re-run consolidation after downloading new transcripts. --- ## Reference - Transcripts sorted **newest first** (descending by date) - Files without dates in filename are placed last - Token counting uses `cl100k_base` encoding (GPT-4/Claude compatible) - Consolidated files are gitignored (not committed) - Re-running overwrites the previous consolidated file