---
name: elevenlabs-stt
description: "ElevenLabs speech-to-text with Scribe models and forced alignment via inference.sh CLI. Models: Scribe v1/v2 (98%+ accuracy, 90+ languages). Capabilities: transcription, speaker diarization, audio event tagging, word-level timestamps, forced alignment, subtitle generation. Use for: meeting transcription, subtitles, podcast transcripts, lip-sync timing, karaoke. Triggers: elevenlabs stt, elevenlabs transcription, scribe, elevenlabs speech to text, forced alignment, word alignment, subtitle timing, diarization, speaker identification, audio event detection, eleven labs transcribe"
allowed-tools: Bash(belt *)
---

> **Install the belt CLI skill:** `npx skills add belt-sh/cli`

# ElevenLabs Speech-to-Text

High-accuracy transcription with Scribe models via [inference.sh](https://inference.sh) CLI.

![ElevenLabs STT](https://cloud.inference.sh/u/4mg21r6ta37mpaz6ktzwtt8krr/01jz025e88nkvw55at1rqtj5t8.png)

## Quick Start

> Requires inference.sh CLI (`belt`). [Install instructions](https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md)

```bash
belt login

# Transcribe audio
belt app run elevenlabs/stt --input '{"audio": "https://audio.mp3"}'
```

## Available Models

| Model | ID | Best For |
|-------|----|----------|
| Scribe v2 | `scribe_v2` | Latest, highest accuracy (default) |
| Scribe v1 | `scribe_v1` | Stable, proven |

- 98%+ transcription accuracy
- 90+ languages with auto-detection

## Examples

### Basic Transcription

```bash
belt app run elevenlabs/stt --input '{"audio": "https://meeting-recording.mp3"}'
```

### With Speaker Identification

```bash
belt app run elevenlabs/stt --input '{
  "audio": "https://meeting.mp3",
  "diarize": true
}'
```

### Audio Event Tagging

Detect laughter, applause, music, and other non-speech events:

```bash
belt app run elevenlabs/stt --input '{
  "audio": "https://podcast.mp3",
  "tag_audio_events": true
}'
```

### Specify Language

```bash
belt app run elevenlabs/stt --input '{
"audio": "https://spanish-audio.mp3", "language_code": "spa" }' ``` ### Full Options ```bash belt app run elevenlabs/stt --input '{ "audio": "https://conference.mp3", "model": "scribe_v2", "diarize": true, "tag_audio_events": true, "language_code": "eng" }' ``` ## Forced Alignment Get precise word-level and character-level timestamps by aligning known text to audio. Useful for subtitles, lip-sync, and karaoke. ```bash belt app run elevenlabs/forced-alignment --input '{ "audio": "https://narration.mp3", "text": "This is the exact text spoken in the audio file." }' ``` ### Output Format ```json { "words": [ {"text": "This", "start": 0.0, "end": 0.3}, {"text": "is", "start": 0.35, "end": 0.5}, {"text": "the", "start": 0.55, "end": 0.65} ], "text": "This is the exact text spoken in the audio file." } ``` ### Forced Alignment Use Cases - **Subtitles**: Precise timing for video captions - **Lip-sync**: Align audio to animated characters - **Karaoke**: Word-by-word timing for lyrics - **Accessibility**: Synchronized transcripts ## Workflow: Video Subtitles ```bash # 1. Transcribe video audio belt app run elevenlabs/stt --input '{ "audio": "https://video.mp4", "diarize": true }' > transcript.json # 2. Use transcript for captions belt app run infsh/caption-videos --input '{ "video_url": "https://video.mp4", "captions": "" }' ``` ## Supported Languages 90+ languages including: English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Russian, Turkish, Dutch, Swedish, and many more. Leave `language_code` empty for automatic detection. 

## Use Cases

- **Meetings**: Transcribe recordings with speaker identification
- **Podcasts**: Generate transcripts with audio event tags
- **Subtitles**: Create timed captions for videos
- **Research**: Interview transcription with diarization
- **Accessibility**: Make audio content searchable and accessible
- **Lip-sync**: Forced alignment for animation timing

## Related Skills

```bash
# ElevenLabs TTS (reverse direction)
npx skills add inference-sh/skills@elevenlabs-tts

# ElevenLabs dubbing (translate audio)
npx skills add inference-sh/skills@elevenlabs-dubbing

# Other STT models (Whisper)
npx skills add inference-sh/skills@speech-to-text

# Full platform skill (all 250+ apps)
npx skills add inference-sh/skills@infsh-cli
```

Browse all audio apps: `belt app store --category audio`