--- name: speech-to-text description: "Use this skill whenever the user wants to transcribe audio to text, convert speech to text, or get a transcript from an audio or video file. Triggers include: any mention of 'transcribe', 'transcription', 'speech to text', 'STT', 'convert audio to text', 'what does this audio say', 'get transcript', 'subtitle generation', or requests to extract spoken words from a file. Also use when the user wants speaker identification from audio, timestamps for captions, or multilingual transcription." permissions: - network - filesystem metadata: {"openclaw": {"primaryEnv": "NOIZ_API_KEY"}} --- # speech-to-text Transcribe any audio file to text. Supports multilingual auto-detection, timestamps, and speaker labels. ## Triggers - transcribe / transcript / transcription - speech to text / STT / audio to text - what does this audio say / convert audio - 转录 / 语音转文字 / 识别音频 ## Quick Start ```bash # Transcribe with auto language detection python3 skills/speech-to-text/scripts/stt.py audio.mp3 # Specify language explicitly python3 skills/speech-to-text/scripts/stt.py interview.wav --language en # Save transcript to file python3 skills/speech-to-text/scripts/stt.py podcast.m4a -o transcript.txt # Output full JSON (with timestamps and speaker labels) python3 skills/speech-to-text/scripts/stt.py meeting.wav --json -o result.json ``` ## Arguments | Argument | Default | Description | |----------|---------|-------------| | `file` | required | Audio file to transcribe (mp3, wav, m4a, ogg, flac, aac, webm). Max 50 MB, max 10 min. | | `--language` / `-l` | auto-detect | BCP-47 language code (e.g. `en`, `zh`, `ja`). Omit to auto-detect. | | `--output` / `-o` | stdout | Path to save transcript text (or JSON if `--json` is set). | | `--json` | off | Output full JSON response with timestamps and speaker labels. | | `--api-key` | from env/config | Noiz API key (overrides stored key). | ## Output Format Without `--json`, only the transcript text is printed: ``` Hello, welcome to today's podcast. We have a special guest joining us... ``` With `--json`, the full structured response is printed: ```json { "language": "en", "transcript": "Hello, welcome to today's podcast...", "duration": 42.5, "segments": [ {"text": "Hello, welcome to today's podcast.", "start": 0.0, "end": 3.2, "spk": 0}, {"text": "We have a special guest joining us.", "start": 3.5, "end": 6.1, "spk": 0} ] } ``` ## Supported Languages Common codes: `en` (English), `zh` (Chinese), `ja` (Japanese), `ko` (Korean), `es` (Spanish), `fr` (French), `de` (German), `pt` (Portuguese), `ru` (Russian), `ar` (Arabic). Omit `--language` to auto-detect. ## Configuration ```bash # Save your API key once python3 skills/speech-to-text/scripts/stt.py config --set-api-key YOUR_KEY # Or set via environment variable export NOIZ_API_KEY=YOUR_KEY ``` Get your API key at [developers.noiz.ai](https://developers.noiz.ai/api-keys). ## Pricing Billed at **$0.0006 per second** of audio. A 10-minute file costs ~$0.36. New accounts include 10,000 free TTS characters; STT is billed separately. ## Security & data disclosure - **Credential storage**: API key is saved to `~/.config/noiz/api_key` (permissions `0600`). `NOIZ_API_KEY` env var is also supported. - **Network calls**: The audio file is uploaded to `https://noiz.ai/v1/speech-to-text` for transcription. No data is sent until you run the command. - **File limits**: Max 50 MB per file, max 10 minutes (600 seconds) of audio. ## Requirements - `requests` package: `pip install requests` - Get your API key at [developers.noiz.ai](https://developers.noiz.ai/api-keys)