openapi: 3.1.0
info:
  title: Text-to-Speech (TTS)
  version: 1.0.0
paths:
  /v0/tts/stream/json:
    post:
      operationId: synthesize-json-streaming
      summary: Text-to-Speech (Streamed JSON)
      description: >-
        Streams synthesized speech using the specified voice. If no voice is
        provided, a novel voice will be generated dynamically. Optionally,
        additional context can be included to influence the speech's style and
        prosody. The response is a stream of JSON objects including audio
        encoded in base64.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            text/event-stream:
              schema:
                $ref: '#/components/schemas/TtsOutput'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgsStream'
  /v0/tts/stream/file:
    post:
      operationId: synthesize-file-streaming
      summary: Text-to-Speech (Streamed File)
      description: >-
        Streams synthesized speech using the specified voice. If no voice is
        provided, a novel voice will be generated dynamically. Optionally,
        additional context can be included to influence the speech's style and
        prosody.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: OK
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgsStream'
  /v0/tts:
    post:
      operationId: synthesize-json
      summary: Text-to-Speech (JSON)
      description: >-
        Synthesizes one or more input texts into speech using the specified
        voice. If no voice is provided, a novel voice will be generated
        dynamically.
        Optionally, additional context can be included to influence the
        speech's style and prosody. The response includes the base64-encoded
        audio and metadata in JSON format.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/OctaveResponse'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgs'
  /v0/tts/file:
    post:
      operationId: synthesize-file
      summary: Text-to-Speech (File)
      description: >-
        Synthesizes one or more input texts into speech using the specified
        voice. If no voice is provided, a novel voice will be generated
        dynamically. Optionally, additional context can be included to
        influence the speech's style and prosody. The response contains the
        generated audio file in the requested format.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: OK
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgs'
  /v0/tts/voice_conversion/file:
    post:
      operationId: convert-voice-file
      summary: Voice Conversion (Streamed File)
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          multipart/form-data:
            schema:
              type: object
              properties:
                strip_headers:
                  type: boolean
                  description: >-
                    If enabled, the audio for all the chunks of a generation,
                    once concatenated together, will constitute a single audio
                    file. Otherwise, if disabled, each chunk's audio will be
                    its own audio file, each with its own headers (if
                    applicable).
                audio:
                  type: string
                  format: binary
                  description: >-
                    Audio file containing speech to be converted to the target
                    voice. Supported formats include `MP3`, `WAV`, `M4A`, and
                    `OGG`.
                context:
                  oneOf:
                    - $ref: >-
                        #/components/schemas/V0TtsVoiceConversionFilePostRequestBodyContentMultipartFormDataSchemaContext
                    - type: 'null'
                  description: >-
                    Utterances to use as context for generating consistent
                    speech style and prosody across multiple requests. These
                    will not be converted to speech output.
                voice:
                  $ref: '#/components/schemas/VoiceRef'
                format:
                  $ref: '#/components/schemas/Format'
                  description: Specifies the output audio file format.
                include_timestamp_types:
                  type: array
                  items:
                    $ref: '#/components/schemas/TimestampType'
                  description: >-
                    The set of timestamp types to include in the response.
                    When used in multipart/form-data, specify each value using
                    bracket notation:
                    `include_timestamp_types[0]=word&include_timestamp_types[1]=phoneme`.
                    Only supported for Octave 2 requests.
              required:
                - audio
  /v0/tts/voice_conversion/json:
    post:
      operationId: convert-voice-json
      summary: Voice Conversion (Streamed JSON)
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            text/event-stream:
              schema:
                $ref: '#/components/schemas/TtsOutput'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          multipart/form-data:
            schema:
              type: object
              properties:
                strip_headers:
                  type: boolean
                  description: >-
                    If enabled, the audio for all the chunks of a generation,
                    once concatenated together, will constitute a single audio
                    file. Otherwise, if disabled, each chunk's audio will be
                    its own audio file, each with its own headers (if
                    applicable).
                audio:
                  type: string
                  format: binary
                  description: >-
                    Audio file containing speech to be converted to the target
                    voice. Supported formats include `MP3`, `WAV`, `M4A`, and
                    `OGG`.
                context:
                  oneOf:
                    - $ref: >-
                        #/components/schemas/V0TtsVoiceConversionJsonPostRequestBodyContentMultipartFormDataSchemaContext
                    - type: 'null'
                  description: >-
                    Utterances to use as context for generating consistent
                    speech style and prosody across multiple requests. These
                    will not be converted to speech output.
                voice:
                  $ref: '#/components/schemas/VoiceRef'
                format:
                  $ref: '#/components/schemas/Format'
                  description: Specifies the output audio file format.
                include_timestamp_types:
                  type: array
                  items:
                    $ref: '#/components/schemas/TimestampType'
                  description: >-
                    The set of timestamp types to include in the response.
                    When used in multipart/form-data, specify each value using
                    bracket notation:
                    `include_timestamp_types[0]=word&include_timestamp_types[1]=phoneme`.
                    Only supported for Octave 2 requests.
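# Illustrative sketch (not part of the schema): form fields for a request to
# either voice_conversion endpoint above, written as a YAML mapping of field
# name to value. The file name is hypothetical. How object-valued fields such
# as `voice` are serialized in multipart/form-data is not stated in this
# spec, so they are left out rather than guessed.
#
#   audio: "@speech.wav"                  # binary file part (hypothetical file name)
#   strip_headers: "true"
#   include_timestamp_types[0]: word      # bracket notation, one field per value
#   include_timestamp_types[1]: phoneme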
servers:
  - url: https://api.hume.ai
components:
  schemas:
    ContextGenerationId:
      type: object
      properties:
        generation_id:
          type: string
          format: uuid4
          description: >-
            The ID of a prior TTS generation to use as context for generating
            consistent speech style and prosody across multiple requests.
            Including context may increase audio generation times.
      required:
        - generation_id
      title: ContextGenerationId
    VoiceProvider:
      type: string
      enum:
        - HUME_AI
        - CUSTOM_VOICE
      title: VoiceProvider
    VoiceId:
      type: object
      properties:
        id:
          type: string
          description: The unique ID associated with the **Voice**.
        provider:
          $ref: '#/components/schemas/VoiceProvider'
          description: >-
            Specifies the source provider associated with the chosen voice.

            - **`HUME_AI`**: Select voices from Hume's [Voice
            Library](https://app.hume.ai/tts/voice-library), containing a
            variety of preset, shared voices.

            - **`CUSTOM_VOICE`**: Select from voices you've personally
            generated and saved in your account.


            If no provider is explicitly set, the default provider is
            `CUSTOM_VOICE`. When using voices from Hume's **Voice Library**,
            you must explicitly set the provider to `HUME_AI`. Preset voices
            from Hume's **Voice Library** are accessible by all users. In
            contrast, your custom voices are private and accessible only via
            requests authenticated with your API key.
      required:
        - id
      title: VoiceId
    VoiceName:
      type: object
      properties:
        name:
          type: string
          description: The name of a **Voice**.
        provider:
          $ref: '#/components/schemas/VoiceProvider'
          description: >-
            Specifies the source provider associated with the chosen voice.

            - **`HUME_AI`**: Select voices from Hume's [Voice
            Library](https://app.hume.ai/tts/voice-library), containing a
            variety of preset, shared voices.

            - **`CUSTOM_VOICE`**: Select from voices you've personally
            generated and saved in your account.


            If no provider is explicitly set, the default provider is
            `CUSTOM_VOICE`. When using voices from Hume's **Voice Library**,
            you must explicitly set the provider to `HUME_AI`.
            Preset voices from Hume's **Voice Library** are accessible by all
            users. In contrast, your custom voices are private and accessible
            only via requests authenticated with your API key.
      required:
        - name
      title: VoiceName
    VoiceRef:
      oneOf:
        - $ref: '#/components/schemas/VoiceId'
        - $ref: '#/components/schemas/VoiceName'
      title: VoiceRef
    Utterance:
      type: object
      properties:
        description:
          type:
            - string
            - 'null'
          description: >-
            Natural language instructions describing how the synthesized
            speech should sound, including but not limited to tone,
            intonation, pacing, and accent. **This field behaves differently
            depending on whether a voice is specified**:

            - **Voice specified**: the description will serve as acting
            directions for delivery. Keep directions concise (100 characters
            or fewer) for best results. See our guide on [acting
            instructions](/docs/text-to-speech-tts/acting-instructions).

            - **Voice not specified**: the description will serve as a voice
            prompt for generating a voice. See our [prompting
            guide](/docs/text-to-speech-tts/prompting) for design tips.
        speed:
          type: number
          format: double
          default: 1
          description: >-
            Speed multiplier for the synthesized speech. Values below 0.75 or
            above 1.5 may cause instability in the generated output.
        text:
          type: string
          description: The input text to be synthesized into speech.
        trailing_silence:
          type: number
          format: double
          default: 0
          description: Duration of trailing silence (in seconds) to add to this utterance.
        voice:
          oneOf:
            - $ref: '#/components/schemas/VoiceRef'
            - type: 'null'
          description: >-
            The `name` or `id` associated with a **Voice** from the **Voice
            Library** to be used as the speaker for this and all subsequent
            `utterances`, until the `voice` field is updated again. See our
            [voices guide](/docs/text-to-speech-tts/voices) for more details
            on generating and specifying **Voices**.
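      # Illustrative sketch (not part of the schema): two `utterances` lists
      # showing the two roles of `description` described above. The voice name
      # and all text values are hypothetical.
      #
      #   # Voice specified: `description` acts as acting directions.
      #   utterances:
      #     - text: "Welcome back."
      #       description: "calm, unhurried"
      #       voice:
      #         name: "Example Narrator"       # hypothetical voice name
      #         provider: HUME_AI
      #       trailing_silence: 0.5
      #     - text: "Let's pick up where we left off."   # keeps the voice set above
      #
      #   # Voice not specified: `description` acts as a voice prompt.
      #   utterances:
      #     - text: "Once upon a time."
      #       description: "A warm, elderly storyteller"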
      required:
        - text
      title: Utterance
    ContextUtterances:
      type: object
      properties:
        utterances:
          type: array
          items:
            $ref: '#/components/schemas/Utterance'
      required:
        - utterances
      title: ContextUtterances
    OctaveBodyArgsStreamContext:
      oneOf:
        - $ref: '#/components/schemas/ContextGenerationId'
        - $ref: '#/components/schemas/ContextUtterances'
      description: >-
        Utterances to use as context for generating consistent speech style
        and prosody across multiple requests. These will not be converted to
        speech output.
      title: OctaveBodyArgsStreamContext
    OctaveBodyArgsStreamFormat:
      oneOf:
        - type: object
          properties:
            type:
              type: string
              enum:
                - mp3
              description: 'Discriminator value: mp3'
          required:
            - type
          description: Mp3Format variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - pcm
              description: 'Discriminator value: pcm'
          required:
            - type
          description: PcmFormat variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - wav
              description: 'Discriminator value: wav'
          required:
            - type
          description: WavFormat variant
      discriminator:
        propertyName: type
      description: Specifies the output audio file format.
      title: OctaveBodyArgsStreamFormat
    TimestampType:
      type: string
      enum:
        - word
        - phoneme
      title: TimestampType
    OctaveVersion:
      type: string
      enum:
        - '1'
        - '2'
      description: |-
        Selects the Octave model version used to synthesize speech for this
        request. If you omit this field, Hume automatically routes the request
        to the most appropriate model. Setting a specific version ensures
        stable and repeatable behavior across requests.
        Use `2` to opt into the latest Octave capabilities. When you specify
        version `2`, you must also provide a `voice`. Requests that set
        `version: 2` without a voice will be rejected.
        For a comparison of Octave versions, see the
        [Octave versions](/docs/text-to-speech-tts/overview#octave-versions)
        section in the TTS overview.
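      # Illustrative sketch (not part of the schema): a request body that pins
      # Octave 2. Per the description above, version `2` requires a `voice`.
      # The enum values are strings, so the version is quoted here. The voice
      # name and text are hypothetical.
      #
      #   version: '2'
      #   utterances:
      #     - text: "Hello there."
      #       voice:
      #         name: "Example Narrator"     # hypothetical voice name
      #         provider: HUME_AI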
      title: OctaveVersion
    OctaveBodyArgsStream:
      type: object
      properties:
        context:
          oneOf:
            - $ref: '#/components/schemas/OctaveBodyArgsStreamContext'
            - type: 'null'
          description: >-
            Utterances to use as context for generating consistent speech
            style and prosody across multiple requests. These will not be
            converted to speech output.
        format:
          $ref: '#/components/schemas/OctaveBodyArgsStreamFormat'
          description: Specifies the output audio file format.
        include_timestamp_types:
          type: array
          items:
            $ref: '#/components/schemas/TimestampType'
          description: The set of timestamp types to include in the response. Only supported for Octave 2 requests.
        instant_mode:
          type: boolean
          default: true
          description: >-
            Enables ultra-low latency streaming, significantly reducing the
            time until the first audio chunk is received. Recommended for
            real-time applications requiring immediate audio playback. For
            further details, see our documentation on
            [instant mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode).

            - A
            [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice)
            must be specified when instant mode is enabled. Dynamic voice
            generation is not supported with this mode.

            - Instant mode is only supported for streaming endpoints (e.g.,
            [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming),
            [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).

            - Ensure only a single generation is requested
            ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations)
            must be `1` or omitted).
        num_generations:
          type: integer
          default: 1
          description: >-
            Number of audio generations to produce from the input utterances.
            Using `num_generations` enables faster processing than issuing
            multiple sequential requests.
            Additionally, specifying `num_generations` allows prosody
            continuation across all generations without repeating context,
            ensuring each generation sounds slightly different while
            maintaining contextual consistency.
        split_utterances:
          type: boolean
          default: true
          description: >-
            Controls how audio output is segmented in the response.

            - When **enabled** (`true`), input utterances are automatically
            split into natural-sounding speech segments.

            - When **disabled** (`false`), the response maintains a strict
            one-to-one mapping between input utterances and output snippets.


            This setting affects how the `snippets` array is structured in the
            response, which may be important for applications that need to
            track the relationship between input text and generated audio
            segments. When setting to `false`, avoid including utterances with
            long `text`, as this can result in distorted output.
        strip_headers:
          type: boolean
          default: false
          description: >-
            If enabled, the audio for all the chunks of a generation, once
            concatenated together, will constitute a single audio file.
            Otherwise, if disabled, each chunk's audio will be its own audio
            file, each with its own headers (if applicable).
        utterances:
          type: array
          items:
            $ref: '#/components/schemas/Utterance'
          description: >-
            A list of **Utterances** to be converted to speech output. An
            **Utterance** is a unit of input for
            [Octave](/docs/text-to-speech-tts/overview), and includes input
            `text`, an optional `description` to serve as the prompt for how
            the speech should be delivered, an optional `voice` specification,
            and additional controls to guide delivery for `speed` and
            `trailing_silence`.
        version:
          $ref: '#/components/schemas/OctaveVersion'
          description: >-
            Selects the Octave model version used to synthesize speech for
            this request. If you omit this field, Hume automatically routes
            the request to the most appropriate model. Setting a specific
            version ensures stable and repeatable behavior across requests.
            Use `2` to opt into the latest Octave capabilities. When you
            specify version `2`, you must also provide a `voice`. Requests
            that set `version: 2` without a voice will be rejected. For a
            comparison of Octave versions, see the
            [Octave versions](/docs/text-to-speech-tts/overview#octave-versions)
            section in the TTS overview.
      required:
        - utterances
      title: OctaveBodyArgsStream
    FormatType:
      type: string
      enum:
        - mp3
        - pcm
        - wav
      title: FormatType
    MillisecondInterval:
      type: object
      properties:
        begin:
          type: integer
          description: Start time of the interval in milliseconds.
        end:
          type: integer
          description: End time of the interval in milliseconds.
      required:
        - begin
        - end
      title: MillisecondInterval
    Timestamp-Output:
      type: object
      properties:
        text:
          type: string
          description: The word or phoneme text that the timestamp corresponds to.
        time:
          $ref: '#/components/schemas/MillisecondInterval'
          description: The start and end timestamps for the word or phoneme in milliseconds.
        type:
          $ref: '#/components/schemas/TimestampType'
      required:
        - text
        - time
        - type
      title: Timestamp-Output
    Snippet-Output:
      type: object
      properties:
        audio:
          type: string
          description: The segmented audio output in the requested format, encoded as a base64 string.
        generation_id:
          type: string
          format: uuid4
          description: The generation ID this snippet corresponds to.
        id:
          type: string
          format: uuid4
          description: A unique ID associated with this **Snippet**.
        text:
          type: string
          description: The text for this **Snippet**.
        timestamps:
          type: array
          items:
            $ref: '#/components/schemas/Timestamp-Output'
          description: >-
            A list of word or phoneme level timestamps for the generated
            audio. Timestamps are only returned for Octave 2 requests.
        transcribed_text:
          type:
            - string
            - 'null'
          description: The transcribed text of the generated audio. It is only present if `instant_mode` is set to `false`.
        utterance_index:
          type:
            - integer
            - 'null'
          description: The index of the utterance in the request this snippet corresponds to.
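      # Illustrative sketch (not part of the schema): one entry of the
      # `timestamps` array above, shaped as Timestamp-Output with a
      # MillisecondInterval. The word and the millisecond values are made up.
      #
      #   text: "hello"
      #   time:
      #     begin: 0       # milliseconds
      #     end: 320       # milliseconds
      #   type: word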
      required:
        - audio
        - generation_id
        - id
        - text
        - timestamps
        - transcribed_text
        - utterance_index
      title: Snippet-Output
    TtsOutput:
      oneOf:
        - type: object
          properties:
            type:
              type: string
              enum:
                - audio
              description: 'Discriminator value: audio'
            audio:
              type: string
              description: The generated audio output chunk in the requested format.
            audio_format:
              $ref: '#/components/schemas/FormatType'
              description: The generated audio output format.
            chunk_index:
              type: integer
              description: The index of the audio chunk in the snippet.
            generation_id:
              type: string
              format: uuid4
              description: The generation ID of the parent snippet that this chunk corresponds to.
            is_last_chunk:
              type: boolean
              description: Whether or not this is the last chunk streamed back from the decoder for one input snippet.
            request_id:
              type: string
              description: ID of the initiating request.
            snippet:
              $ref: '#/components/schemas/Snippet-Output'
            snippet_id:
              type: string
              format: uuid4
              description: The ID of the parent snippet that this chunk corresponds to.
            text:
              type: string
              description: The text of the parent snippet that this chunk corresponds to.
            transcribed_text:
              type:
                - string
                - 'null'
              description: >-
                The transcribed text of the generated audio of the parent
                snippet that this chunk corresponds to. It is only present if
                `instant_mode` is set to `false`.
            utterance_index:
              type:
                - integer
                - 'null'
              description: The index of the utterance in the request that the parent snippet of this chunk corresponds to.
          required:
            - type
            - audio
            - audio_format
            - chunk_index
            - generation_id
            - is_last_chunk
            - request_id
            - snippet_id
            - text
            - transcribed_text
            - utterance_index
          description: SnippetAudioChunk variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - timestamp
              description: 'Discriminator value: timestamp'
            generation_id:
              type: string
              format: uuid4
              description: The generation ID of the parent snippet that this chunk corresponds to.
            request_id:
              type: string
              description: ID of the initiating request.
            snippet_id:
              type: string
              format: uuid4
              description: The ID of the parent snippet that this chunk corresponds to.
            timestamp:
              $ref: '#/components/schemas/Timestamp-Output'
              description: A word or phoneme level timestamp for the generated audio.
          required:
            - type
            - generation_id
            - request_id
            - snippet_id
            - timestamp
          description: OctaveOutputTimestamp variant
      discriminator:
        propertyName: type
      title: TtsOutput
    ValidationErrorLocItems:
      oneOf:
        - type: string
        - type: integer
      title: ValidationErrorLocItems
    ValidationError:
      type: object
      properties:
        loc:
          type: array
          items:
            $ref: '#/components/schemas/ValidationErrorLocItems'
        msg:
          type: string
        type:
          type: string
      required:
        - loc
        - msg
        - type
      title: ValidationError
    HTTPValidationError:
      type: object
      properties:
        detail:
          type: array
          items:
            $ref: '#/components/schemas/ValidationError'
      title: HTTPValidationError
    OctaveBodyArgsContext:
      oneOf:
        - $ref: '#/components/schemas/ContextGenerationId'
        - $ref: '#/components/schemas/ContextUtterances'
      description: >-
        Utterances to use as context for generating consistent speech style
        and prosody across multiple requests. These will not be converted to
        speech output.
      title: OctaveBodyArgsContext
    Format:
      oneOf:
        - type: object
          properties:
            type:
              type: string
              enum:
                - mp3
              description: 'Discriminator value: mp3'
          required:
            - type
          description: Mp3Format variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - pcm
              description: 'Discriminator value: pcm'
          required:
            - type
          description: PcmFormat variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - wav
              description: 'Discriminator value: wav'
          required:
            - type
          description: WavFormat variant
      discriminator:
        propertyName: type
      description: Specifies the output audio file format.
      title: Format
    OctaveBodyArgs:
      type: object
      properties:
        context:
          oneOf:
            - $ref: '#/components/schemas/OctaveBodyArgsContext'
            - type: 'null'
          description: >-
            Utterances to use as context for generating consistent speech
            style and prosody across multiple requests.
            These will not be converted to speech output.
        format:
          $ref: '#/components/schemas/Format'
          description: Specifies the output audio file format.
        include_timestamp_types:
          type: array
          items:
            $ref: '#/components/schemas/TimestampType'
          description: The set of timestamp types to include in the response. Only supported for Octave 2 requests.
        num_generations:
          type: integer
          default: 1
          description: >-
            Number of audio generations to produce from the input utterances.
            Using `num_generations` enables faster processing than issuing
            multiple sequential requests. Additionally, specifying
            `num_generations` allows prosody continuation across all
            generations without repeating context, ensuring each generation
            sounds slightly different while maintaining contextual
            consistency.
        split_utterances:
          type: boolean
          default: true
          description: >-
            Controls how audio output is segmented in the response.

            - When **enabled** (`true`), input utterances are automatically
            split into natural-sounding speech segments.

            - When **disabled** (`false`), the response maintains a strict
            one-to-one mapping between input utterances and output snippets.


            This setting affects how the `snippets` array is structured in the
            response, which may be important for applications that need to
            track the relationship between input text and generated audio
            segments. When setting to `false`, avoid including utterances with
            long `text`, as this can result in distorted output.
        strip_headers:
          type: boolean
          default: false
          description: >-
            If enabled, the audio for all the chunks of a generation, once
            concatenated together, will constitute a single audio file.
            Otherwise, if disabled, each chunk's audio will be its own audio
            file, each with its own headers (if applicable).
        utterances:
          type: array
          items:
            $ref: '#/components/schemas/Utterance'
          description: >-
            A list of **Utterances** to be converted to speech output.
            An **Utterance** is a unit of input for
            [Octave](/docs/text-to-speech-tts/overview), and includes input
            `text`, an optional `description` to serve as the prompt for how
            the speech should be delivered, an optional `voice` specification,
            and additional controls to guide delivery for `speed` and
            `trailing_silence`.
        version:
          $ref: '#/components/schemas/OctaveVersion'
          description: >-
            Selects the Octave model version used to synthesize speech for
            this request. If you omit this field, Hume automatically routes
            the request to the most appropriate model. Setting a specific
            version ensures stable and repeatable behavior across requests.
            Use `2` to opt into the latest Octave capabilities. When you
            specify version `2`, you must also provide a `voice`. Requests
            that set `version: 2` without a voice will be rejected. For a
            comparison of Octave versions, see the
            [Octave versions](/docs/text-to-speech-tts/overview#octave-versions)
            section in the TTS overview.
      required:
        - utterances
      title: OctaveBodyArgs
    Encoding:
      type: object
      properties:
        format:
          $ref: '#/components/schemas/FormatType'
          description: Format for the output audio.
        sample_rate:
          type: integer
          description: The sample rate (`Hz`) of the generated audio. The default sample rate is `48000 Hz`.
      required:
        - format
        - sample_rate
      description: Encoding information about the generated audio, including the `format` and `sample_rate`.
      title: Encoding
    Generation:
      type: object
      properties:
        audio:
          type: string
          description: The generated audio output in the requested format, encoded as a base64 string.
        duration:
          type: number
          format: double
          description: Duration of the generated audio in seconds.
        encoding:
          $ref: '#/components/schemas/Encoding'
        file_size:
          type: integer
          description: Size of the generated audio in bytes.
        generation_id:
          type: string
          format: uuid4
          description: >-
            A unique ID associated with this TTS generation that can be used
            as context for generating consistent speech style and prosody
            across multiple requests.
        snippets:
          type: array
          items:
            type: array
            items:
              $ref: '#/components/schemas/Snippet-Output'
          description: >-
            A list of snippet groups where each group corresponds to an
            utterance in the request. Each group contains segmented snippets
            that represent the original utterance divided into more
            natural-sounding units optimized for speech delivery.
      required:
        - audio
        - duration
        - encoding
        - file_size
        - generation_id
        - snippets
      title: Generation
    OctaveResponse:
      type: object
      properties:
        generations:
          type: array
          items:
            $ref: '#/components/schemas/Generation'
        request_id:
          type:
            - string
            - 'null'
          description: >-
            A unique ID associated with this request for tracking and
            troubleshooting. Use this ID when contacting [support](/support)
            for troubleshooting assistance.
      required:
        - generations
        - request_id
      title: OctaveResponse
    V0TtsVoiceConversionFilePostRequestBodyContentMultipartFormDataSchemaContext:
      oneOf:
        - $ref: '#/components/schemas/ContextGenerationId'
        - $ref: '#/components/schemas/ContextUtterances'
      description: >-
        Utterances to use as context for generating consistent speech style
        and prosody across multiple requests. These will not be converted to
        speech output.
      title: V0TtsVoiceConversionFilePostRequestBodyContentMultipartFormDataSchemaContext
    V0TtsVoiceConversionJsonPostRequestBodyContentMultipartFormDataSchemaContext:
      oneOf:
        - $ref: '#/components/schemas/ContextGenerationId'
        - $ref: '#/components/schemas/ContextUtterances'
      description: >-
        Utterances to use as context for generating consistent speech style
        and prosody across multiple requests. These will not be converted to
        speech output.
      title: V0TtsVoiceConversionJsonPostRequestBodyContentMultipartFormDataSchemaContext
  securitySchemes:
    bearerAuth:
      type: apiKey
      in: header
      name: X-Hume-Api-Key
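# Illustrative sketch (not part of the spec): a JSON body for
# POST /v0/tts/stream/json, written here as YAML, that satisfies the
# instant_mode constraints listed under OctaveBodyArgsStream: a voice is
# specified and num_generations is 1. The voice name and text are
# hypothetical. The request also carries the X-Hume-Api-Key header.
#
#   utterances:
#     - text: "Thanks for calling. How can I help?"
#       voice:
#         name: "Example Narrator"     # hypothetical voice name
#         provider: HUME_AI
#   format:
#     type: mp3
#   instant_mode: true
#   num_generations: 1
#   strip_headers: true                # chunks concatenate into one audio file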