API Reference
The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.
This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.
Table of Contents
- VoiceGenService
- Messages
- Enums
- Scalar Value Types
VoiceGenService
Service that implements the Cobalt VoiceGen API.
Version
Version(VersionRequest) VersionResponse
Returns version information from the server.
ListModels
ListModels(ListModelsRequest) ListModelsResponse
ListModels returns information about the models the server can access.
StreamingSynthesize
StreamingSynthesize(StreamingSynthesizeRequest) StreamingSynthesizeResponse
Performs text-to-speech synthesis and streams the synthesized audio. This method is only available via gRPC, not via HTTP+JSON. However, a web browser may use WebSockets to access this service.
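A minimal sketch of how a client might consume the response stream, assuming the generated Python bindings expose the call as an iterator of StreamingSynthesizeResponse messages. The dataclasses and `fake_stream` below are hypothetical stand-ins for the generated classes and the real gRPC call, so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


# Hypothetical stand-ins for the generated message classes.
@dataclass
class SynthesizedAudio:
    data: bytes = b""


@dataclass
class StreamingSynthesizeResponse:
    audio: SynthesizedAudio


def fake_stream(chunks: Iterable[bytes]) -> Iterator[StreamingSynthesizeResponse]:
    """Stand-in for the server stream; a real client would get this from the gRPC stub."""
    for chunk in chunks:
        yield StreamingSynthesizeResponse(audio=SynthesizedAudio(data=chunk))


def collect_audio(responses: Iterable[StreamingSynthesizeResponse]) -> bytes:
    """Concatenate the audio bytes carried by each response, in arrival order."""
    buffer = bytearray()
    for response in responses:
        buffer.extend(response.audio.data)
    return bytes(buffer)


audio = collect_audio(fake_stream([b"\x00\x01", b"\x02\x03", b"\x04"]))
print(len(audio))
```

A player that needs low latency would hand each chunk to the audio device as it arrives instead of buffering the whole stream.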
Messages
- If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of those fields populated.
- If a field is labeled repeated, then the generated code will accept an array (or a struct or list, depending on the language).
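The two labeling rules above can be sketched as a client-side check. `ExampleMessage` is a hypothetical message invented for illustration; its `config` and `text` fields are only assumed to form a oneof group, and real generated bindings enforce this themselves.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleMessage:
    # Hypothetical message: `config` and `text` are assumed to form a oneof group.
    config: Optional[str] = None
    text: Optional[str] = None
    # A `repeated` field maps naturally to a list in Python.
    tags: List[str] = field(default_factory=list)


def populated_fields(message: object, group: List[str]) -> List[str]:
    """Return the names in `group` that are set (not None) on `message`."""
    return [name for name in group if getattr(message, name) is not None]


def is_valid_oneof(message: object, group: List[str]) -> bool:
    """True when exactly one field of the oneof group is populated."""
    return len(populated_fields(message, group)) == 1


print(is_valid_oneof(ExampleMessage(text="hello", tags=["a", "b"]), ["config", "text"]))
```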
AudioFormat
Details of the audio format.
Fields
- sample_rate (uint32) Sampling rate in Hz.
- channels (uint32) Number of channels present in the audio, e.g. 1 (mono), 2 (stereo), etc.
- bit_depth (uint32) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.).
- codec (AudioCodec) Codec of the samples.
- encoding (AudioEncoding) Encoding of the samples.
- byte_order (ByteOrder) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.
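As an illustration of how these fields relate, the sketch below checks the byte_order rule and derives the data rate of uncompressed PCM. The `AudioFormat` dataclass is a simplified stand-in for the generated message (codec and encoding are omitted), and both helper functions are hypothetical, not part of the API.

```python
from dataclasses import dataclass
from enum import IntEnum


class ByteOrder(IntEnum):
    BYTE_ORDER_UNSPECIFIED = 0
    BYTE_ORDER_LITTLE_ENDIAN = 1
    BYTE_ORDER_BIG_ENDIAN = 2


@dataclass
class AudioFormat:
    # Simplified stand-in for the generated message.
    sample_rate: int
    channels: int
    bit_depth: int
    byte_order: ByteOrder = ByteOrder.BYTE_ORDER_UNSPECIFIED


def byte_order_is_valid(fmt: AudioFormat) -> bool:
    """byte_order must be specified whenever a sample spans more than one byte."""
    return fmt.bit_depth <= 8 or fmt.byte_order != ByteOrder.BYTE_ORDER_UNSPECIFIED


def bytes_per_second(fmt: AudioFormat) -> int:
    """Data rate of uncompressed PCM: sample_rate * channels * bytes per sample."""
    return fmt.sample_rate * fmt.channels * fmt.bit_depth // 8


fmt = AudioFormat(16000, 1, 16, ByteOrder.BYTE_ORDER_LITTLE_ENDIAN)
print(byte_order_is_valid(fmt), bytes_per_second(fmt))
```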
ListModelsRequest
The top-level message sent by the client for the ListModels method.
ListModelsResponse
The message returned to the client by the ListModels method.
Fields
- models (ModelInfo repeated) List of models available for use on the VoiceGen server.
ModelAttributes
Attributes of a VoiceGen Model
Fields
- language (string) Language of the model.
- phone_set (PhoneSet) The set of phonemes this model uses to represent how words should be pronounced.
- native_audio_format (AudioFormat) Native audio format of the model. This is used as the default if no audio format is specified in SynthesisConfig.
- supported_features (ModelFeatures) Supported model features.
- speakers (SpeakerInfo repeated) List of speakers available for use in this model.
ModelFeatures
Fields
- speech_rate (bool) This is set to true if the model can be configured to synthesize audio at different talking speeds.
- variation_scale (bool) This is set to true if the model can be configured to synthesize a given text input differently each time, by varying the stress and emphasis on different parts of the audio. This feature is useful for making the audio sound slightly different on each request so that it does not feel monotonous.
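One way a client could use these flags is to drop options the model does not support before building a request. This is only a sketch: `ModelFeatures` is a stand-in for the generated message and `supported_options` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModelFeatures:
    # Stand-in for the generated message.
    speech_rate: bool = False
    variation_scale: bool = False


def supported_options(features: ModelFeatures, requested: Dict[str, float]) -> Dict[str, float]:
    """Keep only the requested options that the model reports as supported."""
    return {
        name: value
        for name, value in requested.items()
        if getattr(features, name, False)
    }


features = ModelFeatures(speech_rate=True, variation_scale=False)
print(supported_options(features, {"speech_rate": 1.2, "variation_scale": 0.5}))
```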
ModelInfo
Description of a Cobalt VoiceGen Model
Fields
- id (string) Unique identifier of the model. This identifier is used to choose the model that should be used for synthesis, and is specified in the SynthesisConfig message.
- name (string) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their synthesis task.
- attributes (ModelAttributes) Model attributes.
SpeakerAttributes
Attributes of a speaker
Fields
- language (string) Language of the speaker. This can be different from the model language, e.g. an English model with speakers in different accents: en-US, en-GB, en-IN, etc.
SpeakerInfo
Description of a speaker
Fields
- id (string) Unique identifier of the speaker. This identifier is used to choose the speaker that should be used for synthesis, and is specified in the SynthesisConfig message.
- name (string) Speaker name. This is a concise name describing the speaker, and may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
- description (string) Speaker description. This may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
- attributes (SpeakerAttributes) Speaker attributes.
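The nesting of ModelInfo, ModelAttributes, SpeakerInfo and SpeakerAttributes can be illustrated by searching a model list for speakers of a given language. The dataclasses are trimmed stand-ins for the generated messages, and the model and speaker ids in the sample data are made up.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SpeakerAttributes:
    language: str = ""


@dataclass
class SpeakerInfo:
    id: str
    name: str = ""
    description: str = ""
    attributes: SpeakerAttributes = field(default_factory=SpeakerAttributes)


@dataclass
class ModelAttributes:
    language: str = ""
    speakers: List[SpeakerInfo] = field(default_factory=list)


@dataclass
class ModelInfo:
    id: str
    name: str = ""
    attributes: ModelAttributes = field(default_factory=ModelAttributes)


def find_speakers(models: List[ModelInfo], language: str) -> List[Tuple[str, str]]:
    """Return (model_id, speaker_id) pairs whose speaker language matches."""
    return [
        (model.id, speaker.id)
        for model in models
        for speaker in model.attributes.speakers
        if speaker.attributes.language == language
    ]


# Made-up sample data standing in for a ListModelsResponse.
models = [
    ModelInfo(
        id="model-1",
        name="English",
        attributes=ModelAttributes(
            language="en",
            speakers=[
                SpeakerInfo(id="spk-1", attributes=SpeakerAttributes("en-US")),
                SpeakerInfo(id="spk-2", attributes=SpeakerAttributes("en-GB")),
            ],
        ),
    )
]
print(find_speakers(models, "en-GB"))
```

The resulting pair supplies the model_id and speaker_id values for a SynthesisConfig.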
StreamingSynthesizeRequest
The top-level message sent by the client for the StreamingSynthesize method.
Fields
- config (SynthesisConfig)
- text (SynthesisText)
StreamingSynthesizeResponse
The top-level message sent by the server for the StreamingSynthesize method. In this streaming call, the server sends multiple StreamingSynthesizeResponse messages, each containing SynthesizedAudio.
Fields
- audio (SynthesizedAudio)
SynthesisConfig
Configuration for setting up a Synthesizer
Fields
- model_id (string) Unique identifier of the model to use, as obtained from a ModelInfo message.
- speaker_id (string) Unique identifier of the speaker to use, as obtained from a SpeakerInfo message.
- audio_format (AudioFormat) Format of the synthesized audio. If no value is specified, the native audio format of the specified model is used. The native audio format can be obtained from the ModelAttributes message.
- speech_rate (float) The speech rate for synthesized audio. If unset, the default speech rate of the given model is used. Otherwise a value > 0 should be used, with higher values resulting in faster speech. This field only has an effect on the synthesized audio if the model supports it, which can be determined from ModelAttributes.supported_features.
- variation_scale (float) A scale with values > 0 that determines how much to randomly vary the synthesized audio by altering the stress and emphasis on different parts of the audio. Higher values correspond to greater variation. This field only has an effect on the synthesized audio if the model supports it, which can be determined from ModelAttributes.supported_features.
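The defaulting behavior described for audio_format can be sketched from the client's point of view as follows. The dataclasses are simplified stand-ins for the generated messages, the helpers are hypothetical, and treating 0.0 as "unset" for the two float fields is an assumption based on how proto3 scalar defaults usually work rather than something this reference states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioFormat:
    # Simplified stand-in for the generated message.
    sample_rate: int
    channels: int = 1
    bit_depth: int = 16


@dataclass
class SynthesisConfig:
    model_id: str
    speaker_id: str
    audio_format: Optional[AudioFormat] = None
    speech_rate: float = 0.0      # 0.0 is treated as "unset" here (assumption)
    variation_scale: float = 0.0  # 0.0 is treated as "unset" here (assumption)


def effective_audio_format(config: SynthesisConfig, native: AudioFormat) -> AudioFormat:
    """Format the client should expect: the requested one, else the model's native format."""
    return config.audio_format if config.audio_format is not None else native


def scales_are_valid(config: SynthesisConfig) -> bool:
    """Both scales must be unset (0.0) or greater than zero, never negative."""
    return config.speech_rate >= 0.0 and config.variation_scale >= 0.0


native = AudioFormat(sample_rate=22050)
config = SynthesisConfig(model_id="model-1", speaker_id="spk-1", speech_rate=1.5)
print(effective_audio_format(config, native), scales_are_valid(config))
```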
SynthesisText
Text input to be sent to the synthesizer
Fields
- text (string)
SynthesizedAudio
Synthesized audio from the synthesizer
Fields
- data (bytes)
VersionRequest
The top-level message sent by the client for the Version method.
VersionResponse
The top-level message sent by the server for the Version method.
Fields
- version (string) Version of the server handling these requests.
Enums
AudioCodec
The codec of the audio data, i.e. whether the samples are sent as raw data or wrapped in a container such as WAV.
| Name | Number | Description |
|---|---|---|
| AUDIO_CODEC_UNSPECIFIED | 0 | AUDIO_CODEC_UNSPECIFIED is the default value of this type. |
| AUDIO_CODEC_RAW | 2 | Raw data without any headers |
| AUDIO_CODEC_WAV | 1 | WAV with RIFF headers |
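To make the difference between the two codecs concrete, the sketch below wraps headerless PCM bytes in a RIFF/WAV container using Python's standard `wave` module. It assumes little-endian integer PCM, which is what `wave` writes; `wrap_pcm_as_wav` is a hypothetical helper, not part of the API.

```python
import io
import wave


def wrap_pcm_as_wav(pcm: bytes, sample_rate: int, channels: int, bit_depth: int) -> bytes:
    """Prepend a RIFF/WAV header to raw little-endian PCM samples."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(bit_depth // 8)  # sample width in bytes
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


raw = b"\x00\x00\x01\x00\x02\x00"  # three 16-bit mono samples
wav_bytes = wrap_pcm_as_wav(raw, sample_rate=16000, channels=1, bit_depth=16)
print(wav_bytes[:4], len(wav_bytes) - len(raw))
```

With the raw codec the client must already know the sample rate, channel count and bit depth, whereas a WAV header carries them alongside the samples.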
AudioEncoding
The encoding of the audio data to be sent for synthesis.
| Name | Number | Description |
|---|---|---|
| AUDIO_ENCODING_UNSPECIFIED | 0 | AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error. |
| AUDIO_ENCODING_SIGNED | 1 | PCM signed-integer |
| AUDIO_ENCODING_UNSIGNED | 2 | PCM unsigned-integer |
| AUDIO_ENCODING_IEEE_FLOAT | 3 | PCM IEEE-Float |
| AUDIO_ENCODING_ULAW | 4 | G.711 mu-law |
| AUDIO_ENCODING_ALAW | 5 | G.711 a-law |
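For readers unfamiliar with the companded encodings, here is a sketch of expanding G.711 mu-law bytes to signed 16-bit linear PCM using the commonly used reference expansion. It is independent of the VoiceGen bindings, and the function names are illustrative.

```python
from typing import List


def ulaw_to_linear(byte_value: int) -> int:
    """Expand one G.711 mu-law byte (0..255) to a signed 16-bit linear sample."""
    inverted = ~byte_value & 0xFF  # mu-law bytes are stored bit-inverted
    sign = inverted & 0x80
    exponent = (inverted >> 4) & 0x07
    mantissa = inverted & 0x0F
    # 0x84 (132) is the bias added during compression.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> List[int]:
    """Decode a mu-law byte string into a list of linear PCM samples."""
    return [ulaw_to_linear(b) for b in data]


print(decode_ulaw(b"\xff\x00\x80"))
```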
ByteOrder
Byte order of multi-byte data
| Name | Number | Description |
|---|---|---|
| BYTE_ORDER_UNSPECIFIED | 0 | BYTE_ORDER_UNSPECIFIED is the default value of this type. |
| BYTE_ORDER_LITTLE_ENDIAN | 1 | Little Endian byte order |
| BYTE_ORDER_BIG_ENDIAN | 2 | Big Endian byte order |
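The effect of byte order on multi-byte samples can be seen with Python's standard `struct` module. This is a small illustration for signed 16-bit samples only; `pack_samples` is a hypothetical helper.

```python
import struct
from typing import List


def pack_samples(samples: List[int], little_endian: bool = True) -> bytes:
    """Serialize signed 16-bit samples in the requested byte order."""
    prefix = "<" if little_endian else ">"
    return struct.pack(f"{prefix}{len(samples)}h", *samples)


sample = 0x0102  # 258
print(pack_samples([sample], little_endian=True))   # low byte first
print(pack_samples([sample], little_endian=False))  # high byte first
```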
PhoneSet
PhoneSet is a set of phonemes used to describe how words are pronounced.
| Name | Number | Description |
|---|---|---|
| PHONE_SET_UNSPECIFIED | 0 | PHONE_SET_UNSPECIFIED is the default value of this type. |
| PHONE_SET_IPA | 1 | IPA phoneme set |
| PHONE_SET_XSAMPA | 2 | X-SAMPA phoneme set |
| PHONE_SET_ARPABET | 3 | ARPAbet phoneme set |
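As a rough illustration of how the three notations differ, the same word can be written in each phoneme set. The transcriptions of "cat" below are standard for those alphabets, but the exact syntax a given model expects (for example, stress markers or separators) is not specified here, so treat the strings as illustrative only.

```python
# The word "cat" written in each phoneme set listed above.
CAT_PRONUNCIATIONS = {
    "PHONE_SET_IPA": "kæt",
    "PHONE_SET_XSAMPA": "k{t",
    "PHONE_SET_ARPABET": "K AE T",
}


def pronunciation_for(phone_set: str) -> str:
    """Look up the sample transcription for a PhoneSet enum name."""
    if phone_set not in CAT_PRONUNCIATIONS:
        raise KeyError(f"no sample transcription for {phone_set}")
    return CAT_PRONUNCIATIONS[phone_set]


print(pronunciation_for("PHONE_SET_ARPABET"))
```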