API Reference

Detailed reference for API requests and types.

The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.

This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.

Table of Contents
VoiceGenService
Messages
Enums
Scalar Value Types

VoiceGenService

Service that implements the Cobalt VoiceGen API.

Version

Version(VersionRequest) VersionResponse

Returns version information from the server.

ListModels

ListModels(ListModelsRequest) ListModelsResponse

ListModels returns information about the models the server can access.

StreamingSynthesize

StreamingSynthesize(StreamingSynthesizeRequest) StreamingSynthesizeResponse

Performs text-to-speech synthesis and streams the synthesized audio. This method is available only via gRPC, not via HTTP+JSON. However, a web browser may use WebSockets to access this service.
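
Because the audio arrives as a stream of responses, a client usually concatenates the audio.data bytes of each response in arrival order. The sketch below illustrates that pattern. It does not use the real generated bindings: FakeAudio, FakeResponse and fake_stream are hypothetical stand-ins that mirror only the audio.data field described in this reference, so the block runs without a server.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeAudio:
    """Hypothetical stand-in for SynthesizedAudio (not generated code)."""
    data: bytes


@dataclass
class FakeResponse:
    """Hypothetical stand-in for StreamingSynthesizeResponse."""
    audio: FakeAudio


def collect_audio(responses: Iterable[FakeResponse]) -> bytes:
    """Concatenate audio.data from each streamed response, in arrival order."""
    buf = bytearray()
    for resp in responses:
        buf.extend(resp.audio.data)
    return bytes(buf)


def fake_stream() -> Iterator[FakeResponse]:
    """Simulates a server stream that yields three audio chunks."""
    for chunk in (b"\x00\x01", b"\x02\x03", b"\x04"):
        yield FakeResponse(audio=FakeAudio(data=chunk))


audio_bytes = collect_audio(fake_stream())
print(len(audio_bytes))
```

With real bindings, the iterable passed to collect_audio would be whatever the generated client stub returns for StreamingSynthesize. The stub's exact name and call signature depend on the generated code and are not shown here.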

Messages

If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of the fields populated.
If a field is labeled repeated, then the generated code will accept an array (or struct, or list, depending on the language).

AudioFormat

Details of the audio format.

Fields

sample_rate (uint32 ) Sampling rate in Hz.
channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc.
bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.).
codec (AudioCodec ) Codec of the samples.
encoding (AudioEncoding ) Encoding of the samples.
byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.
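
The byte_order rule and the relationship between sample_rate, channels and bit_depth can be sketched as follows. AudioFormatSketch is a hypothetical stand-in that carries only a subset of the AudioFormat fields; it is not the generated class. The byte-rate formula assumes uncompressed PCM whose bit depth is a multiple of 8.

```python
from dataclasses import dataclass
from enum import IntEnum


class ByteOrder(IntEnum):
    # Values copied from the ByteOrder enum in this reference.
    BYTE_ORDER_UNSPECIFIED = 0
    BYTE_ORDER_LITTLE_ENDIAN = 1
    BYTE_ORDER_BIG_ENDIAN = 2


@dataclass
class AudioFormatSketch:
    """Hypothetical stand-in with a subset of the AudioFormat fields."""
    sample_rate: int
    channels: int
    bit_depth: int
    byte_order: ByteOrder = ByteOrder.BYTE_ORDER_UNSPECIFIED


def check_byte_order(fmt: AudioFormatSketch) -> None:
    """Raise ValueError if byte_order is unspecified while bit_depth > 8."""
    if fmt.bit_depth > 8 and fmt.byte_order == ByteOrder.BYTE_ORDER_UNSPECIFIED:
        raise ValueError("byte_order must be set when bit_depth is greater than 8")


def bytes_per_second(fmt: AudioFormatSketch) -> int:
    """Data rate of uncompressed PCM audio (assumption: no compression)."""
    return fmt.sample_rate * fmt.channels * fmt.bit_depth // 8


fmt = AudioFormatSketch(16000, 1, 16, ByteOrder.BYTE_ORDER_LITTLE_ENDIAN)
check_byte_order(fmt)          # passes: byte order is set
print(bytes_per_second(fmt))   # 16000 * 1 * 16 / 8 = 32000
```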

ListModelsRequest

The top-level message sent by the client for the ListModels method.

ListModelsResponse

The message returned to the client by the ListModels method.

Fields

models (ModelInfo repeated) List of models available for use on the VoiceGen server.

ModelAttributes

Attributes of a VoiceGen Model

Fields

language (string ) Language of the model.
phone_set (PhoneSet ) The set of phonemes this model uses to represent how words should be pronounced.
native_audio_format (AudioFormat ) Native audio format of the model. This is used as the default if no audio format is specified in SynthesisConfig.
supported_features (ModelFeatures ) Supported model features.
speakers (SpeakerInfo repeated) List of speakers available for use in this model.

ModelFeatures

Fields

speech_rate (bool ) This is set to true if the model can be configured to synthesize audio at different talking speeds.
variation_scale (bool ) This is set to true if the model can be configured to synthesize audio for a given text input differently than usual by varying stresses, and emphasis on different parts of the audio. This feature is useful for making the audio sound slightly different each time to avoid making it feel monotonous.

ModelInfo

Description of a Cobalt VoiceGen Model

Fields

id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for synthesis, and is specified in the SynthesisConfig message.
name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their synthesis task.
attributes (ModelAttributes ) Model attributes.

SpeakerAttributes

Attributes of a speaker

Fields

language (string ) Language of the speaker. This can differ from the model language, e.g. an English model with speakers of different accents: en-US, en-GB, en-IN, etc.

SpeakerInfo

Description of a speaker

Fields

id (string ) Unique identifier of the speaker. This identifier is used to choose the speaker that should be used for synthesis, and is specified in the SynthesisConfig message.
name (string ) Speaker name. This is a concise name describing the speaker, and may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
description (string ) Speaker description. This may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
attributes (SpeakerAttributes ) Speaker attributes.
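
Since speakers are nested under each model's attributes, choosing a voice means walking models, then speakers, and keeping the (model id, speaker id) pair. The sketch below uses plain dictionaries shaped like the ModelInfo and SpeakerInfo messages. All ids, names and languages in it are invented for illustration, and find_speakers is a hypothetical helper, not part of the API.

```python
from typing import Dict, List, Tuple

# Hypothetical data shaped like the models field of a ListModelsResponse.
# The ids and names are invented for this example.
models: List[Dict] = [
    {
        "id": "model-en",
        "name": "English",
        "attributes": {
            "language": "en",
            "speakers": [
                {"id": "spk-1", "name": "Speaker 1",
                 "attributes": {"language": "en-US"}},
                {"id": "spk-2", "name": "Speaker 2",
                 "attributes": {"language": "en-GB"}},
            ],
        },
    },
]


def find_speakers(models: List[Dict], language: str) -> List[Tuple[str, str]]:
    """Return (model id, speaker id) pairs whose speaker language matches."""
    matches = []
    for model in models:
        for speaker in model["attributes"]["speakers"]:
            if speaker["attributes"]["language"] == language:
                matches.append((model["id"], speaker["id"]))
    return matches


print(find_speakers(models, "en-GB"))
```

The two ids in each pair are the values that would go into the model_id and speaker_id fields of SynthesisConfig.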

StreamingSynthesizeRequest

The top-level message sent by the client for the StreamingSynthesize method.

Fields

config (SynthesisConfig )
text (SynthesisText )

StreamingSynthesizeResponse

The top-level message sent by the server for the StreamingSynthesize method. In this streaming call, the server sends multiple StreamingSynthesizeResponse messages, each containing SynthesizedAudio.

Fields

audio (SynthesizedAudio )

SynthesisConfig

Configuration for setting up a Synthesizer

Fields

model_id (string ) Unique identifier of the model to use, as obtained from a ModelInfo message.
speaker_id (string ) Unique identifier of the speaker to use, as obtained from a SpeakerInfo message.
audio_format (AudioFormat ) Format of the synthesized audio. If no value is specified, the native audio format of the specified model is used. The native audio format can be obtained from the ModelAttributes message.
speech_rate (float ) The speech rate for synthesized audio. If unset, then the default speech rate of a given model is used. Otherwise a value > 0 should be used, with higher values resulting in faster speech. This field only has an effect on the synthesized audio if the model supports it, which can be ascertained from the ModelAttributes.supported_features.
variation_scale (float ) A scale with values > 0, to determine how much to randomly vary the synthesized audio by altering stresses and emphasis on different parts of the audio. Higher values correspond to greater variation. This field only has an effect on the synthesized audio if the model supports it, which can be ascertained from the ModelAttributes.supported_features.
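
A client can check speech_rate and variation_scale against the model's supported_features before sending a request. The sketch below is a hypothetical client-side helper, not server behavior. It takes plain dictionaries in place of the SynthesisConfig and ModelFeatures messages, and it assumes that a missing key or a value of 0 means "unset".

```python
from typing import Dict, List


def check_config(config: Dict[str, float], features: Dict[str, bool]) -> List[str]:
    """Return human-readable problems found in the optional tuning fields.

    Assumption: a missing key or a value of 0 means the field is unset,
    in which case the model default applies and nothing is reported.
    """
    problems = []
    for field in ("speech_rate", "variation_scale"):
        value = config.get(field, 0.0)
        if value == 0:
            continue  # unset: the model default is used
        if value < 0:
            problems.append(f"{field} must be > 0")
        elif not features.get(field, False):
            problems.append(f"{field} is not supported by this model")
    return problems


features = {"speech_rate": True, "variation_scale": False}
print(check_config({"speech_rate": 1.5, "variation_scale": 0.5}, features))
```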

SynthesisText

Text input to be sent to the synthesizer

Fields

text (string )

SynthesizedAudio

Synthesized audio from the synthesizer.

Fields

data (bytes )
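
When the requested codec is AUDIO_CODEC_RAW, the data bytes carry samples without headers, so a client that wants a playable file has to add a container itself. The sketch below wraps raw PCM in a WAV container using Python's standard wave module. It assumes signed 16-bit little-endian PCM, which is what a 16-bit WAV file stores. pcm_to_wav is a hypothetical helper, not part of the API.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int, channels: int, bit_depth: int) -> bytes:
    """Wrap headerless PCM samples in a RIFF/WAV container, in memory.

    Assumption: the samples are little-endian PCM, and signed when
    bit_depth is 16 (the layout the WAV format expects).
    """
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(bit_depth // 8)  # sample width in bytes
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return out.getvalue()


# Four 16-bit mono samples of silence stand in for synthesized audio.
wav_bytes = pcm_to_wav(b"\x00\x00" * 4, sample_rate=16000, channels=1, bit_depth=16)
print(wav_bytes[:4])
```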

VersionRequest

The top-level message sent by the client for the Version method.

VersionResponse

The top-level message sent by the server for the Version method.

Fields

version (string ) Version of the server handling these requests.

Enums

AudioCodec

The codec of the audio data, that is, whether the data carries container headers.

Name	Number	Description
AUDIO_CODEC_UNSPECIFIED	0	AUDIO_CODEC_UNSPECIFIED is the default value of this type.
AUDIO_CODEC_RAW	2	Raw data without any headers
AUDIO_CODEC_WAV	1	WAV with RIFF headers

AudioEncoding

The encoding of the audio data to be sent for synthesis.

Name	Number	Description
AUDIO_ENCODING_UNSPECIFIED	0	AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error.
AUDIO_ENCODING_SIGNED	1	PCM signed-integer
AUDIO_ENCODING_UNSIGNED	2	PCM unsigned-integer
AUDIO_ENCODING_IEEE_FLOAT	3	PCM IEEE-Float
AUDIO_ENCODING_ULAW	4	G.711 mu-law
AUDIO_ENCODING_ALAW	5	G.711 a-law

ByteOrder

Byte order of multi-byte data

Name	Number	Description
BYTE_ORDER_UNSPECIFIED	0	BYTE_ORDER_UNSPECIFIED is the default value of this type.
BYTE_ORDER_LITTLE_ENDIAN	1	Little Endian byte order
BYTE_ORDER_BIG_ENDIAN	2	Big Endian byte order
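
Byte order matters as soon as a sample spans more than one byte: the same bytes decode to different values under each setting. The sketch below decodes signed 16-bit samples with Python's struct module. decode_int16 is a hypothetical helper written for this illustration.

```python
import struct
from typing import List


def decode_int16(data: bytes, little_endian: bool) -> List[int]:
    """Decode signed 16-bit samples using the given byte order."""
    count = len(data) // 2  # ignore a trailing odd byte, if any
    fmt = ("<" if little_endian else ">") + f"{count}h"
    return list(struct.unpack(fmt, data[: count * 2]))


raw = bytes([0x01, 0x00, 0xFF, 0x7F])
print(decode_int16(raw, little_endian=True))   # [1, 32767]
print(decode_int16(raw, little_endian=False))  # [256, -129]
```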

PhoneSet

PhoneSet is a set of phonemes used for word pronunciations.

Name	Number	Description
PHONE_SET_UNSPECIFIED	0	PHONE_SET_UNSPECIFIED is the default value of this type.
PHONE_SET_IPA	1	IPA phoneme set
PHONE_SET_XSAMPA	2	X-SAMPA phoneme set
PHONE_SET_ARPABET	3	ARPAbet phoneme set

Scalar Value Types

.proto Type	C++ Type	C# Type	Go Type	Java Type	PHP Type	Python Type	Ruby Type
double	double	double	float64	double	float	float	Float
float	float	float	float32	float	float	float	Float
int32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
int64	int64	long	int64	long	integer/string	int/long	Bignum
uint32	uint32	uint	uint32	int	integer	int/long	Bignum or Fixnum (as required)
uint64	uint64	ulong	uint64	long	integer/string	int/long	Bignum or Fixnum (as required)
sint32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sint64	int64	long	int64	long	integer/string	int/long	Bignum
fixed32	uint32	uint	uint32	int	integer	int	Bignum or Fixnum (as required)
fixed64	uint64	ulong	uint64	long	integer/string	int/long	Bignum
sfixed32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sfixed64	int64	long	int64	long	integer/string	int/long	Bignum
bool	bool	bool	bool	boolean	boolean	boolean	TrueClass/FalseClass
string	string	string	string	String	string	str/unicode	String (UTF-8)
bytes	string	ByteString	[]byte	ByteString	string	str	String (ASCII-8BIT)
