API Reference

Detailed reference for API requests and types.

The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.

This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.

VoiceBioService

Service that implements the Cobalt VoiceBio API.

Version

Version(VersionRequest) VersionResponse

Returns version information from the server.

ListModels

ListModels(ListModelsRequest) ListModelsResponse

Returns information about the models available on the server.

StreamingEnroll

StreamingEnroll(StreamingEnrollRequest) StreamingEnrollResponse

Uses new audio data to perform enrollment of new users, or to update enrollment of existing users. Returns a new or updated voiceprint.

Clients should store the returned voiceprint against the ID of the user that provided the audio. This voiceprint can be provided later with Verify or Identify requests to match new audio against known speakers.

If this call is used to update an existing user’s voiceprint, the old voiceprint can be discarded and only the new one needs to be stored for that user.
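
The storage pattern described above can be sketched as a small map from user ID to the most recent voiceprint. The VoiceprintStore class and its method names are hypothetical and not part of the API; a real deployment would presumably use a database rather than a dictionary.

```python
from typing import Dict, Optional


class VoiceprintStore:
    """Hypothetical in-memory store: one voiceprint string per user ID."""

    def __init__(self) -> None:
        self._by_user: Dict[str, str] = {}

    def save(self, user_id: str, voiceprint_data: str) -> None:
        # A re-enrollment overwrites the old voiceprint; only the new one is kept.
        self._by_user[user_id] = voiceprint_data

    def get(self, user_id: str) -> Optional[str]:
        # Returns None for users that have never been enrolled.
        return self._by_user.get(user_id)


store = VoiceprintStore()
store.save("alice", "voiceprint-v1")
store.save("alice", "voiceprint-v2")  # updated enrollment replaces v1
```

The value saved here would be the data field of the Voiceprint message returned by StreamingEnroll.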

StreamingVerify

StreamingVerify(StreamingVerifyRequest) StreamingVerifyResponse

Compares audio data against the provided voiceprint and verifies whether or not the audio matches against the voiceprint.

StreamingIdentify

StreamingIdentify(StreamingIdentifyRequest) StreamingIdentifyResponse

Compares audio data against the provided list of voiceprints and identifies which of the voiceprints, if any, is a match for the given audio.

VectorizeVoiceprints

VectorizeVoiceprints(VectorizeVoiceprintsRequest) VectorizeVoiceprintsResponse

Converts the given voiceprints into numerical vector representations that can be used for various downstream tasks such as clustering, visualization, or as input features for other machine learning models. The specific format and dimensionality of these vectors may vary depending on the model used.

CompareVoiceprints

CompareVoiceprints(CompareVoiceprintsRequest) CompareVoiceprintsResponse

Compares pre-extracted voiceprints and returns similarity scores and match results without needing to send audio data. This is useful in cases where the user wants to compare a given voiceprint against a large number of other voiceprints, and sending audio data for each comparison would be inefficient. The client can enroll the voiceprint once using the StreamingEnroll method, and then use this method to compare it against a large number of other voiceprints in batches.
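
As a rough illustration of the batching mentioned above, the helper below splits a long list of reference voiceprints into fixed-size batches, each of which could go into one CompareVoiceprintsRequest. The function name and the batch size are assumptions; the API itself does not prescribe a batch size.

```python
from typing import Iterator, List, Sequence, Tuple


def batches(items: Sequence[str], size: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_offset, batch) pairs with at most `size` items per batch."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield start, list(items[start:start + size])


# Hypothetical voiceprint strings, batched two at a time.
reference = ["vp-a", "vp-b", "vp-c", "vp-d", "vp-e"]
plan = list(batches(reference, 2))
```

The start offset is returned so that a non-negative best_match_index from one batch can be mapped back to the full list as start_offset + best_match_index.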

Messages

If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of the fields populated.
If a field is labeled repeated, then the generated code will accept an array (or struct, or list depending on the language).
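
The oneof rule can be illustrated with a plain Python stand-in. RequestSketch and which_oneof are hypothetical names used only for this sketch; with real generated Python bindings, protobuf messages expose a WhichOneof method that serves a similar purpose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestSketch:
    """Stand-in for a message with a oneof over `config` and `audio`."""
    config: Optional[dict] = None
    audio: Optional[bytes] = None


def which_oneof(req: RequestSketch) -> str:
    """Return the name of the single populated field, or raise ValueError."""
    populated = [name for name in ("config", "audio")
                 if getattr(req, name) is not None]
    if len(populated) != 1:
        raise ValueError(f"exactly one field must be set, got {populated}")
    return populated[0]
```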

Audio

Audio to be sent to VoiceBio.

Fields

data (bytes )

AudioFormat

Format of the audio to be sent to VoiceBio.

Depending on how they are configured, server instances of this service may not support all the formats provided in the API. One format that is guaranteed to be supported is the RAW format with little-endian 16-bit signed samples with the sample rate matching that of the model being requested.
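
For clients that hold audio as floating-point samples, the guaranteed RAW format can be produced with the standard library alone. The function below is a minimal sketch; its name and the assumption that input samples lie in [-1.0, 1.0] are not part of the API. Resampling to the model's sample rate is out of scope here.

```python
import struct
from typing import Sequence


def floats_to_pcm16le(samples: Sequence[float]) -> bytes:
    """Convert samples in [-1.0, 1.0] to little-endian 16-bit signed PCM."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip out-of-range values
        ints.append(int(round(s * 32767)))  # scale to the int16 range
    # "<" selects little-endian byte order, "h" is a 16-bit signed integer.
    return struct.pack("<%dh" % len(ints), *ints)
```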

Fields

oneof audio_format.audio_format_raw (AudioFormatRAW ) Audio is raw data without any headers
oneof audio_format.audio_format_headered (AudioFormatHeadered ) Audio has a self-describing header. Headers are expected to be sent at the beginning of the entire audio file/stream, and not in every Audio message.

The default value of this type is AUDIO_FORMAT_HEADERED_UNSPECIFIED. If this value is used, the server may attempt to detect the format of the audio. However, it is recommended that the exact format be specified.

AudioFormatRAW

Details of audio in raw format

Fields

encoding (AudioEncoding ) Encoding of the samples. It must be specified explicitly; using the default value of AUDIO_ENCODING_UNSPECIFIED will result in an error.
bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.). This is a required field.
byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.
sample_rate (uint32 ) Sampling rate in Hz. This is a required field.
channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc. This is a required field.
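
The field rules above can be checked on the client before a request is sent. The sketch below uses a plain dataclass (RawFormatSketch, a hypothetical stand-in for the generated AudioFormatRAW class) and the enum numbers listed in the Enums section. Treating a zero value as "not set" for the required numeric fields is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List

# Enum numbers as listed in the Enums section of this reference.
AUDIO_ENCODING_UNSPECIFIED = 0
AUDIO_ENCODING_SIGNED = 1
BYTE_ORDER_UNSPECIFIED = 0
BYTE_ORDER_LITTLE_ENDIAN = 1


@dataclass
class RawFormatSketch:
    encoding: int = AUDIO_ENCODING_UNSPECIFIED
    bit_depth: int = 0
    byte_order: int = BYTE_ORDER_UNSPECIFIED
    sample_rate: int = 0
    channels: int = 0


def validate_raw_format(fmt: RawFormatSketch) -> List[str]:
    """Return a list of problems; an empty list means the format looks valid."""
    problems = []
    if fmt.encoding == AUDIO_ENCODING_UNSPECIFIED:
        problems.append("encoding must be specified")
    if fmt.bit_depth == 0:
        problems.append("bit_depth is required")
    # Byte order only matters when a sample spans more than one byte.
    if fmt.bit_depth > 8 and fmt.byte_order == BYTE_ORDER_UNSPECIFIED:
        problems.append("byte_order is required when bit_depth > 8")
    if fmt.sample_rate == 0:
        problems.append("sample_rate is required")
    if fmt.channels == 0:
        problems.append("channels is required")
    return problems
```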

CompareVoiceprintsRequest

The top-level message sent by the client for the CompareVoiceprints method. This is similar to StreamingIdentifyRequest, but operates on pre-extracted voiceprints without sending any audio data.

Fields

model_id (string ) ID of the model to use for comparison. The model used for comparison must match with the model used for enrollment of the voiceprints. A list of supported IDs can be found using the ListModels call.
target_voiceprint (Voiceprint ) The voiceprint to compare against the reference voiceprints.
reference_voiceprints (Voiceprint repeated) Voiceprints that should be compared against the target voiceprint.

CompareVoiceprintsResponse

The message returned by the server for the CompareVoiceprints method. This contains the similarity scores and match results for comparing the target voiceprint against each of the reference voiceprints, as well as the index of the best matching voiceprint in the reference list, if any of them is a match. This is similar to StreamingIdentifyResponse, but operates on pre-extracted voiceprints without sending any audio data.

Fields

best_match_index (int32 ) Index (0-based) of the best matching voiceprint in the list of reference voiceprints provided in the CompareVoiceprintsRequest message. If none of the voiceprints was a match, a negative value is returned.
voiceprint_comparison_results (VoiceprintComparisonResult repeated) Result of comparing the given target voiceprint against each of the reference voiceprints. The order of this list is the same as the reference voiceprint list provided in the CompareVoiceprintsRequest message.
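
Because the results keep the order of the request, best_match_index can be mapped back to whatever identifiers the client keeps alongside the reference voiceprints. The helper below is a minimal sketch; the function name and the parallel user_ids list are assumptions about how a client might organise its data.

```python
from typing import Optional, Sequence


def best_match_user(best_match_index: int,
                    user_ids: Sequence[str]) -> Optional[str]:
    """Map best_match_index to a user ID; a negative index means no match."""
    if best_match_index < 0:
        return None
    # user_ids must be in the same order as the reference voiceprints sent.
    return user_ids[best_match_index]
```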

EnrollmentConfig

Configuration for Enrollment of speakers.

Fields

model_id (string ) ID of the model to use for enrollment. A list of supported IDs can be found using the ListModels call.
audio_format (AudioFormat ) Format of the audio to be sent for enrollment.
previous_voiceprint (Voiceprint ) Leave empty for new users. To re-enroll an existing user with new audio data, set this to that user’s previous voiceprint. The previous voiceprint must have been generated using the same model as specified in this config.

EnrollmentStatus

The message returned as part of StreamingEnrollResponse, to provide information about whether the voiceprint is sufficiently trained.

Fields

enrollment_complete (bool ) Whether sufficient data has been provided as part of this user’s enrollment. If this is false, more audio should be collected from the user and re-enrollment should be done. If this is true, it is still OK to enroll more data for the same user to update the voiceprint.
additional_audio_required_seconds (uint32 ) If enrollment is not yet complete, the number of additional seconds of the user’s speech required to complete the enrollment. If enrollment is completed successfully, this value will be set to 0.
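
One way to act on these two fields is to keep re-enrolling with the previous voiceprint until enrollment_complete is true. The loop below is a sketch under stated assumptions: the enroll callable is a hypothetical wrapper around a full StreamingEnroll call, and fake_enroll is a stand-in used only to make the example runnable without a server.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical wrapper signature: (audio_clip, previous_voiceprint) ->
# (new_voiceprint, enrollment_complete, additional_audio_required_seconds)
EnrollFn = Callable[[bytes, str], Tuple[str, bool, int]]


def enroll_until_complete(clips: Iterable[bytes],
                          enroll: EnrollFn) -> Tuple[str, bool]:
    """Feed clips one by one, passing the latest voiceprint back each time."""
    voiceprint = ""  # empty for a new user
    complete = False
    for clip in clips:
        voiceprint, complete, _remaining = enroll(clip, voiceprint)
        if complete:
            break  # enough audio collected; remaining clips are not sent
    return voiceprint, complete


def fake_enroll(clip: bytes, previous: str) -> Tuple[str, bool, int]:
    """Stand-in for the server: 'complete' once 5 bytes have been seen in total."""
    total = (int(previous) if previous else 0) + len(clip)
    return str(total), total >= 5, max(0, 5 - total)


result = enroll_until_complete([b"ab", b"cd", b"ef", b"gh"], fake_enroll)
```

If the loop ends with complete still false, the client would need to collect more audio from the user before relying on the voiceprint.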

IdentificationConfig

Configuration for Identification of a speaker.

Fields

model_id (string ) ID of the model to use for identification. A list of supported IDs can be found using the ListModels call. The model used for identification must match the model used for enrollment.
audio_format (AudioFormat ) Format of the audio to be sent for identification.
voiceprints (Voiceprint repeated) Voiceprints of potential speakers that need to be identified in the given audio.

ListModelsRequest

The top-level message sent by the client for the ListModels method.

ListModelsResponse

The message returned to the client by the ListModels method.

Fields

models (Model repeated) List of models available for use that match the request.

Model

Description of a VoiceBio model.

Fields

id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for enrollment, verification or identification requests. This ID needs to be specified in the respective config messages for these requests.
name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their voicebio task.
attributes (ModelAttributes ) Model Attributes

ModelAttributes

Attributes of a VoiceBio model

Fields

sample_rate (uint32 ) Audio sample rate (native) supported by the model
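
Since raw audio is expected at the model's native sample rate, a client may want to filter the ListModels output by the rate of the audio it has. The sketch below uses a plain dataclass, ModelSketch, as a hypothetical stand-in for the generated Model class, with sample_rate flattened out of ModelAttributes for brevity. The model IDs shown are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ModelSketch:
    id: str
    name: str
    sample_rate: int  # taken from ModelAttributes.sample_rate


def model_ids_for_rate(models: Sequence[ModelSketch], rate_hz: int) -> List[str]:
    """Return the IDs of models whose native sample rate equals rate_hz."""
    return [m.id for m in models if m.sample_rate == rate_hz]


# Hypothetical models; real IDs come from the ListModels call.
available = [
    ModelSketch(id="m8k", name="Telephony", sample_rate=8000),
    ModelSketch(id="m16k", name="Wideband", sample_rate=16000),
]
```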

StreamingEnrollRequest

The top-level message sent by the client for the StreamingEnroll method. In this streaming call, multiple StreamingEnrollRequest messages should be sent. The first message must contain an EnrollmentConfig message, and all subsequent messages must contain Audio only. All Audio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.
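
The message ordering described above (config first, then non-empty audio chunks) can be sketched as a generator. To stay independent of any particular generated bindings, this sketch yields (field_name, value) tuples instead of real StreamingEnrollRequest objects; the function name, the chunk size, and the config dictionary are assumptions.

```python
import io
from typing import BinaryIO, Iterator, Tuple, Union


def request_stream(config: dict, audio: BinaryIO,
                   chunk_size: int = 4096) -> Iterator[Tuple[str, Union[dict, bytes]]]:
    """Yield the config once, then the audio in non-empty chunks."""
    yield ("config", config)
    while True:
        chunk = audio.read(chunk_size)
        if not chunk:
            break  # never emit an empty chunk; the server may treat it as end of stream
        yield ("audio", chunk)


# Hypothetical model ID and six bytes of fake audio, read four bytes at a time.
messages = list(request_stream({"model_id": "1"}, io.BytesIO(b"abcdef"), 4))
```

With real bindings, each tuple would presumably become one request message with the matching oneof field set, and the generator would be passed to the streaming call.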

Fields

oneof request.config (EnrollmentConfig )
oneof request.audio (Audio )

StreamingEnrollResponse

The message returned by the server for the StreamingEnroll method.

Fields

voiceprint (Voiceprint )
enrollment_status (EnrollmentStatus )

StreamingIdentifyRequest

The top-level message sent by the client for the StreamingIdentify method. In this streaming call, multiple StreamingIdentifyRequest messages should be sent. The first message must contain an IdentificationConfig message, and all subsequent messages must contain Audio only. All Audio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.

Fields

oneof request.config (IdentificationConfig )
oneof request.audio (Audio )

StreamingIdentifyResponse

The message returned by the server for the StreamingIdentify method.

Fields

best_match_index (int32 ) Index (0-based) of the best matching voiceprint in the list of input voiceprints provided in the IdentificationConfig message. If none of the voiceprints was a match, a negative value is returned.
voiceprint_comparison_results (VoiceprintComparisonResult repeated) Result of comparing the given audio against each of the input voiceprints. The order of this list is the same as the input voiceprint list provided in the IdentificationConfig message.

StreamingVerifyRequest

The top-level message sent by the client for the StreamingVerify method. In this streaming call, multiple StreamingVerifyRequest messages should be sent. The first message must contain a VerificationConfig message, and all subsequent messages must contain Audio only. All Audio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.

Fields

oneof request.config (VerificationConfig )
oneof request.audio (Audio )

StreamingVerifyResponse

The message returned by the server for the StreamingVerify method.

Fields

result (VoiceprintComparisonResult )

VectorVoiceprint

Voiceprint represented in vector form. The specific format and dimensionality of this vector may vary depending on the model used. The VectorizeVoiceprints method can be used to convert a Voiceprint into a VectorVoiceprint representation.

Fields

data (float repeated) List of floating point values representing the voiceprint in vector form.

VectorizeVoiceprintsRequest

The top-level message sent by the client for the VectorizeVoiceprints method.

Fields

model_id (string ) ID of the model to use for vectorization. The model used for vectorization must match with the model used for enrollment of the voiceprints. A list of supported IDs can be found using the ListModels call.
voiceprints (Voiceprint repeated) Voiceprints to be vectorized.

VectorizeVoiceprintsResponse

The message returned by the server for the VectorizeVoiceprints method.

Fields

voiceprints (VectorVoiceprint repeated) Voiceprint data converted into a vector representation, which can be used for various downstream tasks such as clustering, visualization, or as input features for other machine learning models. The specific format and dimensionality of these vectors may vary depending on the model used.

The order of this list is the same as the input voiceprint list provided in the VectorizeVoiceprintsRequest message.
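
As one example of a downstream use, the vectors could be compared with cosine similarity before clustering or visualization. This is a generic technique chosen for illustration; it is not necessarily how the server computes similarity_score, and whether cosine similarity is meaningful for a given model's vectors is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The inputs would be the data lists of two VectorVoiceprint messages produced with the same model.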

VerificationConfig

Configuration for Verification of a speaker.

Fields

model_id (string ) ID of the model to use for verification. A list of supported IDs can be found using the ListModels call. The model used for verification must match the model used for enrollment.
audio_format (AudioFormat ) Format of the audio to be sent for verification.
voiceprint (Voiceprint ) Voiceprint with which audio should be compared.

VersionRequest

The top-level message sent by the client for the Version method.

VersionResponse

The message sent by the server for the Version method.

Fields

version (string ) Version of the server handling these requests.

Voiceprint

Voiceprint extracted from a user’s audio.

Fields

data (string ) Voiceprint data serialized to a string.

VoiceprintComparisonResult

Message describing the result of comparing a voiceprint against given audio.

Fields

is_match (bool ) Whether or not the audio successfully matches with the provided voiceprint.
similarity_score (float ) Similarity score representing how closely the audio matched against the voiceprint. This score could be any negative or positive number. Lower value suggests that the audio and voiceprints are less similar, whereas a higher value indicates more similarity. The is_match field can be used to actually decide if the result should be considered a valid match.
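
Because the score has no fixed range, a client would typically let is_match decide which candidates count and use similarity_score only to order them. The sketch below shows that split; ComparisonSketch is a hypothetical stand-in for the generated VoiceprintComparisonResult class.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ComparisonSketch:
    is_match: bool
    similarity_score: float


def ranked_matches(results: Sequence[ComparisonSketch]) -> List[Tuple[int, float]]:
    """Return (index, score) for matching results, highest score first."""
    matches = [(i, r.similarity_score)
               for i, r in enumerate(results) if r.is_match]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# A high score alone is not treated as a match; only is_match decides.
example = [
    ComparisonSketch(is_match=False, similarity_score=3.0),
    ComparisonSketch(is_match=True, similarity_score=-0.5),
    ComparisonSketch(is_match=True, similarity_score=2.0),
]
```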

Enums

AudioEncoding

The encoding of the audio data to be sent to VoiceBio.

Name	Number	Description
AUDIO_ENCODING_UNSPECIFIED	0	AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error.
AUDIO_ENCODING_SIGNED	1	PCM signed-integer
AUDIO_ENCODING_UNSIGNED	2	PCM unsigned-integer
AUDIO_ENCODING_IEEE_FLOAT	3	PCM IEEE-Float
AUDIO_ENCODING_ULAW	4	G.711 mu-law
AUDIO_ENCODING_ALAW	5	G.711 a-law

AudioFormatHeadered

Name	Number	Description
AUDIO_FORMAT_HEADERED_UNSPECIFIED	0	AUDIO_FORMAT_HEADERED_UNSPECIFIED is the default value of this type.
AUDIO_FORMAT_HEADERED_WAV	1	WAV with RIFF headers
AUDIO_FORMAT_HEADERED_MP3	2	MP3 format with a valid frame header at the beginning of data
AUDIO_FORMAT_HEADERED_FLAC	3	FLAC format
AUDIO_FORMAT_HEADERED_OGG_OPUS	4	Opus format with OGG header

ByteOrder

Byte order of multi-byte data

Name	Number	Description
BYTE_ORDER_UNSPECIFIED	0	BYTE_ORDER_UNSPECIFIED is the default value of this type.
BYTE_ORDER_LITTLE_ENDIAN	1	Little Endian byte order
BYTE_ORDER_BIG_ENDIAN	2	Big Endian byte order

Scalar Value Types

.proto Type	C++ Type	C# Type	Go Type	Java Type	PHP Type	Python Type	Ruby Type
double	double	double	float64	double	float	float	Float
float	float	float	float32	float	float	float	Float
int32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
int64	int64	long	int64	long	integer/string	int/long	Bignum
uint32	uint32	uint	uint32	int	integer	int/long	Bignum or Fixnum (as required)
uint64	uint64	ulong	uint64	long	integer/string	int/long	Bignum or Fixnum (as required)
sint32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sint64	int64	long	int64	long	integer/string	int/long	Bignum
fixed32	uint32	uint	uint32	int	integer	int	Bignum or Fixnum (as required)
fixed64	uint64	ulong	uint64	long	integer/string	int/long	Bignum
sfixed32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sfixed64	int64	long	int64	long	integer/string	int/long	Bignum
bool	bool	bool	bool	boolean	boolean	boolean	TrueClass/FalseClass
string	string	string	string	String	string	str/unicode	String (UTF-8)
bytes	string	ByteString	[]byte	ByteString	string	str	String (ASCII-8BIT)