API Reference

Detailed reference for API requests and types.

The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.

This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.

TranscribeService

Service that implements the Cobalt Transcribe Speech Recognition API.

Version

Version(VersionRequest) VersionResponse

Queries the version of the server.

ListModels

ListModels(ListModelsRequest) ListModelsResponse

Retrieves a list of available speech recognition models.

StreamingRecognize

StreamingRecognize(StreamingRecognizeRequest) StreamingRecognizeResponse

Performs bidirectional streaming speech recognition. Results are returned while audio is still being sent. This method is only available via gRPC and not via HTTP+JSON. However, a web browser may use websockets to use this service.

CompileContext

CompileContext(CompileContextRequest) CompileContextResponse

Compiles recognition context information, such as a specialized list of words or phrases, into a compact, efficient form to send with subsequent StreamingRecognize requests to customize speech recognition. For example, a list of contact names may be compiled in a mobile app and sent with each recognition request so that the app user’s contact names are more likely to be recognized than arbitrary names. Pre-compiling the context avoids adding latency to the recognition request. To compile context for a model, the model must support context, which can be verified by checking its ModelAttributes.ContextInfo obtained via the ListModels method. The compiled data is model-specific: data compiled for one model will generally not be usable with a different model.

Messages

If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of the fields populated.
If a field is labeled repeated, then the generated code will accept an array (or struct, or list depending on the language).

AudioFormatRAW

Details of audio in raw format

Fields

encoding (AudioEncoding ) Encoding of the samples. It must be specified explicitly and using the default value of AUDIO_ENCODING_UNSPECIFIED will result in an error.
bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.). This is a required field.
byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.
sample_rate (uint32 ) Sampling rate in Hz. This is a required field.
channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc. This is a required field.
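
The field rules above can be checked on the client before a request is sent. The following is a minimal sketch, not the generated API: RawFormat and raw_format_problems are hypothetical names, the constants are stand-ins that reuse the numbers from the Enums section, and "required" is assumed to mean "non-zero".

```python
from dataclasses import dataclass

# Hypothetical stand-ins for generated enum values (numbers from the Enums section).
AUDIO_ENCODING_UNSPECIFIED = 0
AUDIO_ENCODING_SIGNED = 1
BYTE_ORDER_UNSPECIFIED = 0
BYTE_ORDER_LITTLE_ENDIAN = 1


@dataclass
class RawFormat:
    """Plain-Python mirror of AudioFormatRAW; not the generated class."""
    encoding: int = AUDIO_ENCODING_UNSPECIFIED
    bit_depth: int = 0
    byte_order: int = BYTE_ORDER_UNSPECIFIED
    sample_rate: int = 0
    channels: int = 0


def raw_format_problems(fmt: RawFormat) -> list:
    """Return the documented rule violations; an empty list means none were found."""
    problems = []
    if fmt.encoding == AUDIO_ENCODING_UNSPECIFIED:
        problems.append("encoding must be set explicitly")
    # Assumption: a required numeric field is treated as missing when it is 0.
    if fmt.bit_depth == 0:
        problems.append("bit_depth is required")
    if fmt.bit_depth > 8 and fmt.byte_order == BYTE_ORDER_UNSPECIFIED:
        problems.append("byte_order is required when bit_depth > 8")
    if fmt.sample_rate == 0:
        problems.append("sample_rate is required")
    if fmt.channels == 0:
        problems.append("channels is required")
    return problems
```

For 16-bit signed little-endian mono audio at 16 kHz, raw_format_problems(RawFormat(AUDIO_ENCODING_SIGNED, 16, BYTE_ORDER_LITTLE_ENDIAN, 16000, 1)) returns an empty list.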

CompileContextRequest

The top-level message sent by the client for the CompileContext request. It contains a list of phrases or words, paired with a context token included in the model being used. The token specifies a category such as “menu_item”, “airport”, “contact”, “product_name” etc. The context token is used to determine the places in the recognition output where the provided list of phrases or words may appear. The allowed context tokens for a given model can be found in its ModelAttributes.ContextInfo obtained via the ListModels method.

Fields

model_id (string ) Unique identifier of the model to compile the context information for. The model chosen needs to support context which can be verified by checking its ModelAttributes.ContextInfo obtained via ListModels.
token (string ) The token that is associated with the provided list of phrases or words (e.g. “menu_item”, “airport” etc.). Must be one of the tokens included in the model being used, which can be retrieved by calling the ListModels method.
phrases (ContextPhrase repeated) List of phrases and/or words to be compiled.
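
Because a CompileContext request is only valid for a model that supports context and for a token that the model allows, a client may want to check both before calling the method. The sketch below uses a hypothetical ContextInfoView dataclass in place of the generated ContextInfo message, and check_compile_request is an illustrative helper, not part of the API.

```python
from dataclasses import dataclass, field


@dataclass
class ContextInfoView:
    """Plain-Python mirror of ContextInfo; not the generated class."""
    supports_context: bool = False
    allowed_context_tokens: list = field(default_factory=list)


def check_compile_request(info: ContextInfoView, token: str) -> None:
    """Raise ValueError if the model or token cannot be used with CompileContext."""
    if not info.supports_context:
        raise ValueError("model does not support context")
    if token not in info.allowed_context_tokens:
        raise ValueError(
            f"token {token!r} is not allowed; expected one of {info.allowed_context_tokens}"
        )
```

In real code the values would come from the ModelAttributes.ContextInfo of the model returned by ListModels.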

CompileContextResponse

The message returned to the client by the CompileContext method.

Fields

context (CompiledContext ) Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.

CompiledContext

Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.

Fields

data (bytes ) The context information compiled by the CompileContext method.

ConfusionNetworkArc

An Arc inside a Confusion Network Link

Fields

word (string ) Word in the recognized transcript
confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.
features (ConfusionNetworkArcFeatures ) Features related to this arc

ConfusionNetworkArcFeatures

Features related to confusion network arcs

Fields

confidence (map ConfusionNetworkArcFeatures.ConfidenceEntry repeated) A map of features that are used for recalculating confidence scores of this confusion network arc

ConfusionNetworkArcFeatures.ConfidenceEntry

Fields

key (string )
value (double )

ConfusionNetworkLink

A Link inside a confusion network

Fields

start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this link
duration_ms (uint64 ) Duration in milliseconds of the current link in the confusion network
arcs (ConfusionNetworkArc repeated) Arcs within this link
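
One common way to read a confusion network is to take the highest-confidence arc from each link in order. The sketch below shows that idea with hypothetical Arc and Link dataclasses standing in for the generated ConfusionNetworkArc and ConfusionNetworkLink messages. Whether this path always matches the server's own top transcript is not stated in this reference, so treat it as an illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Arc:
    """Stand-in for ConfusionNetworkArc (word and confidence only)."""
    word: str
    confidence: float


@dataclass
class Link:
    """Stand-in for ConfusionNetworkLink."""
    start_time_ms: int
    duration_ms: int
    arcs: list = field(default_factory=list)


def best_path(links):
    """Return (start_time_ms, word, confidence) for the top arc of each link."""
    path = []
    for link in links:
        if not link.arcs:
            # Nothing to choose from in this link.
            continue
        best = max(link.arcs, key=lambda arc: arc.confidence)
        path.append((link.start_time_ms, best.word, best.confidence))
    return path
```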

ContextInfo

Model information specific to supporting recognition context.

Fields

supports_context (bool ) If this is set to true, the model supports taking context information into account to aid speech recognition. The information may be sent with recognition requests via RecognitionContext inside RecognitionConfig.
allowed_context_tokens (string repeated) A list of tokens (e.g. “name”, “airport” etc.) that serve as placeholders in the model where a client-provided list of phrases or words may be used to aid speech recognition and produce the exact desired recognition output.

ContextPhrase

A phrase or word that is to be compiled into context information that can be later used to improve speech recognition during a StreamingRecognize call. Along with the phrase or word itself, there is an optional boost parameter that can be used to boost the likelihood of the phrase or word in the recognition output.

Fields

text (string ) The actual phrase or word.
boost (float ) This is an optional field. The boost factor is a positive number which is used to multiply the probability of the phrase or word appearing in the output. This setting can be used to differentiate between similar sounding words, with the desired word given a bigger boost factor.

By default, all phrases or words provided in the RecognitionContext are given an equal probability of occurring. Boost factors larger than 1 make the phrase or word more probable and boost factors less than 1 make it less likely. A boost factor of 2 corresponds to making the phrase or word twice as likely, while a boost factor of 0.5 means half as likely.
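
The boost arithmetic can be illustrated with a small calculation. This sketch starts from equal weights, multiplies each by its boost factor, and normalizes so the shares sum to 1. The normalization step and the name relative_shares are assumptions made for illustration; the reference does not describe how the server applies boosts internally.

```python
def relative_shares(boosts):
    """Map each phrase to its share of probability after boosting.

    `boosts` maps phrase -> boost factor; every phrase starts with weight 1.
    """
    weights = {phrase: 1.0 * boost for phrase, boost in boosts.items()}
    total = sum(weights.values())
    return {phrase: weight / total for phrase, weight in weights.items()}


shares = relative_shares({"Ann": 2.0, "Anne": 1.0})
# "Ann" ends up twice as likely as "Anne", matching a boost factor of 2.
ratio = shares["Ann"] / shares["Anne"]
```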

ListModelsRequest

The top-level message sent by the client for the ListModels method.

ListModelsResponse

The message returned to the client by the ListModels method.

Fields

models (Model repeated) List of models available for use that match the request.

Model

Description of a Transcribe Model

Fields

id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RecognitionConfig message.
name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their recognition task.
attributes (ModelAttributes ) Model attributes

ModelAttributes

Attributes of a Transcribe Model

Fields

sample_rate (uint32 ) Audio sample rate supported by the model
context_info (ContextInfo ) Attributes specific to supporting recognition context.

RecognitionAlternative

A recognition hypothesis

Fields

transcript_formatted (string ) Text representing the transcription of the words that the user spoke.

The transcript will be formatted according to the server’s formatting configuration. If you want the raw transcript, please see the field transcript_raw. If the server is configured to not use any formatting, then this field will contain the raw transcript.

As an example, if the spoken utterance was “four people”, and the server was configured to format numbers, this field would be set to “4 people”.
transcript_raw (string ) Text representing the transcription of the words that the user spoke, without any formatting applied. If you want the formatted transcript, please see the field transcript_formatted.

As an example, if the spoken utterance was “four people”, this field would be set to “FOUR PEOPLE”.
start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this utterance.
duration_ms (uint64 ) Duration in milliseconds of the current utterance in the spoken audio.
confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct.
word_details (WordDetails ) Word-level details corresponding to the transcripts. This is available only if enable_word_details was set to true in the RecognitionConfig.

RecognitionAudio

Audio to be sent to the recognizer

Fields

data (bytes )

RecognitionConfig

Configuration for setting up a Recognizer

Fields

model_id (string ) Unique identifier of the model to use, as obtained from a Model message.
oneof audio_format.audio_format_raw (AudioFormatRAW ) Audio is raw data without any headers
oneof audio_format.audio_format_headered (AudioFormatHeadered ) Audio has a self-describing header. Headers are expected to be sent at the beginning of the entire audio file/stream, and not in every RecognitionAudio message.

The default value of this type is AUDIO_FORMAT_HEADERED_UNSPECIFIED. If this value is used, the server may attempt to detect the format of the audio. However, it is recommended that the exact format be specified.
selected_audio_channels (uint32 repeated) This is an optional field. If the audio has multiple channels, this field can be configured with the list of channel indices that should be considered for the recognition task. These channels are 0-indexed.

Example: [0] for a mono file, [0, 1] for a stereo file. Example: [1] to only transcribe the second channel of a stereo file.

If this field is not set, all the channels in the audio will be processed.

Channels that are present in the audio may be omitted, but it is an error to include a channel index in this field that is not present in the audio. Channels may be listed in any order but the same index may not be repeated in this list.

BAD: [0, 2] for a stereo file; BAD: [0, 0] for a mono file.
audio_time_offset_ms (uint64 ) This is an optional field. It can be used to indicate that the audio being streamed to the recognizer is offset from the original stream by the provided duration in milliseconds. This offset will be added to all timestamps in results returned by the recognizer.

The default value of this field is 0ms, so the timestamps in the recognition result will not be modified.

Example use case where this field can be helpful: if a recognition session was interrupted and audio needs to be sent to a new session from the point where the session was previously interrupted, the offset could be set to the point where the interruption had happened.
enable_word_details (bool ) This is an optional field. If this is set to true, each result will include word level details of the transcript. These details are specified in the WordDetails message. If set to false, no word-level details will be returned. The default is false.
enable_confusion_network (bool ) This is an optional field. If this is set to true, each result will include a confusion network. If set to false, no confusion network will be returned. The default is false. If the model being used does not support returning a confusion network, this field will have no effect. Tokens in the confusion network always correspond to tokens in the transcript_raw returned.
metadata (RecognitionMetadata ) This is an optional field. If there is any metadata associated with the audio being sent, use this field to provide it to the recognizer. The server may record this metadata when processing the request. The server does not use this field for any other purpose.
context (RecognitionContext ) This is an optional field for providing any additional context information that may aid speech recognition. This can also be used to add out-of-vocabulary words to the model or boost recognition of specific proper names or commands. Context information must be pre-compiled via the CompileContext() method.
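
The selected_audio_channels rules described above (0-indexed, no repeats, no index beyond the channels present, empty meaning all channels) can be checked before the config is sent. This is a minimal sketch; check_selected_channels is a hypothetical helper and not part of the generated bindings.

```python
def check_selected_channels(selected, channels_in_audio):
    """Validate a channel selection and return the channels that would be processed.

    Raises ValueError for a repeated index or an index not present in the audio.
    """
    if len(set(selected)) != len(selected):
        raise ValueError("the same channel index may not be repeated")
    for index in selected:
        if not 0 <= index < channels_in_audio:
            raise ValueError(f"channel {index} is not present in the audio")
    # An unset (empty) selection means every channel is processed.
    if not selected:
        return list(range(channels_in_audio))
    return list(selected)
```

For a stereo file, check_selected_channels([1], 2) returns [1], while check_selected_channels([0, 2], 2) raises ValueError.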

RecognitionConfusionNetwork

Confusion network in recognition output

Fields

links (ConfusionNetworkLink repeated)

RecognitionContext

A collection of additional context information that may aid speech recognition. This can be used to add out-of-vocabulary words to the model or to boost recognition of specific proper names or commands.

Fields

compiled (CompiledContext repeated) List of compiled context information, with each entry being compiled from a list of words or phrases using the CompileContext method.

RecognitionError

Developer-facing error message about a non-fatal recognition issue.

Fields

message (string )

RecognitionMetadata

Metadata associated with the audio to be recognized.

Fields

custom_metadata (string ) Any custom metadata that the client wants to associate with the recording. This could be a simple string (e.g. a tracing ID) or structured data (e.g. JSON).

RecognitionResult

A recognition result corresponding to a portion of audio.

Fields

alternatives (RecognitionAlternative repeated) An n-best list of alternative recognition hypotheses
is_partial (bool ) If this is set to true, it denotes that the result is an interim partial result, and could change after more audio is processed. If unset, or set to false, it denotes that this is a final result and will not change.

Servers are not required to implement support for returning partial results, and clients should generally not depend on their availability.
cnet (RecognitionConfusionNetwork ) If enable_confusion_network was set to true in the RecognitionConfig, and if the model supports it, a confusion network will be available in the results.
audio_channel (uint32 ) Channel of the audio file that this result was transcribed from. Channels are 0-indexed, so for mono audio data, this value will always be 0.

StreamingRecognizeRequest

The top-level messages sent by the client for the StreamingRecognize method. In this streaming call, multiple StreamingRecognizeRequest messages should be sent. The first message must contain a RecognitionConfig message only, and all subsequent messages must contain RecognitionAudio only. All RecognitionAudio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.

Fields

oneof request.config (RecognitionConfig )
oneof request.audio (RecognitionAudio )
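
The required message order (one config message first, then audio-only messages with non-empty data) maps naturally onto a generator. The sketch below uses plain dictionaries as hypothetical stand-ins for StreamingRecognizeRequest, with exactly one key per dictionary to mirror the oneof; with generated bindings the same shape would be built from the generated message classes instead.

```python
def request_stream(config, audio_chunks):
    """Yield request payloads in the documented order: config first, then audio."""
    # First message: RecognitionConfig only.
    yield {"config": config}
    for chunk in audio_chunks:
        if not chunk:
            # Empty audio may be treated by the server as end of stream, so skip it.
            continue
        # Subsequent messages: RecognitionAudio only.
        yield {"audio": {"data": chunk}}


def fixed_size_chunks(data, size):
    """Split a bytes object into consecutive chunks of at most `size` bytes."""
    for start in range(0, len(data), size):
        yield data[start:start + size]
```

For example, list(request_stream({"model_id": "m"}, fixed_size_chunks(b"abcdef", 4))) produces one config payload followed by two audio payloads.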

StreamingRecognizeResponse

The messages returned by the server for the StreamingRecognize request. Multiple messages of this type will be delivered on the stream, for multiple results, as soon as results are available from the audio submitted so far. If the audio has multiple channels, the results of all channels will be interleaved. Results of each individual channel will be chronological. However, there is no guarantee of the order of results across channels.

Clients should process both the result and error fields in each message. At least one of these fields will be present in the message. If both result and error are present, the result is still valid.

Fields

result (RecognitionResult ) A new recognition result. This field will be unset if a new result is not yet available.
error (RecognitionError ) A non-fatal error message. If a server encountered a non-fatal error when processing the recognition request, it will be returned in this message. The server will continue to process audio and produce further results. Clients can continue streaming audio even after receiving these messages. This error message is meant to be informational.

An example of when these errors may be produced: audio is sampled at a lower rate than expected by the model, producing possibly less accurate results.

This field will be unset if there is no error to report.
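
A client loop that follows the guidance above would look at both fields of every message, keep final results grouped by channel, and log errors without stopping. The sketch below uses hypothetical Result and Response dataclasses with None for an unset field. In generated Python protobuf code, presence of a message-typed field is normally checked with HasField rather than by comparing to None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """Stand-in for RecognitionResult, reduced to the fields used here."""
    transcript: str
    is_partial: bool = False
    audio_channel: int = 0


@dataclass
class Response:
    """Stand-in for StreamingRecognizeResponse; None means the field is unset."""
    result: Optional[Result] = None
    error: Optional[str] = None


def collect_results(responses):
    """Return (final transcripts per channel, non-fatal error messages)."""
    finals = {}
    warnings = []
    for response in responses:
        # Both fields are processed; an error does not invalidate the result.
        if response.error is not None:
            warnings.append(response.error)
        result = response.result
        if result is not None and not result.is_partial:
            finals.setdefault(result.audio_channel, []).append(result.transcript)
    return finals, warnings
```

Results for each channel stay in arrival order, which matches the per-channel chronological guarantee; no ordering across channels is assumed.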

VersionRequest

The top-level message sent by the client for the Version method.

VersionResponse

The message sent by the server for the Version method.

Fields

version (string ) Version of the server handling these requests.

WordDetails

Fields

formatted (WordInfo repeated) Word-level information corresponding to the transcript_formatted field.
raw (WordInfo repeated) Word-level information corresponding to the transcript_raw field.

WordInfo

Word level details for recognized words in a transcript

Fields

word (string ) The actual word in the text
confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.
start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word.
duration_ms (uint64 ) Duration in milliseconds of the current word in the spoken audio.

Enums

AudioEncoding

The encoding of the audio data to be sent for recognition.

Name	Number	Description
AUDIO_ENCODING_UNSPECIFIED	0	AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error.
AUDIO_ENCODING_SIGNED	1	PCM signed-integer
AUDIO_ENCODING_UNSIGNED	2	PCM unsigned-integer
AUDIO_ENCODING_IEEE_FLOAT	3	PCM IEEE-Float
AUDIO_ENCODING_ULAW	4	G.711 mu-law
AUDIO_ENCODING_ALAW	5	G.711 a-law

AudioFormatHeadered

Name	Number	Description
AUDIO_FORMAT_HEADERED_UNSPECIFIED	0	AUDIO_FORMAT_HEADERED_UNSPECIFIED is the default value of this type.
AUDIO_FORMAT_HEADERED_WAV	1	WAV with RIFF headers
AUDIO_FORMAT_HEADERED_MP3	2	MP3 format with a valid frame header at the beginning of data
AUDIO_FORMAT_HEADERED_FLAC	3	FLAC format
AUDIO_FORMAT_HEADERED_OGG_OPUS	4	Opus format with OGG header

ByteOrder

Byte order of multi-byte data

Name	Number	Description
BYTE_ORDER_UNSPECIFIED	0	BYTE_ORDER_UNSPECIFIED is the default value of this type.
BYTE_ORDER_LITTLE_ENDIAN	1	Little Endian byte order
BYTE_ORDER_BIG_ENDIAN	2	Big Endian byte order

Scalar Value Types

.proto Type	C++ Type	C# Type	Go Type	Java Type	PHP Type	Python Type	Ruby Type
double	double	double	float64	double	float	float	Float
float	float	float	float32	float	float	float	Float
int32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
int64	int64	long	int64	long	integer/string	int/long	Bignum
uint32	uint32	uint	uint32	int	integer	int/long	Bignum or Fixnum (as required)
uint64	uint64	ulong	uint64	long	integer/string	int/long	Bignum or Fixnum (as required)
sint32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sint64	int64	long	int64	long	integer/string	int/long	Bignum
fixed32	uint32	uint	uint32	int	integer	int	Bignum or Fixnum (as required)
fixed64	uint64	ulong	uint64	long	integer/string	int/long	Bignum
sfixed32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sfixed64	int64	long	int64	long	integer/string	int/long	Bignum
bool	bool	bool	bool	boolean	boolean	boolean	TrueClass/FalseClass
string	string	string	string	String	string	str/unicode	String (UTF-8)
bytes	string	ByteString	[]byte	ByteString	string	str	String (ASCII-8BIT)