API Reference
The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.
This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.
TranscribeService
Service that implements the Cobalt Transcribe Speech Recognition API.
Version
Version(VersionRequest) VersionResponse
Queries the version of the server.
ListModels
ListModels(ListModelsRequest) ListModelsResponse
Retrieves a list of available speech recognition models.
StreamingRecognize
StreamingRecognize(StreamingRecognizeRequest) StreamingRecognizeResponse
Performs bidirectional streaming speech recognition: results are received while audio is still being sent. This method is only available via gRPC, not via HTTP+JSON. However, a web browser may use WebSockets to access this service.
CompileContext
CompileContext(CompileContextRequest) CompileContextResponse
Compiles recognition context information, such as a specialized list of
words or phrases, into a compact, efficient form to send with subsequent
StreamingRecognize requests to customize speech recognition. For example,
a list of contact names may be compiled in a mobile app and sent with each
recognition request so that the app user’s contact names are more likely to
be recognized than arbitrary names. This pre-compilation ensures that there
is no added latency for the recognition request. It is important to note
that in order to compile context for a model, that model has to support
context in the first place, which can be verified by checking its
ModelAttributes.ContextInfo obtained via the ListModels method. Also,
the compiled data will be model specific; that is, the data compiled for
one model will generally not be usable with a different model.
Messages
- If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of the fields populated.
- If a field is labeled repeated, then the generated code will accept an array (or struct, or list, depending on the language).
AudioFormatRAW
Details of audio in raw format
Fields
- encoding (AudioEncoding ) Encoding of the samples. It must be specified explicitly; using the default value of AUDIO_ENCODING_UNSPECIFIED will result in an error.
- bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.). This is a required field.
- byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.
- sample_rate (uint32 ) Sampling rate in Hz. This is a required field.
- channels (uint32 ) Number of channels present in the audio, e.g. 1 (mono), 2 (stereo), etc. This is a required field.
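The field rules above can be checked on the client before a request is sent. The following is a minimal sketch, not the generated bindings: RawFormat is a hypothetical plain-Python stand-in for the AudioFormatRAW message, the constants mirror the numbers in the enum tables of this reference, and treating a zero value as "not set" for required fields is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for generated enum values; the numbers mirror the
# AudioEncoding and ByteOrder tables in this reference.
AUDIO_ENCODING_UNSPECIFIED = 0
AUDIO_ENCODING_SIGNED = 1
BYTE_ORDER_UNSPECIFIED = 0
BYTE_ORDER_LITTLE_ENDIAN = 1


@dataclass
class RawFormat:
    """Plain-Python stand-in for the AudioFormatRAW message (assumption)."""
    encoding: int = AUDIO_ENCODING_UNSPECIFIED
    bit_depth: int = 0
    byte_order: int = BYTE_ORDER_UNSPECIFIED
    sample_rate: int = 0
    channels: int = 0


def validate_raw_format(fmt: RawFormat) -> List[str]:
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    if fmt.encoding == AUDIO_ENCODING_UNSPECIFIED:
        problems.append("encoding must be set explicitly")
    # Assumption: a zero value means the required field was left unset.
    if fmt.bit_depth == 0:
        problems.append("bit_depth is required")
    if fmt.bit_depth > 8 and fmt.byte_order == BYTE_ORDER_UNSPECIFIED:
        problems.append("byte_order is required when bit_depth > 8")
    if fmt.sample_rate == 0:
        problems.append("sample_rate is required")
    if fmt.channels == 0:
        problems.append("channels is required")
    return problems
```

Returning a list of problems rather than raising on the first one lets a client report every misconfigured field at once.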
CompileContextRequest
The top-level message sent by the client for the CompileContext request. It
contains a list of phrases or words, paired with a context token included in
the model being used. The token specifies a category such as “menu_item”,
“airport”, “contact”, “product_name” etc. The context token is used to
determine the places in the recognition output where the provided list of
phrases or words may appear. The allowed context tokens for a given model can
be found in its ModelAttributes.ContextInfo obtained via the ListModels
method.
Fields
- model_id (string ) Unique identifier of the model to compile the context information for. The model chosen needs to support context, which can be verified by checking its ModelAttributes.ContextInfo obtained via ListModels.
- token (string ) The token that is associated with the provided list of phrases or words (e.g. “menu_item”, “airport”, etc.). Must be one of the tokens included in the model being used, which can be retrieved by calling the ListModels method.
- phrases (ContextPhrase repeated) List of phrases and/or words to be compiled.
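Before calling CompileContext, a client may want to confirm that the chosen model supports context and that the token is allowed. The sketch below assumes the ListModels result has already been converted to plain dicts shaped like the Model message; the function name and the dict representation are illustrative and not part of the generated bindings.

```python
def can_compile_context(models, model_id, token):
    """Return True if `model_id` supports context and allows `token`.

    `models` is a list of dicts shaped like the Model message
    (a hypothetical representation of a ListModels result).
    """
    for model in models:
        if model["id"] != model_id:
            continue
        info = model["attributes"]["context_info"]
        return info["supports_context"] and token in info["allowed_context_tokens"]
    # Unknown model: nothing to compile against.
    return False
```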
CompileContextResponse
The message returned to the client by the CompileContext method.
Fields
- context (CompiledContext ) Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.
CompiledContext
Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.
Fields
- data (bytes ) The context information compiled by the CompileContext method.
ConfusionNetworkArc
An Arc inside a Confusion Network Link
Fields
- word (string ) Word in the recognized transcript
- confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.
- features (ConfusionNetworkArcFeatures ) Features related to this arc
ConfusionNetworkArcFeatures
Features related to confusion network arcs
Fields
- confidence (map ConfusionNetworkArcFeatures.ConfidenceEntry repeated) A map of features that are used for recalculating confidence scores of this confusion network arc
ConfusionNetworkArcFeatures.ConfidenceEntry
Fields
ConfusionNetworkLink
A Link inside a confusion network
Fields
- start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this link
- duration_ms (uint64 ) Duration in milliseconds of the current link in the confusion network
- arcs (ConfusionNetworkArc repeated) Arcs within this link
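One way to read a confusion network is to keep, for each link, the arc with the highest confidence. The sketch below assumes links and arcs are plain dicts shaped like ConfusionNetworkLink and ConfusionNetworkArc; whether the resulting word sequence always matches the top recognition alternative is not stated in this reference, so treat it as an illustration only.

```python
def best_arcs(links):
    """Return (word, start_time_ms, duration_ms) for the top arc of each link."""
    path = []
    for link in links:
        arcs = link.get("arcs", [])
        if not arcs:
            continue  # nothing to choose from in this link
        best = max(arcs, key=lambda arc: arc["confidence"])
        path.append((best["word"], link["start_time_ms"], link["duration_ms"]))
    return path
```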
ContextInfo
Model information specific to supporting recognition context.
Fields
- supports_context (bool ) If this is set to true, the model supports taking context information into account to aid speech recognition. The information may be sent with recognition requests via RecognitionContext inside RecognitionConfig.
- allowed_context_tokens (string repeated) A list of tokens (e.g. “name”, “airport”, etc.) that serve as placeholders in the model where a client-provided list of phrases or words may be used to aid speech recognition and produce the exact desired recognition output.
ContextPhrase
A phrase or word that is to be compiled into context information that can be
later used to improve speech recognition during a StreamingRecognize call.
Along with the phrase or word itself, there is an optional boost parameter
that can be used to boost the likelihood of the phrase or word in the
recognition output.
Fields
- text (string ) The actual phrase or word.
- boost (float ) This is an optional field. The boost factor is a positive number which is used to multiply the probability of the phrase or word appearing in the output. This setting can be used to differentiate between similar sounding words, with the desired word given a bigger boost factor.
By default, all phrases or words provided in the RecognitionContext are given an equal probability of occurring. Boost factors larger than 1 make the phrase or word more probable and boost factors less than 1 make it less likely. A boost factor of 2 corresponds to making the phrase or word twice as likely, while a boost factor of 0.5 means half as likely.
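The arithmetic of the boost factor can be illustrated with relative weights. This is only a sketch of the description above, not the server's implementation: phrases are given as hypothetical (text, boost) pairs, and None is used here to mean that boost was left unset.

```python
def relative_weights(phrases):
    """Map each phrase to its weight relative to an unboosted phrase.

    `phrases` is a list of (text, boost) pairs; boost=None means "not set"
    (an assumption of this sketch), which leaves the phrase at weight 1.0.
    """
    weights = {}
    for text, boost in phrases:
        factor = 1.0 if boost is None else float(boost)
        if factor <= 0:
            raise ValueError(f"boost for {text!r} must be a positive number")
        weights[text] = factor
    return weights
```

With the phrases ("Jon", 2.0), ("John", None) and ("Joan", 0.5), "Jon" ends up twice as likely as "John", and "Joan" half as likely.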
ListModelsRequest
The top-level message sent by the client for the ListModels method.
ListModelsResponse
The message returned to the client by the ListModels method.
Fields
- models (Model repeated) List of models available for use that match the request.
Model
Description of a Transcribe Model
Fields
- id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RecognitionConfig message.
- name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their recognition task.
- attributes (ModelAttributes ) Model attributes
ModelAttributes
Attributes of a Transcribe Model
Fields
- sample_rate (uint32 ) Audio sample rate supported by the model
- context_info (ContextInfo ) Attributes specific to supporting recognition context.
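A client can use these attributes to pick a model whose sample rate matches the audio it intends to send. The sketch below assumes the models from ListModels are available as plain dicts shaped like the Model message; the function name is illustrative.

```python
def models_for_sample_rate(models, sample_rate):
    """Return the ids of models whose supported sample rate equals `sample_rate`."""
    return [
        model["id"]
        for model in models
        if model["attributes"]["sample_rate"] == sample_rate
    ]
```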
RecognitionAlternative
A recognition hypothesis
Fields
- transcript_formatted (string ) Text representing the transcription of the words that the user spoke.
The transcript will be formatted according to the server’s formatting configuration. If you want the raw transcript, please see the field transcript_raw. If the server is configured to not use any formatting, then this field will contain the raw transcript.
As an example, if the spoken utterance was “four people”, and the server was configured to format numbers, this field would be set to “4 people”.
- transcript_raw (string ) Text representing the transcription of the words that the user spoke, without any formatting applied. If you want the formatted transcript, please see the field transcript_formatted.
As an example, if the spoken utterance was “four people”, this field would be set to “FOUR PEOPLE”.
- start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this utterance.
- duration_ms (uint64 ) Duration in milliseconds of the current utterance in the spoken audio.
- confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct.
- word_details (WordDetails ) Word-level details corresponding to the transcripts. This is available only if enable_word_details was set to true in the RecognitionConfig.
RecognitionAudio
Audio to be sent to the recognizer
Fields
- data (bytes )
RecognitionConfig
Configuration for setting up a Recognizer
Fields
- model_id (string ) Unique identifier of the model to use, as obtained from a Model message.
- oneof audio_format.audio_format_raw (AudioFormatRAW ) Audio is raw data without any headers
- oneof audio_format.audio_format_headered (AudioFormatHeadered ) Audio has a self-describing header. Headers are expected to be sent at the beginning of the entire audio file/stream, and not in every RecognitionAudio message.
The default value of this type is AUDIO_FORMAT_HEADERED_UNSPECIFIED. If this value is used, the server may attempt to detect the format of the audio. However, it is recommended that the exact format be specified.
- selected_audio_channels (uint32 repeated) This is an optional field. If the audio has multiple channels, this field can be configured with the list of channel indices that should be considered for the recognition task. These channels are 0-indexed.
Example: [0] for a mono file, [0, 1] for a stereo file. Example: [1] to only transcribe the second channel of a stereo file.
If this field is not set, all the channels in the audio will be processed.
Channels that are present in the audio may be omitted, but it is an error to include a channel index in this field that is not present in the audio. Channels may be listed in any order but the same index may not be repeated in this list.
BAD: [0, 2] for a stereo file; BAD: [0, 0] for a mono file.
- audio_time_offset_ms (uint64 ) This is an optional field. It can be used to indicate that the audio being streamed to the recognizer is offset from the original stream by the provided duration in milliseconds. This offset will be added to all timestamps in results returned by the recognizer.
The default value of this field is 0 ms, so the timestamps in the recognition result will not be modified.
Example use case where this field can be helpful: if a recognition session was interrupted and audio needs to be sent to a new session from the point where the session was previously interrupted, the offset could be set to the point where the interruption had happened.
- enable_word_details (bool ) This is an optional field. If this is set to true, each result will include word-level details of the transcript. These details are specified in the WordDetails message. If set to false, no word-level details will be returned. The default is false.
- enable_confusion_network (bool ) This is an optional field. If this is set to true, each result will include a confusion network. If set to false, no confusion network will be returned. The default is false. If the model being used does not support returning a confusion network, this field will have no effect. Tokens in the confusion network always correspond to tokens in the transcript_raw returned.
- metadata (RecognitionMetadata ) This is an optional field. If there is any metadata associated with the audio being sent, use this field to provide it to the recognizer. The server may record this metadata when processing the request. The server does not use this field for any other purpose.
- context (RecognitionContext ) This is an optional field for providing any additional context information that may aid speech recognition. This can also be used to add out-of-vocabulary words to the model or boost recognition of specific proper names or commands. Context information must be pre-compiled via the CompileContext() method.
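The rules for selected_audio_channels in the field list above (0-indexed, no repeats, no index beyond the channels present, any order) can be checked on the client side. The following is a minimal sketch; the function name is hypothetical and the server remains the authority on what it accepts.

```python
def resolve_selected_channels(selected, num_channels):
    """Validate `selected` against an audio stream with `num_channels` channels.

    Returns the channel indices that would be processed: the validated
    selection, or every channel when the selection is empty (field not set).
    """
    if len(set(selected)) != len(selected):
        raise ValueError("the same channel index may not be repeated")
    for index in selected:
        if not 0 <= index < num_channels:
            raise ValueError(f"channel {index} is not present in the audio")
    return list(selected) if selected else list(range(num_channels))
```

For a stereo file, [1] selects only the second channel and an empty list selects both, while [0, 2] is rejected; [0, 0] is rejected for any file.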
RecognitionConfusionNetwork
Confusion network in recognition output
Fields
- links (ConfusionNetworkLink repeated)
RecognitionContext
A collection of additional context information that may aid speech recognition. This can be used to add out-of-vocabulary words to the model or to boost recognition of specific proper names or commands.
Fields
- compiled (CompiledContext repeated) List of compiled context information, with each entry being compiled from a list of words or phrases using the CompileContext method.
RecognitionError
Developer-facing error message about a non-fatal recognition issue.
Fields
- message (string )
RecognitionMetadata
Metadata associated with the audio to be recognized.
Fields
- custom_metadata (string ) Any custom metadata that the client wants to associate with the recording. This could be a simple string (e.g. a tracing ID) or structured data (e.g. JSON).
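Because custom_metadata is a single string, structured data has to be serialized by the client. A minimal sketch using JSON follows; the key names trace_id and device are purely illustrative and carry no meaning for the server.

```python
import json


def build_custom_metadata(trace_id, **extra):
    """Serialize a tracing ID plus optional extra fields into one JSON string."""
    payload = {"trace_id": trace_id}
    payload.update(extra)
    # sort_keys gives a stable string, which helps when searching logs.
    return json.dumps(payload, sort_keys=True)
```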
RecognitionResult
A recognition result corresponding to a portion of audio.
Fields
- alternatives (RecognitionAlternative repeated) An n-best list of recognition hypotheses
- is_partial (bool ) If this is set to true, it denotes that the result is an interim partial result, and could change after more audio is processed. If unset, or set to false, it denotes that this is a final result and will not change.
Servers are not required to implement support for returning partial results, and clients should generally not depend on their availability.
- cnet (RecognitionConfusionNetwork ) If enable_confusion_network was set to true in the RecognitionConfig, and if the model supports it, a confusion network will be available in the results.
- audio_channel (uint32 ) Channel of the audio file that this result was transcribed from. Channels are 0-indexed, so for mono audio data, this value will always be 0.
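Since partial results may change or never arrive, a client that only needs stable text can skip them. The sketch below assumes results are plain dicts shaped like RecognitionResult. It picks the alternative with the highest confidence rather than relying on the order of the n-best list, because this reference does not state how the list is ordered.

```python
def final_transcripts(results):
    """Collect the formatted transcript of each final (non-partial) result."""
    transcripts = []
    for result in results:
        if result.get("is_partial", False):
            continue  # interim result: may still change
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        best = max(alternatives, key=lambda alt: alt["confidence"])
        transcripts.append(best["transcript_formatted"])
    return transcripts
```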
StreamingRecognizeRequest
The top-level messages sent by the client for the StreamingRecognize
method. In this streaming call, multiple StreamingRecognizeRequest messages
should be sent. The first message must contain a RecognitionConfig message
only, and all subsequent messages must contain RecognitionAudio only. All
RecognitionAudio messages must contain non-empty audio. If audio content is
empty, the server may choose to interpret it as end of stream and stop
accepting any further messages.
Fields
- oneof request.config (RecognitionConfig )
- oneof request.audio (RecognitionAudio )
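The required message order (one config message first, then only non-empty audio messages) maps naturally onto a generator. The sketch below yields plain dicts shaped like StreamingRecognizeRequest instead of generated message classes; the chunk size of 8192 bytes is an arbitrary assumption, not a documented limit.

```python
def request_sequence(config, audio, chunk_size=8192):
    """Yield the config message first, then the audio in non-empty chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # First message: RecognitionConfig only.
    yield {"config": config}
    # Subsequent messages: RecognitionAudio only. Slicing over range(len(audio))
    # never produces an empty chunk, so no message signals end of stream early.
    for start in range(0, len(audio), chunk_size):
        yield {"audio": {"data": audio[start:start + chunk_size]}}
```

With generated gRPC bindings, a generator like this would typically be passed as the request iterator of the streaming call, with the dicts replaced by the generated message types.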
StreamingRecognizeResponse
The messages returned by the server for the StreamingRecognize request.
Multiple messages of this type will be delivered on the stream, for multiple
results, as soon as results are available from the audio submitted so far. If
the audio has multiple channels, the results of all channels will be
interleaved. Results of each individual channel will be chronological.
However, there is no guarantee of the order of results across channels.
Clients should process both the result and error fields in each message.
At least one of these fields will be present in the message. If both result
and error are present, the result is still valid.
Fields
- result (RecognitionResult ) A new recognition result. This field will be unset if a new result is not yet available.
- error (RecognitionError ) A non-fatal error message. If a server encountered a non-fatal error when processing the recognition request, it will be returned in this message. The server will continue to process audio and produce further results. Clients can continue streaming audio even after receiving these messages. This error message is meant to be informational.
An example of when these errors may be produced: audio is sampled at a lower rate than expected by model, producing possibly less accurate results.
This field will be unset if there is no error to report.
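A response loop should look at both fields of every message and keep results grouped by channel, since ordering is only guaranteed within a channel. The following is a sketch over plain dicts shaped like StreamingRecognizeResponse, where a missing key stands for an unset field; the function name is illustrative.

```python
from collections import defaultdict


def collect_responses(responses):
    """Split a response stream into per-channel results and error messages."""
    results_by_channel = defaultdict(list)
    errors = []
    for response in responses:
        error = response.get("error")
        if error is not None:
            # Non-fatal: record it and keep going, the stream stays usable.
            errors.append(error["message"])
        result = response.get("result")
        if result is not None:
            # A result is valid even when an error is present in the same message.
            results_by_channel[result.get("audio_channel", 0)].append(result)
    return dict(results_by_channel), errors
```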
VersionRequest
The top-level message sent by the client for the Version method.
VersionResponse
The message sent by the server for the Version method.
Fields
- version (string ) Version of the server handling these requests.
WordDetails
Fields
- formatted (WordInfo repeated) Word-level information corresponding to the transcript_formatted field.
- raw (WordInfo repeated) Word-level information corresponding to the transcript_raw field.
WordInfo
Word level details for recognized words in a transcript
Fields
- word (string ) The actual word in the text
- confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.
- start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word.
- duration_ms (uint64 ) Duration in milliseconds of the current word in the spoken audio.
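The end time of a word is not a field of its own; it follows from start_time_ms plus duration_ms. A minimal sketch that converts word-level details into (word, start, end) spans in seconds, assuming each entry is a plain dict shaped like WordInfo:

```python
def word_spans(words):
    """Return (word, start_seconds, end_seconds) for each WordInfo-like dict."""
    spans = []
    for info in words:
        start_ms = info["start_time_ms"]
        end_ms = start_ms + info["duration_ms"]
        spans.append((info["word"], start_ms / 1000, end_ms / 1000))
    return spans
```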
Enums
AudioEncoding
The encoding of the audio data to be sent for recognition.
| Name | Number | Description |
|---|---|---|
| AUDIO_ENCODING_UNSPECIFIED | 0 | AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error. |
| AUDIO_ENCODING_SIGNED | 1 | PCM signed-integer |
| AUDIO_ENCODING_UNSIGNED | 2 | PCM unsigned-integer |
| AUDIO_ENCODING_IEEE_FLOAT | 3 | PCM IEEE-Float |
| AUDIO_ENCODING_ULAW | 4 | G.711 mu-law |
| AUDIO_ENCODING_ALAW | 5 | G.711 a-law |
AudioFormatHeadered
| Name | Number | Description |
|---|---|---|
| AUDIO_FORMAT_HEADERED_UNSPECIFIED | 0 | AUDIO_FORMAT_HEADERED_UNSPECIFIED is the default value of this type. |
| AUDIO_FORMAT_HEADERED_WAV | 1 | WAV with RIFF headers |
| AUDIO_FORMAT_HEADERED_MP3 | 2 | MP3 format with a valid frame header at the beginning of data |
| AUDIO_FORMAT_HEADERED_FLAC | 3 | FLAC format |
| AUDIO_FORMAT_HEADERED_OGG_OPUS | 4 | Opus format with OGG header |
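Since specifying the exact format is recommended over leaving it unspecified, a client can inspect the first bytes of the audio to choose a value. The sketch below relies on the well-known signatures of each container (RIFF/WAVE, fLaC, OggS with an OpusHead packet, and the MP3 frame sync bits). It returns the enum names as strings and is a client-side heuristic only, unrelated to how the server detects formats.

```python
def sniff_headered_format(data: bytes) -> str:
    """Guess the AudioFormatHeadered name from the start of an audio stream."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "AUDIO_FORMAT_HEADERED_WAV"
    if data[:4] == b"fLaC":
        return "AUDIO_FORMAT_HEADERED_FLAC"
    # An Ogg page starts with "OggS"; for Opus the first packet is "OpusHead".
    if data[:4] == b"OggS" and b"OpusHead" in data[:64]:
        return "AUDIO_FORMAT_HEADERED_OGG_OPUS"
    # An MP3 frame header begins with 11 set bits (frame sync).
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "AUDIO_FORMAT_HEADERED_MP3"
    return "AUDIO_FORMAT_HEADERED_UNSPECIFIED"
```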
ByteOrder
Byte order of multi-byte data
| Name | Number | Description |
|---|---|---|
| BYTE_ORDER_UNSPECIFIED | 0 | BYTE_ORDER_UNSPECIFIED is the default value of this type. |
| BYTE_ORDER_LITTLE_ENDIAN | 1 | Little Endian byte order |
| BYTE_ORDER_BIG_ENDIAN | 2 | Big Endian byte order |
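When raw samples come straight from memory on the client machine (for example from an audio capture buffer), their byte order is the machine's native order. A small sketch that maps Python's sys.byteorder to the numbers in the table above; the constant names are stand-ins for whatever the generated bindings expose.

```python
import sys

# Stand-ins for generated enum values; numbers taken from the ByteOrder table.
BYTE_ORDER_LITTLE_ENDIAN = 1
BYTE_ORDER_BIG_ENDIAN = 2


def native_byte_order() -> int:
    """Return the ByteOrder number matching this machine's native byte order."""
    if sys.byteorder == "little":
        return BYTE_ORDER_LITTLE_ENDIAN
    return BYTE_ORDER_BIG_ENDIAN
```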