Transcribe

Low latency, high accuracy on-prem / on-cloud solutions to your ASR needs.

Cobalt’s Transcribe engine is a state-of-the-art speech recognition system. Cobalt Transcribe supports two different DNN architectures:

  • Hybrid models combine separately tunable Acoustic Models, Lexicons, and Language Models, making them highly customizable for specific use cases. Hybrid models support extremely low-latency partial results.

  • End-to-end models go straight from sounds to words in the same DNN. They tend to be more accurate for general use cases, particularly for systems in which sub-second response time is not required.

Cobalt Transcribe is a highly flexible system that can run on-premise, in your private cloud, or fully embedded on your device. Your data – both the audio and the transcripts – never leave your control.

The SDK is based on a gRPC API; client code can be generated for many languages from the proto definition, including C++, C#, Go, Java and Python, and support for more languages can be added as required.

Once running, Transcribe’s API provides a method to which you can stream audio, either from a microphone or from a file. We recommend uncompressed WAV as the encoding, but other formats such as MP3, µ-law, etc. are also supported.

Cubic setup

Transcribe’s API provides a number of options for returning the speech recognition results. The results are passed back using Google’s protobuf library, allowing them to be handled natively by your application. Transcribe can estimate its confidence in the transcription result at the word or utterance level, along with timestamps of the words. Confidence scores are in the range 0-1. Transcribe’s output options are described below.

Automatic Transcription Results

The simplest result that Transcribe returns is its best guess at the transcription of your audio. Transcribe recognizes the audio you are streaming, listens for the end of each utterance, and returns the speech recognition result.

Transcribe maintains its transcriptions in an N-best list, i.e. the top N transcriptions from the recognizer. The best ASR result is the first entry in this list.

Here is an example JSON representation of Transcribe’s N-best list with utterance-level confidence scores:
{
  "alternatives": [
    {
      "transcript": "TOMORROW IS A NEW DAY",
      "confidence": 0.514
    },
    {
      "transcript": "TOMORROW IS NEW DAY",
      "confidence": 0.201
    },
    {
      "transcript": "TOMORROW IS A <UNK> DAY",
      "confidence": 0.105
    },
    {
      "transcript": "TOMORROW IS ISN'T NEW DAY",
      "confidence": 0.093
    },
    {
      "transcript": "TOMORROW IS A YOUR DAY",
      "confidence": 0.087
    }
  ]
}

A single stream may consist of multiple utterances separated by silence. Transcribe handles each utterance separately.

For longer utterances, it is often useful to see the partial speech recognition results while the audio is being streamed. For example, this allows you to see what the ASR system is predicting in real-time while someone is speaking. Transcribe supports both partial and final ASR results.
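
As a small illustration (a sketch assuming the Python SDK shown later in this guide), each streamed result carries an is_partial flag that distinguishes in-progress hypotheses from the final transcription of an utterance:

def handle_result(resp):
    result = resp.result
    best = result.alternatives[0]  # 1-best hypothesis.
    if result.is_partial:
        # Partial hypothesis: may still change as more audio arrives.
        print("partial:", best.transcript_formatted)
    else:
        # Final hypothesis for this utterance.
        print("final:  ", best.transcript_formatted)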

Confusion Network

A Confusion Network is a form of speech recognition output that has been turned into a compact graph representation of many possible transcriptions, as shown here:

Confusion Network Example

Note that <eps> in this representation is silence.

Here is an example JSON representation of this Confusion Network object, with timestamps and word-level confidence scores:
{
  "cnet": {
    "links": [
      {
        "duration": "1.350s",
        "arcs": [
          {
            "word": "<eps>",
            "confidence": 1.0
          }
        ],
        "startTime": "0s"
      },
      {
        "duration": "0.690s",
        "arcs": [
          {
            "word": "TOMORROW",
            "confidence": 1.0
          }
        ],
        "startTime": "1.350s"
      },
      {
        "duration": "0.080s",
        "arcs": [
          {
            "word": "<eps>",
            "confidence": 1.0
          }
        ],
        "startTime": "2.040s"
      },
      {
        "duration": "0.168s",
        "arcs": [
          {
            "word": "IS",
            "confidence": 0.892
          },
          {
            "word": "<eps>",
            "confidence": 0.108
          }
        ],
        "startTime": "2.120s"
      },
      {
        "duration": "0.010s",
        "arcs": [
          {
            "word": "<eps>",
            "confidence": 1.0
          }
        ],
        "startTime": "2.288s"
      },
      {
        "duration": "0.093s",
        "arcs": [
          {
            "word": "A",
            "confidence": 0.620
          },
          {
            "word": "<eps>",
            "confidence": 0.233
          },
          {
            "word": "ISN'T",
            "confidence": 0.108
          },
          {
            "word": "THE",
            "confidence": 0.039
          }
        ],
        "startTime": "2.298s"
      },
      {
        "duration": "0.005s",
        "arcs": [
          {
            "word": "<eps>",
            "confidence": 1.0
          }
        ],
        "startTime": "2.391s"
      },
      {
        "duration": "0.273s",
        "arcs": [
          {
            "word": "NEW",
            "confidence": 0.661
          },
          {
            "word": "<UNK>",
            "confidence": 0.129
          },
          {
            "word": "YOUR",
            "confidence": 0.107
          },
          {
            "word": "YOU",
            "confidence": 0.102
          }
        ],
        "startTime": "2.396s"
      },
      {
        "duration": "0s",
        "arcs": [
          {
            "word": "<eps>",
            "confidence": 1.0
          }
        ],
        "startTime": "2.670s"
      },
      {
        "duration": "0.420s",
        "arcs": [
          {
            "word": "DAY",
            "confidence": 0.954
          },
          {
            "word": "TODAY",
            "confidence": 0.044
          },
          {
            "word": "<UNK>",
            "confidence": 0.002
          }
        ],
        "startTime": "2.670s"
      },
      {
        "duration": "0.270s",
        "arcs": [
          {
            "word": "<eps>",
            "confidence": 1.0
          }
        ],
        "startTime": "3.090s"
      }
    ]
  }
}
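
As an illustration, the following minimal Python sketch (a hypothetical helper, assuming the confusion network JSON above has been parsed into a dict with json.loads) walks the links and keeps the most confident word in each slot, skipping silence:

def best_path(cnet: dict) -> str:
    """Return the most likely word sequence from a confusion network dict."""
    words = []
    for link in cnet["links"]:
        # Each link's arcs are alternative words for that time slot; keep the
        # most confident one and skip silence (<eps>).
        top = max(link["arcs"], key=lambda arc: arc["confidence"])
        if top["word"] != "<eps>":
            words.append(top["word"])
    return " ".join(words)

# e.g. with the JSON above loaded as `result = json.loads(...)`:
# best_path(result["cnet"])  ->  "TOMORROW IS A NEW DAY"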

Formatted output

Many speech recognition systems output raw words exactly as spoken, without any of the formatting that improves intelligibility. Cobalt Transcribe’s customizable formatting suite enables a variety of intelligent formatting options:

  • Capitalizing the first letter of the utterance
  • Numbers: “cobalt’s atomic number is twenty seven” -> “Cobalt’s atomic number is 27”
  • TrueCasing: “the iphone was launched in two thousand and seven” -> “The iPhone was launched in 2007”
  • Ordinals: “summer solstice is twenty first june” -> “Summer solstice is 21st June”
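
Both the raw and the formatted form are returned with each hypothesis, so applications can use whichever suits them. A minimal Python sketch, assuming the transcript_raw and transcript_formatted fields shown in the Recognition Configurations section later in this guide:

def show_formatting(result):
    best = result.alternatives[0]
    print("raw:      ", best.transcript_raw)        # e.g. TOMORROW IS A NEW DAY
    print("formatted:", best.transcript_formatted)  # e.g. Tomorrow is a new day.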

1 - Getting Started

How to get a Transcribe Server running on your system

Using Cobalt Transcribe

  • A typical Transcribe release, provided as a compressed archive, will contain a Linux binary (transcribe-server) for the required native CPU architecture, an appropriate Dockerfile, and models.

  • Cobalt Transcribe runs either locally on Linux or in a Docker container.

  • Cobalt Transcribe will serve the Transcribe gRPC API on port 2727. A web demo will be enabled on port 8080.

  • To quickly try out Transcribe, first start the server as shown below and open the web demo at http://localhost:8080 in your browser to send live microphone input or upload an audio file for transcription. You can also use the SDK to call Transcribe from within your application, or simply use the command line.

Running Transcribe Server Locally on Linux

./transcribe-server
  • By default, the binary assumes the presence of a configuration file, located in the same directory, named: transcribe-server.cfg.toml. A different config file may be specified using the --config argument.

Running Transcribe Server as a Docker Container

To build and run the Docker image for Transcribe, run:

docker build -t cobalt-transcribe .
docker run -p 2727:2727 -p 8080:8080 cobalt-transcribe

How to Get a Copy of the Transcribe Server and Models

Please contact us to find the product release or Transcribe model best suited to your requirements.

The demo release you will receive is a compressed archive (tar.bz2) structured as follows:

release.tar.bz2
├── COPYING
├── README.md
├── transcribe-server
├── transcribe-server.cfg.toml
├── cobalt.license.key
├── Dockerfile
├── models
│   └── en_US-16khz
├── formatters
│   └── en_US-16khz
│
└── cobalt.license.key [ provided separately, needs to be copied over ]
  • The README.md file contains information about this release and instructions for how to start the server on your system.

  • The transcribe-server is the server program which is configured using the transcribe-server.cfg.toml file.

  • The Dockerfile can be used to create a container that will let you run the Transcribe server on non-Linux systems such as macOS and Windows.

  • The models and formatters directories contain your speech recognition and text formatting models. The content of these directories will depend on the models you downloaded.

System Requirements

Cobalt Transcribe runs on Linux. You can run it directly as a Linux application, or using Docker.

You can evaluate the product on Windows or Linux using Docker Desktop but we would not recommend this setup for use in a production environment.

A Cobalt Transcribe release typically includes a single Transcribe model together with binaries and config files. The general-purpose Transcribe models take up to 4 GB of disk space and need a minimum of 4 GB of RAM when evaluating locally. For production workloads, we recommend containerized deployments, with each instance allocated 4 CPUs and 8 GB of RAM.

Cobalt Transcribe runs on x86_64 CPUs. We also support Arm64 CPUs, including processors such as the Graviton (AWS c7g EC2 instances). Transcribe is significantly more cost-effective to run on c7g instances than on similarly sized Intel or AMD instances, and we can provide an Arm64 release on request.

To integrate Cobalt Transcribe into your application, please follow the next steps to Generate the SDK in a language of your choice.

2 - Running Quick Tests

How to use our prebuilt client application to test Cobalt Transcribe.

In the release package you received, we have bundled a prebuilt command-line client that you can use to connect to the server you now have running. You can use this client to send files to the server and save the transcription, either as a stream of text or as JSON with additional information.

Obtaining the Client Application

In the release package of Cobalt Transcribe, you should find a folder called transcribe-client, which contains client binaries for multiple platforms.

transcribe-client/
└── bin
    ├── darwin_amd64
    │   └── transcribe-client
    └── linux_amd64
        └── transcribe-client

This client is implemented in Go, and the code is available for reference.

If you need to build the client yourself, instead of using the pre-packaged version in the release, you can install it using:

go install github.com/cobaltspeech/examples-go/transcribe/transcribe-client@latest

Running The Client

After you follow the Getting Started instructions, you should have Cobalt Transcribe server running and listening on port 2727 on the local machine.

You can run the transcribe client in various ways:

# Transcribe a file using the default address of the Cobalt Transcribe Server (localhost:2727)
transcribe-client recognize input.wav
# Transcribe a file, pointing to some other server address
transcribe-client recognize input.wav --server host:port
# Transcribe a file using the default server address, and save output as a JSON file.
transcribe-client recognize input.wav --output-json output.json
# (Advanced use): List information about the models available on the server.
transcribe-client list
# (Advanced use): Transcribe a file, get word level timestamps and confidences.
transcribe-client recognize input.wav --output-json output.json --recognition-config '{"enable_word_details": true}'

For more details on the recognition-config struct, please see the API spec.

# Getting usage information
transcribe-client --help

3 - Generating SDKs

Gives instructions about how to generate an SDK for your project from the proto API definition.
  • APIs for all of Cobalt’s services are defined as a protocol buffer specification (a proto file) and can be found in the cobaltspeech/proto GitHub repository.

  • The proto file allows a developer to auto-generate client SDKs for a number of different programming languages. Step by step instructions for generating your own SDK can be found below.

  • We provide pre-generated SDKs for a couple of languages. You can choose to use these instead of generating your own. These are listed here along with instructions on how to install / import them into your projects.

Pre-generated SDKs

Golang

import transcribepb "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
  • An example client using the above repo can be found here.

Python

  • Pre-generated SDK files for Python can be found in the cobaltspeech/py-genproto repo

  • The Python SDK depends on Python >= 3.5. You may use pip to perform a system-wide install, or use virtualenv for a local install. To use it in your Python project, install it:

pip install --upgrade pip
pip install "git+https://github.com/cobaltspeech/py-genproto"

Generating SDKs

Step 1. Installing buf

  • To work with proto files, we recommend using buf, a user-friendly command line tool that can be configured to generate documentation, schemas and SDK code for different languages.
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/bin"

# Latest version as of March 14th, 2023.
VERSION="1.15.1"
URL="https://github.com/bufbuild/buf/releases/download/v${VERSION}/buf-$(uname -s)-$(uname -m)"
curl -L "${URL}" -o "${COBALT}/bin/buf"

# Give executable permissions and add to $PATH.
chmod +x "${COBALT}/bin/buf"
export PATH="${PATH}:${COBALT}/bin"

# Alternatively, buf can be installed with Homebrew:
# brew install bufbuild/buf/buf

Step 2. Getting proto files

COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"

# Change this to where you want to clone the repo to.
PROTO_REPO="${COBALT}/git/proto"

git clone https://github.com/cobaltspeech/proto "${PROTO_REPO}"

Step 3. Generating code

  • The cobaltspeech/proto repo provides a buf.gen.yaml config file to get you started with a couple of languages.

  • Other plugins can be added to the buf.gen.yaml file to generate SDK code for more languages.

  • To generate the SDKs, simply run the following (assuming the buf binary is in your $PATH)

cd "${PROTO_REPO}"

# Removing any previously generated files.
rm -rf ./gen

# Generating code for all proto files inside the `proto` directory.
buf generate proto
  • You should now have a folder called gen inside ${PROTO_REPO} that contains the generated code. The latest version of the transcribe API is v5. You can import / include / copy the generated files into your projects as per the conventions of different languages.
gen
├── ... other languages ...
└── py
    └── cobaltspeech
        ├── ... other services ...
        └── transcribe
            └── v5
                ├── transcribe_pb2_grpc.py
                ├── transcribe_pb2.py
                └── transcribe_pb2.pyi

gen
├── ... other languages ...
└── go
    ├── cobaltspeech
    │   ├── ...
    │   └── transcribe
    │       └── v5
    │           ├── transcribe_grpc.pb.go
    │           └── transcribe.pb.go
    └── gw
        └── cobaltspeech
            ├── ...
            └── transcribe
                └── v5
                    └── transcribe.pb.gw.go

gen
├── ... other languages ...
└── cpp
    └── cobaltspeech
        ├── ...
        └── transcribe
            └── v5
                ├── transcribe.grpc.pb.cc
                ├── transcribe.grpc.pb.h
                ├── transcribe.pb.cc
                ├── transcribe.pb.h
                ├── transcribe.pb.validate.cc
                └── transcribe.pb.validate.h
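
For example, with the generated gen/py directory added to your PYTHONPATH, the modules can be imported as they are used in the code samples later in this guide:

# Assumes the generated gen/py directory (shown above) is on PYTHONPATH.
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as transcribe_grpc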

Step 4. Installing gRPC and protobuf

  • A couple of gRPC and protobuf dependencies are required along with the code generated above. The method of installing them depends on the programming language being used.
  • These dependencies and the most common ways of installing / including them are listed below for some chosen languages.
# Python: it is encouraged to do this inside a Python virtual environment
# to avoid creating version conflicts for other scripts that may
# be using these libraries.
pip install --upgrade protobuf
pip install --upgrade grpcio
pip install --upgrade google-api-python-client

# Go:
go get google.golang.org/protobuf
go get google.golang.org/grpc
go get google.golang.org/genproto
# C++: more details on gRPC installation can be found at:
# https://grpc.io/docs/languages/cpp/quickstart/

COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"

# Latest version as of 14th March, 2023.
VERSION="v1.52.0"
GRPC_REPO="${COBALT}/git/grpc-${VERSION}"

git clone \
 --recurse-submodules --depth 1 --shallow-submodules \
 -b "${VERSION}" \
 https://github.com/grpc/grpc "${GRPC_REPO}"

cd "${GRPC_REPO}"
mkdir -p cmake/build

# Change this to where you want to install libprotobuf and libgrpc.
# It is encouraged to install gRPC locally as there is no easy way to
# uninstall gRPC after you’ve installed it globally.
INSTALL_DIR="${COBALT}"

cd cmake/build
cmake \
 -DgRPC_INSTALL=ON \
 -DgRPC_BUILD_TESTS=OFF \
 -DCMAKE_INSTALL_PREFIX="${INSTALL_DIR}" \
 ../..

make -j
make install

4 - Connecting to the Server

Describes how to connect to a running Cobalt Transcribe server instance.
  • Once you have your Transcribe server up and running and have generated the SDK for your project, you can connect to a running instance of Transcribe server by “dialing” a gRPC connection.

  • First, you need to know the address where the server is running: e.g. host:grpc_port. By default, this is localhost:2727 and should be logged to the terminal when you first start Transcribe server as grpcAddr:

2023/03/15 07:54:01 info  {"license":"Copyright © 2015--present. Cobalt Speech and Language, Inc.  For additional details, including information about open source components used in this software, please see the COPYING file bundled with this program."}
2023/03/15 07:54:01 info  {"msg":"reading config file","path":"transcribe-server.cfg.toml"}
2023/03/15 07:54:01 info  {"msg":"version","server":"v5.3.5-b70948b","built":"2023-03-14"}
2023/03/15 07:54:01 info  {"msg":"server initializing"}
2023/03/15 07:54:01 info  {"msg":"server started","grpcAddr":"[::]:8027","httpApiAddr":"[::]:8030","httpOpsAddr":"[::]:8031"}
  • The default binding address and port for the gRPC / http server (bundled webpage demo) can be configured in the transcribe-server config file.

Default Connection

  • The following code snippet connects to the server and queries its version. It connects to the server using an “insecure” gRPC channel. This would be the case if you have just started up a local instance of Transcribe server without TLS enabled.
import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)

# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	transcribepb "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)

func main() {
	const (
		serverAddress  = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := transcribepb.NewTranscribeServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &transcribepb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)
}

Connect with TLS

  • In our recommended setup for deployment, TLS is enabled in the gRPC connection, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers.

  • The following snippets show how to connect to a Transcribe Server that has TLS enabled. They use Cobalt’s self-hosted demo server at demo.cobaltspeech.com:2727, but you would of course use your own server instance.

import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe

serverAddress = "demo.cobaltspeech.com:2727"

# Setup a gRPC connection with TLS. You can optionally provide your own
# root certificates and private key to grpc.ssl_channel_credentials()
# for mutually authenticated TLS.
creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(serverAddress, creds)
client = stub.TranscribeServiceStub(channel)

# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"

	transcribepb "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)

func main() {
	const (
		serverAddress  = "demo.cobaltspeech.com:2727"
		connectTimeout = 10 * time.Second
	)

	// Setup a gRPC connection with TLS. You can optionally provide your own
	// root certificates and private key through tls.Config for mutually
	// authenticated TLS.
	tlsCfg := tls.Config{}
	creds := credentials.NewTLS(&tlsCfg)

	ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(creds),
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := transcribepb.NewTranscribeServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &transcribepb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)
}

Client Authentication

  • In some setups, it may be desirable for the server to also validate the clients connecting to it and only respond to the ones it can verify. If your Transcribe server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.

  • Please note that in the client-authentication mode, the client will still also verify the server’s certificate, and therefore this setup uses mutually authenticated TLS.

  • The following snippets show how to present client certificates when setting up the credentials. These could then be used in the same way as the examples above to connect to a TLS enabled server.

creds = grpc.ssl_channel_credentials(
  root_certificates=root_certificates,  # PEM certificate as byte string
  private_key=private_key,              # PEM client key as byte string 
  certificate_chain=certificate_chain,  # PEM client certificate as byte string
)
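
For example, the PEM data can be read from files on disk before creating the secure channel; the file names below are hypothetical placeholders:

# Hypothetical file paths; adjust to wherever your certificates are stored.
with open("ca.pem", "rb") as f:
    root_certificates = f.read()
with open("client.key", "rb") as f:
    private_key = f.read()
with open("client.pem", "rb") as f:
    certificate_chain = f.read()

creds = grpc.ssl_channel_credentials(
    root_certificates=root_certificates,
    private_key=private_key,
    certificate_chain=certificate_chain,
)
channel = grpc.secure_channel("demo.cobaltspeech.com:2727", creds)
client = stub.TranscribeServiceStub(channel)
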
package main

import (
	// ...

	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"

	// ..
)

func main() {
	// ...

	// Root PEM certificate for validating self-signed server certificate
	var rootCert []byte

	// Client PEM certificate and private key.
	var certPem, keyPem []byte

	caCertPool := x509.NewCertPool()
	if ok := caCertPool.AppendCertsFromPEM(rootCert); !ok {
		fmt.Printf("unable to use given caCert\n")
		os.Exit(1)
	}

	clientCert, err := tls.X509KeyPair(certPem, keyPem)
	if err != nil {
		fmt.Printf("unable to use given client certificate and key: %v\n", err)
		os.Exit(1)
	}

	tlsCfg := tls.Config{
		RootCAs:      caCertPool,
		Certificates: []tls.Certificate{clientCert},
	}

	creds := credentials.NewTLS(&tlsCfg)

	// ...
}

5 - Streaming Recognition

Describes how to stream audio to Transcribe server.
  • The following example shows how to transcribe an audio stream using Transcribe’s StreamingRecognize request. The stream can come from a file on disk or directly from a microphone in real time.

Streaming from an audio file

  • We support several headered file formats including WAV, MP3, FLAC etc. For more details, please see the protocol buffer specification here.

  • The examples below use a WAV file as input to the streaming recognition. We will query the server for available models and use the first model to transcribe the speech.

import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)

# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(transcribe.ListModelsRequest())
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id

# Set the recognition config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = transcribe.RecognitionConfig(
    model_id=modelID,
)

# Open audio file.
audio = open("test.wav", "rb")

# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield transcribe.StreamingRecognizeRequest(config=cfg)
    
    data = audio.read(bufferSize)
    while len(data) > 0:
        yield transcribe.StreamingRecognizeRequest(
          audio=transcribe.RecognitionAudio(data=data),
        )
        data = audio.read(bufferSize)

# We also define a callback function to execute for each response.
# The example below just prints the formatted transcript to stdout.
def processResponse(resp):
    result = resp.result
    hyp = result.alternatives[0]                    # 1-best hypothesis.
    transcript = hyp.transcript_formatted           # Formatted transcript.
    start = hyp.start_time_ms / 1000.0              # Converting to seconds.
    end = start + hyp.duration_ms / 1000.0          # Converting to seconds.
    newLine = "\r" if result.is_partial else "\n\n" # Will not move to new line for partial results.
    print(f"[{start:0.2f}:{end:0.2f}] {transcript}", end=newLine)

# Streaming requests to the server.
for resp in client.StreamingRecognize(stream(cfg, audio)):
    processResponse(resp)
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	transcribe "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := transcribe.NewTranscribeServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &transcribe.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &transcribe.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Selecting the first model.
	cfg := &transcribe.RecognitionConfig{
		ModelId: modelResp.Models[0].Id,
	}

	// Opening audio file.
	audio, err := os.Open("test.wav")
	if err != nil {
		fmt.Printf("failed to open audio file: %v\n", err)
		os.Exit(1)
	}

	defer audio.Close()

	// Starting recognition.
	err = StreamingRecognize(ctx, client, cfg, audio, printTranscript)
	if err != nil {
		fmt.Printf("failed to run streaming recognition: %v\n", err)
		os.Exit(1)
	}
}

// StreamingRecognize wraps the bidirectional streaming API for performing
// speech recognition. It sets up recognition using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// Transcribe server. The buffer size is set by the streamingBufSize constant
// below.
//
// As results are received from Transcribe server, they will be sent to the
// provided handlerFunc.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
//
// This function returns only after all results have been passed to the
// resultHandler.
func StreamingRecognize(
	ctx context.Context,
	client transcribe.TranscribeServiceClient,
	cfg *transcribe.RecognitionConfig,
	audio io.Reader,
	handlerFunc func(*transcribe.StreamingRecognizeResponse),
) error {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingRecognize(ctx)
	if err != nil {
		return err
	}

	// There are two concurrent processes going on. We will create a new
	// goroutine to read audio and stream it to the server.  The current
	// goroutine will receive results from the stream.  Errors could occur in both
	// go routines.  We therefore setup a channel, errCh, to hold these
	// errors. Both go routines are designed to send up to one error, and
	// return immediately. Therefore we use a buffered channel with a
	// capacity of two.
	errCh := make(chan error, 2)

	// start streaming audio in a separate goroutine
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
			// if sendAudio encountered io.EOF, it's only a
			// notification that the stream has closed.  The actual
			// status will be obtained in a subsequent Recv call, in
			// the other goroutine below.  We therefore only forward
			// non-EOF errors.
			errCh <- err
		}

		wg.Done()
	}()

	// Receive results from the stream.
	for {
		in, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			break
		}

		if err != nil {
			errCh <- err
			break
		}

		handlerFunc(in)
	}

	wg.Wait()

	select {
	case err := <-errCh:
		// There may be more than one error in the channel, but it is
		// very likely they are related (e.g. connection reset causing
		// both the send and recv to fail) and we therefore return the
		// first error and discard the other.
		return err
	default:
		return nil
	}
}

// printTranscript is a callback function given to StreamingRecognize method to
// print results that are returned though the gRPC stream.
func printTranscript(resp *transcribe.StreamingRecognizeResponse) {
	if resp.Error != nil {
		fmt.Printf("\n[ERROR] server returned an error: %v\n", resp.Error)
		return
	}

	hyp := resp.Result.Alternatives[0]
	startTime := float32(hyp.StartTimeMs) / 1000.0
	endTime := startTime + float32(hyp.DurationMs)/1000.0

	if resp.Result.IsPartial {
		fmt.Printf("\r[%0.2f:%0.2f] %s", startTime, endTime, hyp.TranscriptFormatted)
	} else {
		fmt.Printf("[%0.2f:%0.2f] %s\n\n", startTime, endTime, hyp.TranscriptFormatted)
	}
}

// sendAudio sends audio to a stream.
func sendAudio(
	stream transcribe.TranscribeService_StreamingRecognizeClient,
	cfg *transcribe.RecognitionConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the recognition config
	if err := stream.Send(&transcribe.StreamingRecognizeRequest{
		Request: &transcribe.StreamingRecognizeRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&transcribe.StreamingRecognizeRequest{
				Request: &transcribe.StreamingRecognizeRequest_Audio{
					Audio: &transcribe.RecognitionAudio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio.  In any case, we need to CloseSend, send the
			// appropriate error to errCh and return from the function
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}
			if err != io.EOF {
				return err
			}
			return nil

		}
	}
}

Streaming from microphone

  • Streaming audio from microphone input requires a reader interface that can provide audio samples recorded from a microphone; typically this requires interaction with system libraries. Another option is to use an external command-line tool like sox to record and pipe audio into the client.

  • The examples below use the latter approach by using the rec command provided with sox to record and stream the audio.

#!/usr/bin/env python3

# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.

import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe
import subprocess

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)

# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(transcribe.ListModelsRequest())
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id

# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = transcribe.RecognitionConfig(
    model_id=modelID,
    audio_format_raw=transcribe.AudioFormatRAW(
        encoding="AUDIO_ENCODING_SIGNED",
        bit_depth=16,
        byte_order="BYTE_ORDER_LITTLE_ENDIAN",
        sample_rate=m.attributes.sample_rate,
        channels=1,
    ),
)

# Open microphone stream using sox's rec command and record
# audio using the config specified above.
cmd = f"rec --no-show-progress -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 -"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout

try:
    _ = audio.read(1024)  # Trying to read some bytes as a sanity check.
except Exception as err:
    print(f"[ERROR] failed to read audio from mic stream: {err}")

print("\n[INFO] recording from microphone ... Press ctrl + c to exit\n")

# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield transcribe.StreamingRecognizeRequest(config=cfg)
    
    data = audio.read(bufferSize)
    while len(data) > 0:
        yield transcribe.StreamingRecognizeRequest(
          audio=transcribe.RecognitionAudio(data=data),
        )
        data = audio.read(bufferSize)

# We also define a callback function to execute for each response.
# The example below just prints the formatted transcript to stdout.
def processResponse(resp):
    result = resp.result
    hyp = result.alternatives[0]                    # 1-best hypothesis.
    transcript = hyp.transcript_formatted           # Formatted transcript.
    start = hyp.start_time_ms / 1000.0              # Converting to seconds.
    end = start + hyp.duration_ms / 1000.0          # Converting to seconds.
    newLine = "\r" if result.is_partial else "\n\n" # Will not move to new line for partial results.
    print(f"[{start:0.2f}:{end:0.2f}] {transcript}", end=newLine)

# Streaming requests to the server.
try:
    for resp in client.StreamingRecognize(stream(cfg, audio)):
        processResponse(resp)
except KeyboardInterrupt:
    # Stop streaming when ctrl + c is pressed.
    pass
except Exception as err:
    print(f"[ERROR] failed to stream audio: {err}")

audio.close()
mic.kill()
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"os/exec"
	"os/signal"
	"strings"
	"sync"
	"syscall"

	"golang.org/x/sync/errgroup"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	transcribe "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := transcribe.NewTranscribeServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &transcribe.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &transcribe.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Selecting first model.
	m := modelResp.Models[0]

	// Setting audio format to be raw 16-bit signed little endian audio samples
	// recorded at the sample rate expected by the model.
	cfg := &transcribe.RecognitionConfig{
		ModelId: m.Id,
		AudioFormat: &transcribe.RecognitionConfig_AudioFormatRaw{
			AudioFormatRaw: &transcribe.AudioFormatRAW{
				Encoding:   transcribe.AudioEncoding_AUDIO_ENCODING_SIGNED,
				SampleRate: m.Attributes.SampleRate,
				BitDepth:   16,
				ByteOrder:  transcribe.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
				Channels:   1,
			},
		},
	}

	// Open microphone stream using sox's rec command and record
	// audio using the config specified above.
	args := fmt.Sprintf("--no-show-progress -t raw -r %d -e signed -b 16 -L -c 1 -", m.Attributes.SampleRate)
	cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)

	audio, err := cmd.StdoutPipe()
	if err != nil {
		fmt.Printf("failed to open microphone stream: %v\n", err)
		os.Exit(1)
	}

	// Starting routines to record from microphone and stream to server
	// using an errgroup.Group that returns if either one encounters an error.
	eg, ctx := errgroup.WithContext(ctx)

	eg.Go(func() error {
		fmt.Printf("\n[INFO] recording from microphone ... Press ctrl + c to exit\n")

		if err := cmd.Run(); err != nil {
			return fmt.Errorf("record from microphone: %w", err)
		}

		return nil
	})

	eg.Go(func() error { return StreamingRecognize(ctx, client, cfg, audio, printTranscript) })

	// Also using a routine to monitor for interrupts.
	eg.Go(func() error {
		const maxInterrupts = 10
		interrupt := make(chan os.Signal, maxInterrupts)
		signal.Notify(interrupt, os.Interrupt, syscall.SIGTERM)

		<-interrupt
		cancel()

		return ctx.Err()
	})

	if err := eg.Wait(); err != nil && !errors.Is(err, ctx.Err()) {
		fmt.Printf("failed to run streaming recognition: %v\n", err)
	}
}

// StreamingRecognize wraps the bidirectional streaming API for performing
// speech recognition. It sets up recognition using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// Transcribe server. The buffer size is set by the streamingBufSize constant
// below.
//
// As results are received from Transcribe server, they will be sent to the
// provided handlerFunc.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
//
// This function returns only after all results have been passed to the
// resultHandler.
func StreamingRecognize(
	ctx context.Context,
	client transcribe.TranscribeServiceClient,
	cfg *transcribe.RecognitionConfig,
	audio io.Reader,
	handlerFunc func(*transcribe.StreamingRecognizeResponse),
) error {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingRecognize(ctx)
	if err != nil {
		return err
	}

	// There are two concurrent processes going on. We will create a new
	// goroutine to read audio and stream it to the server.  The current
	// goroutine will receive results from the stream.  Errors could occur in both
	// go routines.  We therefore setup a channel, errCh, to hold these
	// errors. Both go routines are designed to send up to one error, and
	// return immediately. Therefore we use a buffered channel with a
	// capacity of two.
	errCh := make(chan error, 2)

	// start streaming audio in a separate goroutine
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
			// if sendAudio encountered io.EOF, it's only a
			// notification that the stream has closed.  The actual
			// status will be obtained in a subsequent Recv call, in
			// the other goroutine below.  We therefore only forward
			// non-EOF errors.
			errCh <- err
		}

		wg.Done()
	}()

	// Receive results from the stream.
	for {
		in, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			break
		}

		if err != nil {
			errCh <- err
			break
		}

		handlerFunc(in)
	}

	wg.Wait()

	select {
	case err := <-errCh:
		// There may be more than one error in the channel, but it is
		// very likely they are related (e.g. connection reset causing
		// both the send and recv to fail) and we therefore return the
		// first error and discard the other.
		return err
	default:
		return nil
	}
}

// printTranscript is a callback function given to StreamingRecognize method to
// print results that are returned though the gRPC stream.
func printTranscript(resp *transcribe.StreamingRecognizeResponse) {
	if resp.Error != nil {
		fmt.Printf("\n[ERROR] server returned an error: %v\n", resp.Error)
		return
	}

	hyp := resp.Result.Alternatives[0]
	startTime := float32(hyp.StartTimeMs) / 1000.0
	endTime := startTime + float32(hyp.DurationMs)/1000.0

	if resp.Result.IsPartial {
		fmt.Printf("\r[%0.2f:%0.2f] %s", startTime, endTime, hyp.TranscriptFormatted)
	} else {
		fmt.Printf("[%0.2f:%0.2f] %s\n\n", startTime, endTime, hyp.TranscriptFormatted)
	}
}

// sendAudio sends audio to a stream.
func sendAudio(
	stream transcribe.TranscribeService_StreamingRecognizeClient,
	cfg *transcribe.RecognitionConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the recognition config
	if err := stream.Send(&transcribe.StreamingRecognizeRequest{
		Request: &transcribe.StreamingRecognizeRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&transcribe.StreamingRecognizeRequest{
				Request: &transcribe.StreamingRecognizeRequest_Audio{
					Audio: &transcribe.RecognitionAudio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio.  In any case, we need to CloseSend, send the
			// appropriate error to errCh and return from the function
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}
			if err != io.EOF {
				return err
			}
			return nil

		}
	}
}

6 - Recognition Configurations

Describes how to configure requests to Transcribe server.
  • An in-depth explanation of the methods, data structures and types in the auto-generated SDKs can be found in the API Reference section. The sub-section on the RecognitionConfig object is particularly important here. This page discusses the common combinations of values set in RecognitionConfig sent to the server.

  • First, here’s a quick overview of the fields in RecognitionConfig.

Field | Required | Default | Description
model_id | Yes | - | Unique ID of the model to use.
audio_format_raw | Yes, for raw audio | - | Can be used to specify the details of raw audio samples recorded from a microphone stream, for example.
audio_format_headered | No | UNSPECIFIED | Can be used when audio has a self-describing header such as WAV, FLAC, MP3, OPUS etc. If not set, transcribe-server will try to auto-detect the audio encoding from the header.
selected_audio_channels | No | [0] (mono) | Specifies which channels of a multi-channel audio file are to be transcribed, each as its own individual audio stream.
audio_time_offset_ms | No | 0 | Can be used to indicate that the audio being streamed to the recognizer is offset from the original stream by the provided duration in milliseconds. This offset will be added to all timestamps in results returned by the recognizer.
enable_confusion_network | No | false | Toggles the inclusion of a confusion network, consisting of multiple alternative transcriptions. The specified model must also support confusion networks for this field to be populated.
metadata | No | "" | Can be used to send any custom metadata associated with the audio being sent. The server may record this metadata when processing the request. The server does not use this field for any other purpose.
context | No | nil | Can be used to provide any context information that can aid speech recognition, such as probable phrases or words that may appear in the recognition output, or even out-of-vocabulary words for the model being used. Currently all context information must first be pre-compiled via CompileContext().
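
To make the table concrete, here is a minimal Python sketch that sets a few of these fields (field names as in the table; the values are illustrative assumptions):

# A sketch combining several RecognitionConfig fields from the table above.
cfg = transcribe.RecognitionConfig(
    model_id="1",
    enable_confusion_network=True,  # include a confusion network in the results
    audio_time_offset_ms=5000,      # audio starts 5 s into the original stream
)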

Use cases

Transcribing Headered Files

  • The most basic use case is getting a formatted transcript for a headered audio file such as foo.wav. This would simply need a config such as the following:
{
    "model_id": "1"
}
  • Transcribe will return one or more results depending on partial result frequency, end points in speech, etc., each of which would look like the following:
{
  "error": null,
  "result": {
    "alternatives": [
      {
        "transcript_formatted": "Tomorrow is a new day.",
        "transcript_raw": "TOMORROW IS A NEW DAY",
        "start_time_ms": 180,
        "duration_ms": 1425,
        "confidence": 0.870,
      },
      {
        "transcript_formatted": "Tomorrow is a you day.",
        "transcript_raw": "TOMORROW IS A YOU DAY",
        "start_time_ms": 180,
        "duration_ms": 1425,
        "confidence": 0.130,
      }
      // ...
      // Other alternative hypotheses.
      // ...
    ]
  }
}
  • If some sort of non-fatal error was encountered, then Transcribe will populate the error field. One such case may be sending audio sampled at a lower sample rate than what the model is configured for (e.g. sending 8 kHz audio to a 16 kHz model):
{
  "error": {
    "message": "potential accuracy loss: input sample rate (8000) is lower than required (16000)"
  },
  "result": {
    // ...
    // Results
    // ...
  }
}

Transcribing Raw Audio Stream

  • For transcribing raw audio streams, such as those coming in from a live microphone, the details of the audio samples such as their sampling rate, encoding etc. must be specified in the RecognitionConfig like so:
{
    "model_id": "1",
    "audio_format_raw": {
      "encoding": "SIGNED",
      "bit_depth": 16,
      "byte_order": "LITTLE_ENDIAN",
      "sample_rate": 16000,
      "channels": 1
    }
}
  • For various other encoding formats for raw samples, check AudioFormatRaw in the API specification.

Getting Word-level Details

  • If you need to know the word-level details such as word timestamps, to align subtitles with a video, for example, then you can use the following config to enable those word-level timestamps.
{
    "model_id": "1",
    "enable_word_details": true
}
  • Each alternative hypothesis in the returned results will have a word_details field containing details for both formatted and raw words:
{
  "error": null,
  "result": {
    "alternatives": [
      {
        "transcript_formatted": "Tomorrow is a new day.",
        "transcript_raw": "TOMORROW IS A NEW DAY",
        "start_time_ms": 180,
        "duration_ms": 1425,
        "confidence": 0.870,
        "word_details": {
          "formatted": [
            { "word": "Tomorrow", "confidence": 1.0, "start_time_ms": 180, "duration_ms": 800 },
            { "word": "is", "confidence": 1.0, "start_time_ms": 980, "duration_ms": 120 },
            { "word": "a", "confidence": 1.0, "start_time_ms": 1100, "duration_ms": 120 },
            { "word": "new", "confidence": 0.870, "start_time_ms": 1220, "duration_ms": 210 },
            { "word": "day.", "confidence": 1.0, "start_time_ms": 1450, "duration_ms": 155 },          
          ],
          "raw": [
            { "word": "TOMORROW", "confidence": 1.0, "start_time_ms": 180, "duration_ms": 800 },
            { "word": "IS", "confidence": 1.0, "start_time_ms": 980, "duration_ms": 120 },
            { "word": "A", "confidence": 1.0, "start_time_ms": 1100, "duration_ms": 120 },
            { "word": "NEW", "confidence": 0.870, "start_time_ms": 1220, "duration_ms": 210 },
            { "word": "DAY", "confidence": 1.0, "start_time_ms": 1450, "duration_ms": 155 },          
          ],
        }
      },
      // ...
      // Other alternative hypotheses.
      // ...
    ]
  }
}
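
For example, the formatted word timings can be turned into simple subtitle-style lines with a short Python sketch (assuming the generated field names mirror the JSON above):

def print_word_timings(hyp):
    # Each entry carries the word, its confidence, and timing in milliseconds.
    for w in hyp.word_details.formatted:
        start_s = w.start_time_ms / 1000.0
        end_s = (w.start_time_ms + w.duration_ms) / 1000.0
        print(f"{start_s:6.2f} - {end_s:6.2f}  {w.word}  ({w.confidence:.2f})")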

Getting Confusion Networks

  • For applications that need more than the one-best transcription, the most comprehensive and detailed results are found in the confusion network. Please refer to the in-depth confusion network documentation to see what is included.

  • To enable the confusion network, the following config can be used:

{
    "model_id": "1",
    "enable_confusion_network": true
}
  • The confusion network will be accessible at the cnet field in the results returned:
{
  "error": null,
  "result": {
    "alternatives": [
        {
          "transcript_formatted": "Tomorrow is a new day.",
          "transcript_raw": "TOMORROW IS A NEW DAY",
          "start_time_ms": 180,
          "duration_ms": 1425,
          "confidence": 0.870,
        },
        {
          "transcript_formatted": "Tomorrow is a you day.",
          "transcript_raw": "TOMORROW IS A YOU DAY",
          "start_time_ms": 180,
          "duration_ms": 1425,
          "confidence": 0.130,
        }
        // ...
        // Other alternative hypotheses.
        // ...
    ],
    "cnet": {
      "links": [
        { 
          "start_time_ms": 180,
          "duration_ms": 800,
          "arcs": [
            { "word": "TOMORROW", "confidence": 1.0 }
          ]
        },
        { 
          "start_time_ms": 980,
          "duration_ms": 120,
          "arcs": [
            { "word": "IS", "confidence": 1.0 }
          ]
        },
        { 
          "start_time_ms": 1100,
          "duration_ms": 120,
          "arcs": [
            { "word": "A", "confidence": 1.0 }
          ]
        },
        { 
          "start_time_ms": 1220,
          "duration_ms": 210,
          "arcs": [
            { "word": "NEW", "confidence": 0.870 },
            { "word": "YOU", "confidence": 0.130 }
          ]
        },
        { 
          "start_time_ms": 1450,
          "duration_ms": 155,
          "arcs": [
            { "word": "DAY", "confidence": 1.0 }
          ]
        }        
      ]
    }
  }
}

7 - Recognition Context

Describes how to provide context information for aiding speech recognition.
  • Cobalt Transcribe allows users to send context information with a recognition request which may aid the speech recognition. For example, if you have a list of names that you want to make sure the Transcribe model recognizes correctly, with the correct spelling, then you may provide the list in the form of a RecognitionContext object along with the RecognitionConfig before streaming data.

  • Transcribe models allow different sets of “context tokens” each of which can be paired with a list of words or phrases. For example, a Transcribe model may have a context token for airport names, and you can provide a list of airport names you want to be recognized correctly for this context token. Likewise, models may also be configured with tokens for “contact list names”, “menu items”, “medical jargon” etc.

To ensure that there is no added latency in processing the list of words or phrases during a recognition request, we have an API method called CompileContext() that allows the user to compile the list into a compact, efficient format for passing to the StreamingRecognize() method.

Compiling Recognition Context

  • The following snippet shows an example of how to compile context data and then send it during a recognition request.
import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)

# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)

# Get list of models on the server. 
modelResp = client.ListModels(transcribe.ListModelsRequest())
for model in modelResp.models:
    print(model)


# Select a model ID from the list above. Going with the first model
# in this example. Also printing list of allowed context tokens.
m = modelResp.models[0]
print(f"context tokens =  {m.attributes.context_info.allowed_context_tokens}")

# Let's say this model has an allowed context token called "airport_names" and
# we have a list of airport names that we want to make sure the recognizer gets
# right. We compile the list of names using the CompileContext(), save the compiled
# data and send it back with subsequent recognize requests to customize and improve
# the results.
#
# More typically, general models have a "catch-all" token called "unk:default" which
# can be used to boost the probabilities of any type of word, as well as add words that
# are not in the model's vocabulary.
phrases = ["NARITA", "KUALA LUMPUR INTERNATIONAL", "ISTANBUL ATATURK", "LAGUARDIA"]
token = m.attributes.context_info.allowed_context_tokens[0]  # "unk:default" 

compileReq = transcribe.CompileContextRequest(
    model_id=m.id,
    token=token,
    phrases=[ transcribe.ContextPhrase(text=t) for t in phrases ],
)

# Sending compilation request. 
compiledResp = client.CompileContext(compileReq)

# Saving the compiled result for later use; note this compiled data is only
# compatible with the model whose ID was provided in the CompileContext call
compiledContexts = []
compiledContexts.append(compiledResp.context)

# Set the recognition config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = transcribe.RecognitionConfig(
    model_id=m.id,
    context=transcribe.RecognitionContext(compiled=compiledContexts),
)

# Open audio file.
audio = open("test.wav", "rb")

# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield transcribe.StreamingRecognizeRequest(config=cfg)
    
    data = audio.read(bufferSize)
    while len(data) > 0:
        yield transcribe.StreamingRecognizeRequest(
          audio=transcribe.RecognitionAudio(data=data),
        )
        data = audio.read(bufferSize)

# We also define a callback function to execute for each response.
# The example below just prints the formatted transcript to stdout.
def processResponse(resp):
    result = resp.result
    hyp = result.alternatives[0]                    # 1-best hypothesis.
    transcript = hyp.transcript_formatted           # Formatted transcript.
    start = hyp.start_time_ms / 1000.0              # Converting to seconds.
    end = start + hyp.duration_ms / 1000.0          # Converting to seconds.
    newLine = "\r" if result.is_partial else "\n\n" # Will not move to new line for partial results.
    print(f"[{start:0.2f}:{end:0.2f}] {transcript}", end=newLine)

# Streaming requests to the server.
for resp in client.StreamingRecognize(stream(cfg, audio)):
    processResponse(resp)
// The same example in Go:
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"log"
	"os"
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	transcribe "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := transcribe.NewTranscribeServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &transcribe.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &transcribe.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Select a model ID from the list above. Going with the first model
	// in this example. Also printing list of allowed context tokens.
	m := modelResp.Models[0]
	fmt.Printf("context tokens =  %v\n", m.Attributes.ContextInfo.AllowedContextTokens)

	// Let's say this model has an allowed context token called "airport_names" and
	// we have a list of airport names that we want to make sure the recognizer gets
	// right. We compile the list of names using the CompileContext(), save the compiled
	// data and send it back with subsequent recognize requests to customize and improve
	// the results.
	//
	// More typically, general models have a "catch-all" token called "unk:default" which
	// can be used to boost the probabilities of any type of word, as well as add words that
	// are not in the model's vocabulary.
	phrases := []string{"NARITA", "KUALA LUMPUR INTERNATIONAL", "ISTANBUL ATATURK", "LAGUARDIA"}
	token := m.Attributes.ContextInfo.AllowedContextTokens[0] // "unk:default"

	compileReq := &transcribe.CompileContextRequest{
		ModelId: m.Id,
		Token:   token,
		Phrases: make([]*transcribe.ContextPhrase, 0, len(phrases)),
	}

	for _, t := range phrases {
		compileReq.Phrases = append(compileReq.Phrases, &transcribe.ContextPhrase{
			Text: t,
		})
	}

	// Sending compilation request.
	compiledResp, err := client.CompileContext(ctx, compileReq)
	if err != nil {
		log.Fatal(err)
	}

	// Saving the compiled result for later use; note this compiled data is only
	// compatible with the model whose ID was provided in the CompileContext call
	compiledContexts := []*transcribe.CompiledContext{compiledResp.Context}

	// Set the recognition config. We don't set the audio format and let the
	// server auto-detect the format from the file header.
	cfg := &transcribe.RecognitionConfig{
		ModelId: m.Id,
		Context: &transcribe.RecognitionContext{
			Compiled: compiledContexts,
		},
	}

	// Opening audio file.
	audio, err := os.Open("test.wav")
	if err != nil {
		fmt.Printf("failed to open audio file: %v\n", err)
		os.Exit(1)
	}

	defer audio.Close()

	// Starting recognition.
	err = StreamingRecognize(ctx, client, cfg, audio, printTranscript)
	if err != nil {
		fmt.Printf("failed to run streaming recognition: %v\n", err)
		os.Exit(1)
	}
}

// StreamingRecognize wraps the bidirectional streaming API for performing
// speech recognition. It sets up recognition using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// Transcribe server. The buffer size is set by the streamingBufSize constant
// below.
//
// As results are received from Transcribe server, they will be sent to the
// provided handlerFunc.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
//
// This function returns only after all results have been passed to the
// resultHandler.
func StreamingRecognize(
	ctx context.Context,
	client transcribe.TranscribeServiceClient,
	cfg *transcribe.RecognitionConfig,
	audio io.Reader,
	handlerFunc func(*transcribe.StreamingRecognizeResponse),
) error {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingRecognize(ctx)
	if err != nil {
		return err
	}

	// There are two concurrent processes going on. We will create a new
	// goroutine to read audio and stream it to the server.  This goroutine
	// will receive results from the stream.  Errors could occur in both
	// goroutines.  We therefore set up a channel, errCh, to hold these
	// errors. Both goroutines are designed to send up to one error, and
	// return immediately. Therefore we use a buffered channel with a
	// capacity of two.
	errCh := make(chan error, 2)

	// start streaming audio in a separate goroutine
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
			// if sendAudio encountered io.EOF, it's only a
			// notification that the stream has closed.  The actual
			// status will be obtained in a subsequent Recv call, in
			// the other goroutine below.  We therefore only forward
			// non-EOF errors.
			errCh <- err
		}

		wg.Done()
	}()

	// Receive results from the stream.
	for {
		in, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			break
		}

		if err != nil {
			errCh <- err
			break
		}

		handlerFunc(in)
	}

	wg.Wait()

	select {
	case err := <-errCh:
		// There may be more than one error in the channel, but it is
		// very likely they are related (e.g. connection reset causing
		// both the send and recv to fail) and we therefore return the
		// first error and discard the other.
		return err
	default:
		return nil
	}
}

// printTranscript is a callback function given to StreamingRecognize method to
// print results that are returned though the gRPC stream.
func printTranscript(resp *transcribe.StreamingRecognizeResponse) {
	if resp.Error != nil {
		fmt.Printf("\n[ERROR] server returned an error: %v\n", resp.Error)
		return
	}

	hyp := resp.Result.Alternatives[0]
	startTime := float32(hyp.StartTimeMs) / 1000.0
	endTime := startTime + float32(hyp.DurationMs)/1000.0

	if resp.Result.IsPartial {
		fmt.Printf("\r[%0.2f:%0.2f] %s", startTime, endTime, hyp.TranscriptFormatted)
	} else {
		fmt.Printf("[%0.2f:%0.2f] %s\n\n", startTime, endTime, hyp.TranscriptFormatted)
	}
}

// sendAudio sends audio to a stream.
func sendAudio(
	stream transcribe.TranscribeService_StreamingRecognizeClient,
	cfg *transcribe.RecognitionConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the recognition config
	if err := stream.Send(&transcribe.StreamingRecognizeRequest{
		Request: &transcribe.StreamingRecognizeRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&transcribe.StreamingRecognizeRequest{
				Request: &transcribe.StreamingRecognizeRequest_Audio{
					Audio: &transcribe.RecognitionAudio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio.  In any case, we need to CloseSend and then
			// return the appropriate error (nil for io.EOF).
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}
			if err != io.EOF {
				return err
			}
			return nil

		}
	}
}

8 - Hybrid vs End-to-End Models

Differences between gen-1 and gen-2 models, and choosing the right model type

Cobalt’s Transcribe engine supports two types of models:

  • Hybrid (gen-1) - A Hybrid model consists of a sequence of independent models that, when chained together, can convert audio to words. This type of model no longer produces state-of-the-art accuracy but remains dominant in many commercial ASR applications.
  • End-to-End (gen-2) - An End-to-End (E2E) model is mostly a single large neural network that converts audio directly to text transcripts (or something very close that requires little additional processing).

A hybrid model can be viewed as several different models glued together in a particular sequence to convert audio to text. The cascade of models will (1) convert audio to features based on the amount of energy in different frequency ranges, (2) use a neural network to predict the context-dependent sounds (phones) present in every ~10 milliseconds of audio, (3) convert the context-dependent phones to context-independent phones (ex: the ’th’ sound in ’the’ would be an example phone), (4) convert the sequence of phones into a sequence of words using a lexicon model, a manually curated dictionary of words and the expected sounds/pronunciations for each word, and (5) convert the candidate sequences of words into the most likely sequence of words using a Language Model trained on a large amount of text, which helps to resolve ambiguity such as “WRECK A NICE BEACH” vs “RECOGNIZE SPEECH”.

An End-to-End speech recognition model is fairly straightforward by comparison. Rather than a series of small models, the bulk of the transcription is performed by one large neural network model. Depending on the specific E2E architecture, there may be a small amount of light feature generation on the input side of the E2E neural network and a little bit of processing on the output side to assemble the transcript, but the decoding process is much more straightforward than the hybrid approach and most of the work is performed in one large neural network.

Selecting between Hybrid and E2E Model Types

Advantages of E2E (gen-2) models:

  • High Accuracy - Our E2E models push the state-of-the-art in speech recognition accuracy in a variety of diverse use cases, and typically produce 30-50% fewer errors than gen-1 models. You can take a look at the word error rates of both hybrid and E2E models on several industry-standard test datasets here.
  • Sample Rate Flexibility - All of our E2E models can transcribe both 8khz telephone audio and 16khz audio without any loss in accuracy.
  • Out-of-Vocabulary Word Support - Even words never seen during training can very often be recognized correctly.
  • Parallel Processing - The transcription of a single audio file can be easily run in parallel across multiple CPUs or on a GPU.
  • Easier Training - Training models is more straightforward when adapting or fully re-training a model to be optimized for a particular use case.
  • Low Resource Language Support - Less information about a new language is required to train a model: the phones/sounds, word pronunciations, and vocabulary are not needed. Usually, much less training data is also needed to produce a suitable recognition model.

Advantages of Hybrid (gen-1) models:

  • Low Latency - Our hybrid models can achieve very low latency (<100ms). E2E models can be run with settings that reduce latency, but for those models there is an impact on Word-Error-Rate, and the latency will still not be as low as a hybrid model’s.
  • Efficient Transcription - Each CPU core can transcribe several audio streams at the same time.
  • Easy Customization - New words and/or pronunciations can be added at transcription time. We also offer tools that allow you to re-build models with your own text-only data that allows the Hybrid model to be more accurate on a target domain.
  • More suitable for Embedded Devices - Hybrid models can be trained to have relatively light CPU/memory/storage requirements if they need to run on an embedded device.
  • Constrained Use Cases - If transcription is being deployed in a use case that has a limited vocabulary and/or grammar (not general transcription), the hybrid model can be trained or adapted to target this use case and achieve extremely high accuracy. Examples would be voice command-and-control of a device, or users speaking from a list of commands. Multiple grammars can even be supported and swapped in/out of the recognizer when it is running.
  • Confidence - Per-word confidence estimates are more accurate for Hybrid models than E2E models.
  • Alternate Words - Hybrid models can return rich results beyond a 1-best transcript, containing potential alternate words/sentences for the transcribed audio.
  • Less Compute Required for Training - Training and adapting speech models requires fewer GPU/compute resources.

End-to-End models are likely to be the best choice for customers that are primarily concerned with maximizing accuracy for general transcription. However, the hybrid models may be more appropriate and even more accurate under some conditions: very low latency streaming, low compute/memory embedded transcription, highly custom/unique vocabulary, a very narrow domain (ex: speaking a small number of device directed commands), or vocabulary and expected command sets that change often (even between each audio stream passed as input). By supporting both types of models and offering several different options for model customization, Cobalt is able to satisfy nearly any use case that a customer may require. The support of multiple model types also future-proofs the service by ensuring that if an improved type of speech recognition model becomes available in the future, users of Cobalt Transcribe will be able to start using it with minimal changes to their API integration.

Our gen-2 E2E models currently do not support word-level confidence, confusion network outputs, recognition context, or GPU decoding. However, these features will be added to the E2E models soon.

9 - API Reference

Detailed reference for API requests and types.

The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.

This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.

TranscribeService

Service that implements the Cobalt Transcribe Speech Recognition API.

Version

Version(VersionRequest) VersionResponse

Queries the version of the server.

ListModels

ListModels(ListModelsRequest) ListModelsResponse

Retrieves a list of available speech recognition models.

StreamingRecognize

StreamingRecognize(StreamingRecognizeRequest) StreamingRecognizeResponse

Performs bidirectional streaming speech recognition. Receive results while sending audio. This method is only available via gRPC and not via HTTP+JSON. However, a web browser may use websockets to access this service.

CompileContext

CompileContext(CompileContextRequest) CompileContextResponse

Compiles recognition context information, such as a specialized list of words or phrases, into a compact, efficient form to send with subsequent StreamingRecognize requests to customize speech recognition. For example, a list of contact names may be compiled in a mobile app and sent with each recognition request so that the app user’s contact names are more likely to be recognized than arbitrary names. This pre-compilation ensures that there is no added latency for the recognition request. It is important to note that in order to compile context for a model, that model has to support context in the first place, which can be verified by checking its ModelAttributes.ContextInfo obtained via the ListModels method. Also, the compiled data will be model specific; that is, the data compiled for one model will generally not be usable with a different model.
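
As an illustration, here is a minimal Python sketch of checking context support before compiling; it reuses the transcribe stubs and client from the earlier streaming example, and the model index and phrase are placeholders:

# Assumes `client` is a TranscribeServiceStub created as in the earlier Python example.
model = client.ListModels(transcribe.ListModelsRequest()).models[0]
info = model.attributes.context_info

if info.supports_context:
    compiled = client.CompileContext(transcribe.CompileContextRequest(
        model_id=model.id,
        token=info.allowed_context_tokens[0],
        phrases=[transcribe.ContextPhrase(text="LAGUARDIA")],
    )).context
    # `compiled` can now be reused in a RecognitionContext, but only with this model.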

Messages

  • If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of the fields populated
  • If a field is labeled repeated, then the generated code will accept an array (or slice, or list, depending on the language).
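
For instance, in Python the generated StreamingRecognizeRequest exposes its oneof members as ordinary constructor arguments, and repeated fields accept lists. A minimal sketch follows; cfg, chunk, m and token are placeholders standing in for the values built in the earlier example:

# Exactly one member of the oneof per request: the config first...
first_req = transcribe.StreamingRecognizeRequest(config=cfg)

# ...then audio in every subsequent request.
audio_req = transcribe.StreamingRecognizeRequest(
    audio=transcribe.RecognitionAudio(data=chunk),
)

# A repeated field such as CompileContextRequest.phrases takes a list.
compile_req = transcribe.CompileContextRequest(
    model_id=m.id,
    token=token,
    phrases=[transcribe.ContextPhrase(text=p) for p in ["NARITA", "LAGUARDIA"]],
)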

AudioFormatRAW

Details of audio in raw format

Fields

  • encoding (AudioEncoding ) Encoding of the samples. It must be specified explicitly and using the default value of AUDIO_ENCODING_UNSPECIFIED will result in an error.

  • bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.). This is a required field.

  • byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.

  • sample_rate (uint32 ) Sampling rate in Hz. This is a required field.

  • channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc. This is a required field.
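
A minimal Python sketch of describing raw, headerless 16 kHz mono PCM (the enum value names follow the Enums section below; how enum values are addressed in the generated module may vary slightly):

raw_fmt = transcribe.AudioFormatRAW(
    encoding=transcribe.AUDIO_ENCODING_SIGNED,       # must not be left unspecified
    bit_depth=16,
    byte_order=transcribe.BYTE_ORDER_LITTLE_ENDIAN,  # required when bit_depth > 8
    sample_rate=16000,
    channels=1,
)

cfg = transcribe.RecognitionConfig(
    model_id=m.id,             # model chosen via ListModels, as in the earlier example
    audio_format_raw=raw_fmt,  # one member of the audio_format oneof
)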

CompileContextRequest

The top-level message sent by the client for the CompileContext request. It contains a list of phrases or words, paired with a context token included in the model being used. The token specifies a category such as “menu_item”, “airport”, “contact”, “product_name” etc. The context token is used to determine the places in the recognition output where the provided list of phrases or words may appear. The allowed context tokens for a given model can be found in its ModelAttributes.ContextInfo obtained via the ListModels method.

Fields

  • model_id (string ) Unique identifier of the model to compile the context information for. The model chosen needs to support context which can be verified by checking its ModelAttributes.ContextInfo obtained via ListModels.

  • token (string ) The token that is associated with the provided list of phrases or words (e.g “menu_item”, “airport” etc.). Must be one of the tokens included in the model being used, which can be retrieved by calling the ListModels method.

  • phrases (ContextPhrase repeated) List of phrases and/or words to be compiled.

CompileContextResponse

The message returned to the client by the CompileContext method.

Fields

  • context (CompiledContext ) Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.

CompiledContext

Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.

Fields

  • data (bytes ) The context information compiled by the CompileContext method.

ConfusionNetworkArc

An Arc inside a Confusion Network Link

Fields

  • word (string ) Word in the recognized transcript

  • confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.

  • features (ConfusionNetworkArcFeatures ) Features related to this arc

ConfusionNetworkArcFeatures

Features related to confusion network arcs

Fields

ConfusionNetworkArcFeatures.ConfidenceEntry

Fields

A Link inside a confusion network

Fields

  • start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this link

  • duration_ms (uint64 ) Duration in milliseconds of the current link in the confusion network

  • arcs (ConfusionNetworkArc repeated) Arcs within this link

ContextInfo

Model information specific to supporting recognition context.

Fields

  • supports_context (bool ) If this is set to true, the model supports taking context information into account to aid speech recognition. The information may be sent with recognition requests via RecognitionContext inside RecognitionConfig.

  • allowed_context_tokens (string repeated) A list of tokens (e.g “name”, “airport” etc.) that serve as placeholders in the model where a client provided list of phrases or words may be used to aid speech recognition and produce the exact desired recognition output.

ContextPhrase

A phrase or word that is to be compiled into context information that can be later used to improve speech recognition during a StreamingRecognize call. Along with the phrase or word itself, there is an optional boost parameter that can be used to boost the likelihood of the phrase or word in the recognition output.

Fields

  • text (string ) The actual phrase or word.

  • boost (float ) This is an optional field. The boost factor is a positive number which is used to multiply the probability of the phrase or word appearing in the output. This setting can be used to differentiate between similar sounding words, with the desired word given a bigger boost factor.

    By default, all phrases or words provided in the RecognitionContext are given an equal probability of occurring. Boost factors larger than 1 make the phrase or word more probable and boost factors less than 1 make it less likely. A boost factor of 2 corresponds to making the phrase or word twice as likely, while a boost factor of 0.5 means half as likely.
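
As a small, hypothetical Python example of the boost field (the names are placeholders), the desired spelling of a name can be made twice as likely as an unboosted phrase:

phrases = [
    transcribe.ContextPhrase(text="CATHERINE"),           # default likelihood
    transcribe.ContextPhrase(text="KATHRYN", boost=2.0),  # twice as likely as an unboosted phrase
]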

ListModelsRequest

The top-level message sent by the client for the ListModels method.

ListModelsResponse

The message returned to the client by the ListModels method.

Fields

  • models (Model repeated) List of models available for use that match the request.

Model

Description of a Transcribe Model

Fields

  • id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RecognitionConfig message.

  • name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their recognition task.

  • attributes (ModelAttributes ) Model attributes

ModelAttributes

Attributes of a Transcribe Model

Fields

  • sample_rate (uint32 ) Audio sample rate supported by the model

  • context_info (ContextInfo ) Attributes specific to supporting recognition context.

RecognitionAlternative

A recognition hypothesis

Fields

  • transcript_formatted (string ) Text representing the transcription of the words that the user spoke.

    The transcript will be formatted according to the server’s formatting configuration. If you want the raw transcript, please see the field transcript_raw. If the server is configured to not use any formatting, then this field will contain the raw transcript.

    As an example, if the spoken utterance was “four people”, and the server was configured to format numbers, this field would be set to “4 people”.

  • transcript_raw (string ) Text representing the transcription of the words that the user spoke, without any formatting applied. If you want the formatted transcript, please see the field transcript_formatted.

    As an example, if the spoken utterance was “four people”, this field would be set to “FOUR PEOPLE”.

  • start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this utterance.

  • duration_ms (uint64 ) Duration in milliseconds of the current utterance in the spoken audio.

  • confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct.

  • word_details (WordDetails ) Word-level details corresponding to the transcripts. This is available only if enable_word_details was set to true in the RecognitionConfig.

RecognitionAudio

Audio to be sent to the recognizer

Fields

RecognitionConfig

Configuration for setting up a Recognizer

Fields

  • model_id (string ) Unique identifier of the model to use, as obtained from a Model message.

  • oneof audio_format.audio_format_raw (AudioFormatRAW ) Audio is raw data without any headers

  • oneof audio_format.audio_format_headered (AudioFormatHeadered ) Audio has a self-describing header. Headers are expected to be sent at the beginning of the entire audio file/stream, and not in every RecognitionAudio message.

    The default value of this type is AUDIO_FORMAT_HEADERED_UNSPECIFIED. If this value is used, the server may attempt to detect the format of the audio. However, it is recommended that the exact format be specified.

  • selected_audio_channels (uint32 repeated) This is an optional field. If the audio has multiple channels, this field can be configured with the list of channel indices that should be considered for the recognition task. These channels are 0-indexed.

    Example: [0] for a mono file, [0, 1] for a stereo file. Example: [1] to only transcribe the second channel of a stereo file.

    If this field is not set, all the channels in the audio will be processed.

    Channels that are present in the audio may be omitted, but it is an error to include a channel index in this field that is not present in the audio. Channels may be listed in any order but the same index may not be repeated in this list.

    BAD: [0, 2] for a stereo file; BAD: [0, 0] for a mono file.

  • audio_time_offset_ms (uint64 ) This is an optional field. It can be used to indicate that the audio being streamed to the recognizer is offset from the original stream by the provided duration in milliseconds. This offset will be added to all timestamps in results returned by the recognizer.

    The default value of this field is 0ms, so the timestamps in the recognition result will not be modified.

    Example use case where this field can be helpful: if a recognition session was interrupted and audio needs to be sent to a new session from the point where the session was previously interrupted, the offset could be set to the point where the interruption had happened.

  • enable_word_details (bool ) This is an optional field. If this is set to true, each result will include word level details of the transcript. These details are specified in the WordDetails message. If set to false, no word-level details will be returned. The default is false.

  • enable_confusion_network (bool ) This is an optional field. If this is set to true, each result will include a confusion network. If set to false, no confusion network will be returned. The default is false. If the model being used does not support returning a confusion network, this field will have no effect. Tokens in the confusion network always correspond to tokens in the transcript_raw returned.

  • metadata (RecognitionMetadata ) This is an optional field. If there is any metadata associated with the audio being sent, use this field to provide it to the recognizer. The server may record this metadata when processing the request. The server does not use this field for any other purpose.

  • context (RecognitionContext ) This is an optional field for providing any additional context information that may aid speech recognition. This can also be used to add out-of-vocabulary words to the model or boost recognition of specific proper names or commands. Context information must be pre-compiled via the CompileContext() method.
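
Putting several of these fields together, here is a minimal Python sketch of a RecognitionConfig for a stereo recording where only the second channel should be transcribed and word-level details are wanted; m and compiledContexts are assumed to come from ListModels and CompileContext, as in the earlier example:

cfg = transcribe.RecognitionConfig(
    model_id=m.id,
    # Leaving the audio_format oneof unset lets the server try to detect a
    # self-describing header, as in the earlier example.
    selected_audio_channels=[1],   # 0-indexed: transcribe only the second channel
    enable_word_details=True,      # include per-word WordDetails in each result
    context=transcribe.RecognitionContext(compiled=compiledContexts),
)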

RecognitionConfusionNetwork

Confusion network in recognition output

Fields

RecognitionContext

A collection of additional context information that may aid speech recognition. This can be used to add out-of-vocabulary words to the model or to boost recognition of specific proper names or commands.

Fields

  • compiled (CompiledContext repeated) List of compiled context information, with each entry being compiled from a list of words or phrases using the CompileContext method.

RecognitionError

Developer-facing error message about a non-fatal recognition issue.

Fields

RecognitionMetadata

Metadata associated with the audio to be recognized.

Fields

  • custom_metadata (string ) Any custom metadata that the client wants to associate with the recording. This could be a simple string (e.g. a tracing ID) or structured data (e.g. JSON).

RecognitionResult

A recognition result corresponding to a portion of audio.

Fields

  • alternatives (RecognitionAlternative repeated) An n-best list of recognition hypotheses (alternatives)

  • is_partial (bool ) If this is set to true, it denotes that the result is an interim partial result, and could change after more audio is processed. If unset, or set to false, it denotes that this is a final result and will not change.

    Servers are not required to implement support for returning partial results, and clients should generally not depend on their availability.

  • cnet (RecognitionConfusionNetwork ) If enable_confusion_network was set to true in the RecognitionConfig, and if the model supports it, a confusion network will be available in the results.

  • audio_channel (uint32 ) Channel of the audio file that this result was transcribed from. Channels are 0-indexed, so for mono audio data, this value will always be 0.

StreamingRecognizeRequest

The top-level messages sent by the client for the StreamingRecognize method. In this streaming call, multiple StreamingRecognizeRequest messages should be sent. The first message must contain a RecognitionConfig message only, and all subsequent messages must contain RecognitionAudio only. All RecognitionAudio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.

Fields

StreamingRecognizeResponse

The messages returned by the server for the StreamingRecognize request. Multiple messages of this type will be delivered on the stream, for multiple results, as soon as results are available from the audio submitted so far. If the audio has multiple channels, the results of all channels will be interleaved. Results of each individual channel will be chronological. However, there is no guarantee of the order of results across channels.

Clients should process both the result and error fields in each message. At least one of these fields will be present in the message. If both result and error are present, the result is still valid.

Fields

  • result (RecognitionResult ) A new recognition result. This field will be unset if a new result is not yet available.

  • error (RecognitionError ) A non-fatal error message. If a server encountered a non-fatal error when processing the recognition request, it will be returned in this message. The server will continue to process audio and produce further results. Clients can continue streaming audio even after receiving these messages. This error message is meant to be informational.

    An example of when these errors may be produced: the audio is sampled at a lower rate than the model expects, producing possibly less accurate results.

    This field will be unset if there is no error to report.
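
A minimal Python sketch of handling both fields on each response, building on the streaming loop from the earlier example:

for resp in client.StreamingRecognize(stream(cfg, audio)):
    # A non-fatal error may arrive with, or instead of, a result.
    if resp.HasField("error"):
        print(f"[WARN] {resp.error}")
    if resp.HasField("result") and resp.result.alternatives:
        print(resp.result.alternatives[0].transcript_formatted)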

VersionRequest

The top-level message sent by the client for the Version method.

VersionResponse

The message sent by the server for the Version method.

Fields

  • version (string ) Version of the server handling these requests.

WordDetails

Fields

  • formatted (WordInfo repeated) Word-level information corresponding to the transcript_formatted field.

  • raw (WordInfo repeated) Word-level information corresponding to the transcript_raw field.

WordInfo

Word level details for recognized words in a transcript

Fields

  • word (string ) The actual word in the text

  • confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.

  • start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word.

  • duration_ms (uint64 ) Duration in milliseconds of the current word in the spoken audio.
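
If enable_word_details was set to true in the RecognitionConfig, per-word timing and confidence can be read from each alternative. A minimal Python sketch using the 1-best alternative of a result (resp is a StreamingRecognizeResponse, as in the earlier example):

hyp = resp.result.alternatives[0]
for w in hyp.word_details.formatted:   # or hyp.word_details.raw for unformatted words
    start_s = w.start_time_ms / 1000.0
    end_s = start_s + w.duration_ms / 1000.0
    print(f"{w.word:20s} {w.confidence:0.2f} [{start_s:0.2f}s - {end_s:0.2f}s]")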

Enums

AudioEncoding

The encoding of the audio data to be sent for recognition.

Name Number Description
AUDIO_ENCODING_UNSPECIFIED 0 AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error.
AUDIO_ENCODING_SIGNED 1 PCM signed-integer
AUDIO_ENCODING_UNSIGNED 2 PCM unsigned-integer
AUDIO_ENCODING_IEEE_FLOAT 3 PCM IEEE-Float
AUDIO_ENCODING_ULAW 4 G.711 mu-law
AUDIO_ENCODING_ALAW 5 G.711 a-law

AudioFormatHeadered

Name Number Description
AUDIO_FORMAT_HEADERED_UNSPECIFIED 0 AUDIO_FORMAT_HEADERED_UNSPECIFIED is the default value of this type.
AUDIO_FORMAT_HEADERED_WAV 1 WAV with RIFF headers
AUDIO_FORMAT_HEADERED_MP3 2 MP3 format with a valid frame header at the beginning of data
AUDIO_FORMAT_HEADERED_FLAC 3 FLAC format
AUDIO_FORMAT_HEADERED_OGG_OPUS 4 Opus format with OGG header

ByteOrder

Byte order of multi-byte data

Name Number Description
BYTE_ORDER_UNSPECIFIED 0 BYTE_ORDER_UNSPECIFIED is the default value of this type.
BYTE_ORDER_LITTLE_ENDIAN 1 Little Endian byte order
BYTE_ORDER_BIG_ENDIAN 2 Big Endian byte order

Scalar Value Types

.proto Type C++ Type C# Type Go Type Java Type PHP Type Python Type Ruby Type

double
double double float64 double float float Float

float
float float float32 float float float Float

int32
int32 int int32 int integer int Bignum or Fixnum (as required)

int64
int64 long int64 long integer/string int/long Bignum

uint32
uint32 uint uint32 int integer int/long Bignum or Fixnum (as required)

uint64
uint64 ulong uint64 long integer/string int/long Bignum or Fixnum (as required)

sint32
int32 int int32 int integer int Bignum or Fixnum (as required)

sint64
int64 long int64 long integer/string int/long Bignum

fixed32
uint32 uint uint32 int integer int Bignum or Fixnum (as required)

fixed64
uint64 ulong uint64 long integer/string int/long Bignum

sfixed32
int32 int int32 int integer int Bignum or Fixnum (as required)

sfixed64
int64 long int64 long integer/string int/long Bignum

bool
bool bool bool boolean boolean boolean TrueClass/FalseClass

string
string string string String string str/unicode String (UTF-8)

bytes
string ByteString []byte ByteString string str String (ASCII-8BIT)

10 - FAQ

Answers to Frequently Asked Questions.

System Requirements

Does Cobalt Transcribe run on Linux?

Yes, you can run Cobalt Transcribe on Linux natively or via Docker. Check out the documentation to get started.

Does Cobalt Transcribe run on macOS?

Yes, you can run Cobalt Transcribe on macOS via Docker Desktop for evaluation purposes. Check out the documentation to get started. However, we don’t recommend running Cobalt Transcribe on macOS in production.

Does Cobalt Transcribe run on Windows?

Yes, Windows is supported via Docker Desktop for evaluation. We don’t recommend running Cobalt Transcribe on Windows in production.

Does Cobalt Transcribe run on embedded devices?

Yes, Cobalt Transcribe supports embedded devices such as Raspberry Pi, Tegra etc. However, you’ll probably want to contact us for a smaller model due to memory limitations.

Does Cobalt Transcribe run on Android or iOS?

Android and iOS require a specific implementation strategy. Please contact us for support working with Android or iOS.

What are the technical requirements for a scaled on-premise deployment?

Each containerized instance of Cobalt Transcribe should be provided with 4 cores and 8 GB RAM when used for streaming recognition.

Product Features

Which languages does Cobalt Transcribe support?

Cobalt offers speech recognition in English (US & UK), Spanish, French, German, Russian, Brazilian Portuguese, Korean, Japanese, Swahili, and Cambodian. Please contact sales@cobaltspeech.com to learn more. Cobalt is always looking for partners to develop, sell, and/or market speech technology in other languages.

Can I use Cobalt Transcribe in the field of telephony such as contact centers?

Yes, Cobalt Transcribe offers low-latency 8kHz telephony models for transcribing telephone calls and contact center conversations. Additional insight is retrievable through high-precision timestamps and n-best transcripts. Cobalt technology provides solutions for contact centers including summarization, redaction, and sentiment analysis of conversations.

Can I redact Personally Identifiable Information (PII) from the output transcript?

PII redaction is a separate service that can be integrated with Cobalt Transcribe. Please contact us for details.

Does Cobalt Transcribe support real-time transcription?

Yes, Cobalt Transcribe can accept audio samples as they are recorded and will provide streaming output with relatively low latency. It also supports output of partial results, which are available almost immediately during decoding. This feature is useful in real-time interfaces where users can see what is being recognized nearly immediately as they speak, though some words in the preliminary output may be corrected in the final result as more audio and context become available. Cobalt Transcribe performs automatic endpointing to determine the end of each utterance.
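
For example, reusing the objects from the Python streaming example above (client, cfg, stream and audio), a client that only cares about final transcripts can skip partial results:

# Only act on final results; partial hypotheses may still change.
for resp in client.StreamingRecognize(stream(cfg, audio)):
    if resp.result.is_partial:
        continue
    print(resp.result.alternatives[0].transcript_formatted)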

Recognition Accuracy

How accurate is Cobalt Transcribe?

Cobalt Transcribe is available in two different architectures: Hybrid and End-to-End. We have evaluated the word error rate (WER) of both versions of Cobalt Transcribe on several industry-standard test datasets:

Dataset Domain Hybrid WER End-to-End WER
CommonVoice-test Read Speech, Crowd Sourced 11.5% 5.0%
Librispeech-test Read Speech, Audiobooks, Crowd Sourced 6.2% 2.2%
Tedlium-test Spontaneous Speech, Presentations 7.5% 3.9%
WallStreetJournal-test Read Speech, News 7.4% 5.8%
MultilingualLibriSpeech-test Read Speech, Audiobooks, Crowd Sourced 8.8% 4.0%
OHSU-test Spontaneous Speech, Children’s Speech 16.9% 12.4%

The WER is dependent on a number of factors such as the train-test split, formatting of the decoded transcript, accuracy vs. latency trade-offs, etc. Therefore, these numbers are not directly comparable to the WERs reported by other service providers, even on the same dataset.

How do I further improve audio transcription accuracy?

Our base models are trained on a large amount of audio and text to ensure robust accuracy on a variety of use cases. The configurable nature of Cobalt Transcribe’s models allows for updates that can improve transcription accuracy specific to your use-case:

  • Adding vocabulary and context via the RecognitionContext API: This will help you to capture proper names and domain-specific terminology correctly.

  • End-to-end (E2E) models typically have better accuracy and more robust recognition performance for different accents and dialects. However, E2E models are more computationally expensive, and tend to have higher latency when compared to hybrid models. If you would like to try one out please contact us at sales@cobaltspeech.com

  • Continuous adaptation of acoustic models (AMs) using Cobalt Transcribe Tuner. This continuous learning framework automatically updates the acoustic model using your production data. For more information, contact sales@cobaltspeech.com.

  • Cobalt’s speech scientists can work with you to optimize accuracy for your conditions and application: For speech recognition in a specific acoustic environment or domain-specific use case (e.g. noisy factory floor, airport, surgical lab, patient-doctor conversations, quarterly earnings calls, etc.) we can adapt the acoustic and language models using relevant audio and text data.

I am starting a new speech project. How can I get the best transcription accuracy?

The transcription accuracy depends on several factors such as:

  • Appropriate sampling rate (8kHz / 16kHz) and matching model
  • Audio format: lossless codecs such as WAV or FLAC are preferable to MP3 or Ogg
  • Microphone selection, placement, and directionality (cardioid, omni)
  • Trade-offs between latency and accuracy
  • Consider a constrained grammar or providing recognition context
  • AM and LM adaptation

Recognition Speed/Performance

How do I further improve audio transcription latency?

One way to improve the latency is to make the streaming buffer size smaller. We recommend setting the streaming buffer size between 512 bytes and 4096 bytes. We can also work with you to tune model parameters such as beam search width to reduce the latency. Our speech scientists can also make a smaller model for your application, or tune parameters and model size for optimal latency and accuracy trade-offs. If you’re interested in this, contact sales@cobaltspeech.com.
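
For instance, with the stream() generator from the Python example earlier in this document, a smaller chunk size can simply be passed in:

# Smaller chunks reduce time-to-first-result; 512-4096 bytes is the recommended range.
for resp in client.StreamingRecognize(stream(cfg, audio, bufferSize=512)):
    processResponse(resp)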

How long does it take to transcribe audio?

The processing speed of speech to text conversion is measured by the real time factor (RTF) which is the ratio of time taken to transcribe an audio file to the duration of the audio. Cobalt Transcribe has an RTF of 0.16 and 0.4 using our general purpose hybrid and E2E models, respectively. That means transcribing one hour of speech typically takes approximately 10 minutes for the hybrid model and about 24 minutes for the E2E model.

Are there limits on the number of jobs that can be processed concurrently?

The number of concurrent audio channels depends on the models being used and the CPU. Our general purpose models typically support 6 channels per core for realtime streams when running on a CPU such as a c6i EC2 instance.

What does it cost, in terms of CPU resources, to transcribe a million minutes of speech?

With our standard models, Cobalt Transcribe can run 6 channels per core for realtime speech input, assuming a c6i EC2 processor. A 4-core processor can therefore transcribe 24 minutes of audio per minute of wall-clock time. At current AWS pricing, such an EC2 instance costs about $0.17 per hour. Therefore, the cost for a million minutes of speech is approximately $120.
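
For reference, the arithmetic behind that estimate, using the throughput and price figures quoted above, is roughly:

audio_minutes = 1_000_000
channels_per_core = 6            # realtime streams per core on a c6i instance
cores = 4
price_per_hour = 0.17            # USD per hour for the 4-core instance

audio_minutes_per_wall_minute = channels_per_core * cores          # 24
wall_hours = audio_minutes / audio_minutes_per_wall_minute / 60.0  # ~694 hours
print(f"~${wall_hours * price_per_hour:.0f}")                      # ~$118, i.e. roughly $120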

Costs can be significantly lower when using c7g instances on EC2. Contact us for more information.

How scalable is Cobalt Transcribe? How can I carry out large deployments?

Cobalt Transcribe can scale from large-scale servers down to low-power embedded hardware. For large-scale deployments, you can increase the number of concurrent audio channels for faster decoding. Cobalt Transcribe has the capability to decode using separate threads. Our general purpose models typically support 6 channels per core when running on a CPU such as a c6i EC2 instance. Moreover, you can deploy Cobalt Transcribe via Docker and Kubernetes to automatically scale your resources up (or down) according to demand in a cost-effective manner without causing a decline in performance.

Recognition accuracy vs speed/performance

How do I choose between hybrid and end-to-end models?

End-to-End models are likely to be the best choice for customers that are primarily concerned with maximizing accuracy for general transcription. However, the hybrid models may be more appropriate and even more accurate under some conditions: very low latency streaming, low compute/memory embedded transcription, highly custom/unique vocabulary, a very narrow domain (ex: speaking a small number of device directed commands), vocabulary or expected command sets that can change often (even between each audio stream passed as input). For detailed comparison between hybrid and end-to-end models, you may take a look at Hybrid vs End-to-End Models.

Supported Audio Formats

What type of media files does Cobalt Transcribe support?

Cobalt Transcribe supports common media formats, such as WAV, MP3, FLAC, and Ogg, and audio encodings such as PCM, mu-law, and a-law. Raw audio format is also supported.

My audio source is 48kHz/44.1kHz. Does Cobalt Transcribe support that?

Yes, we resample the audio to an appropriate sampling rate automatically. Please note that our default sampling rates are 16kHz for wideband models and 8kHz for telephony models. Accuracy improvements for higher sampling rates (than 16kHz) are minimal, and not generally worth the associated increase in data rates and data transfer requirements, or the additional overhead for resampling.

API and Integration

How do I test Cobalt Transcribe?

We are happy to offer free trials under a software evaluation license of Cobalt Transcribe with all available features. Typically, our software evaluation licenses are for a period of 30 days. To get started with Cobalt Transcribe, please check the quick start.

Please try our Cobalt Transcribe Speech Recognition Demo for simple evaluation purposes. This demo server is for testing and demonstration purposes only and is not guaranteed to support high availability or high volume.

Which SDKs are available to integrate Cobalt Transcribe into my project?

Cobalt Transcribe uses gRPC to define its APIs. The API is defined as a protobuf schema, and gRPC tools can be used to generate client SDKs in several languages, including Python, Go, C++, Java, C#, etc.

Can I use my own models with Cobalt Transcribe?

Cobalt Transcribe models are trained on thousands of hours of data and produce very accurate transcripts over a wide range of different conditions. We provide tools and services that allow our models to be tailored towards your particular use case if additional accuracy is desired. If customers have their own existing Kaldi or wav2vec 2.0 models, Cobalt Transcribe supports the use of those external models.

Product Comparison

What are the benefits of Cobalt Transcribe over other speech-to-text services?

Compared to other services, Cobalt Transcribe offers the following advantages:

  • You can host the Cobalt Transcribe server on your system locally or in your virtual private cloud. This enables you to keep your data private and secure.
  • Cobalt Transcribe has low latency. It is particularly useful for embedded devices and real-time applications.
  • Cobalt Transcribe is highly customizable. Adapting the language and acoustic models to your specific terminology will improve performance.
  • Cobalt’s experienced speech scientists are available to adapt the LM and AM to your target domain for the best recognition results.
  • The Cobalt Transcribe API supports several outputs: 1-best results, per-word start times and durations, per-word confidences, n-best transcripts, confusion networks, and lattices.
