- 1: Speech Recognition & Transcription
- 1.1: Transcribe
- 1.1.1: Getting Started
- 1.1.2: Running Quick Tests
- 1.1.3: Generating SDKs
- 1.1.4: Connecting to the Server
- 1.1.5: Streaming Recognition
- 1.1.6: Recognition Configurations
- 1.1.7: Recognition Context
- 1.1.8: Hybrid vs End-to-End Models
- 1.1.9: API Reference
- 1.1.10: FAQ
- 2: Voice Intelligence
- 2.1: Privacy Screen
- 2.1.1: Server Setup
- 2.1.2: Connecting to the Server
- 2.1.3: Text Redaction
- 2.1.4: Concurrency
- 2.1.5: Privacy Screen Client
- 2.1.6: Redaction Categories
- 2.1.7: Redaction Languages
- 2.1.8: Prerequisites and System Requirements
- 2.1.9: Proto API Reference
- 2.2: VoiceBio
- 2.2.1: Getting Started
- 2.2.2: Generating SDKs
- 2.2.3: Connecting to the Server
- 2.2.4: Streaming Enrollment
- 2.2.5: Streaming Verification
- 2.2.6: Streaming Identification
- 2.2.7: Comparing Voiceprints
- 2.2.8: Vectorizing Voiceprints
- 2.2.9: API Reference
- 3: Voice User Interfaces
- 3.1: VoiceGen
- 3.1.1: Getting Started
- 3.1.2: Generating SDKs
- 3.1.3: Connecting to the Server
- 3.1.4: Streaming Synthesis
- 3.1.5: API Reference
1.1 - Transcribe
Cobalt’s Transcribe engine is a state-of-the-art speech recognition system. Cobalt Transcribe supports two different DNN architectures:
- Hybrid models combine separately tunable Acoustic Models, Lexicons, and Language Models, making them highly customizable for specific use cases. Hybrid models support extremely low-latency partial results.
- End-to-end models go straight from sounds to words in the same DNN. They tend to be more accurate for general use cases, particularly for systems in which sub-second response time is not required.
Cobalt Transcribe is a highly flexible system that can run on-premise, in your private cloud, or fully embedded on your device. Your data – both the audio and the transcripts – never leave your control.
The SDK is based on a gRPC API. Client code can be easily generated for different languages from the proto definition, including C++, C#, Go, Java, and Python, and support for more languages can be added as required.
Once running, Transcribe’s API provides a method to which you can stream audio. This audio can either be from a microphone or a file. We recommend uncompressed WAV as the encoding, but support other formats such as MP3, ulaw etc.
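If you want to sanity-check an audio file before streaming it, the header can be read with Python's standard `wave` module. This sketch is not part of the Transcribe SDK; it generates a one-second 16 kHz mono file so the example is self-contained (16 kHz mono matches the `en_US-16khz` model naming used later in this guide, which is an assumption about your model, not a server requirement).

```python
import wave

def inspect_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / float(rate)
    return rate, channels, duration

# Write one second of 16 kHz mono silence so the example is self-contained;
# in practice you would point inspect_wav at your own recording.
with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

print(inspect_wav("example.wav"))  # (16000, 1, 1.0)
```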

Transcribe’s API provides a number of options for returning the speech recognition results. The results are passed back using Google’s protobuf library, allowing them to be handled natively by your application. Transcribe can estimate its confidence in the transcription result at the word or utterance level, along with timestamps of the words. Confidence scores are in the range 0-1. Transcribe’s output options are described below.
Automatic Transcription Results
The simplest result that Transcribe returns is its best guess at the transcription of your audio. Transcribe recognizes the audio you are streaming, listens for the end of each utterance, and returns the speech recognition result.
Transcribe maintains its transcriptions in an N-best list, i.e. the top N transcriptions from the recognizer. The best ASR result is the first entry in this list.
Below is an example json representation of Transcribe’s N-best list with utterance-level confidence scores:
{
  "alternatives": [
    {
      "transcript": "TOMORROW IS A NEW DAY",
      "confidence": 0.514
    },
    {
      "transcript": "TOMORROW IS NEW DAY",
      "confidence": 0.201
    },
    {
      "transcript": "TOMORROW IS A <UNK> DAY",
      "confidence": 0.105
    },
    {
      "transcript": "TOMORROW IS ISN'T NEW DAY",
      "confidence": 0.093
    },
    {
      "transcript": "TOMORROW IS A YOUR DAY",
      "confidence": 0.087
    }
  ]
}
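The N-best list can be consumed with ordinary JSON tooling. A minimal sketch in plain Python (no SDK required), using a shortened copy of the example above:

```python
import json

# Shortened copy of the N-best example above.
nbest_json = """
{
  "alternatives": [
    {"transcript": "TOMORROW IS A NEW DAY", "confidence": 0.514},
    {"transcript": "TOMORROW IS NEW DAY", "confidence": 0.201},
    {"transcript": "TOMORROW IS A <UNK> DAY", "confidence": 0.105}
  ]
}
"""

result = json.loads(nbest_json)
alternatives = result["alternatives"]

# The best hypothesis is always the first entry in the N-best list.
best = alternatives[0]
print(best["transcript"])  # TOMORROW IS A NEW DAY

# Keep only hypotheses above a confidence threshold (0.2 here is arbitrary).
confident = [a for a in alternatives if a["confidence"] >= 0.2]
print(len(confident))  # 2
```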
A single stream may consist of multiple utterances separated by silence. Transcribe handles each utterance separately.
For longer utterances, it is often useful to see the partial speech recognition results while the audio is being streamed. For example, this allows you to see what the ASR system is predicting in real-time while someone is speaking. Transcribe supports both partial and final ASR results.
Confusion Network
A Confusion Network is a form of speech recognition output that’s been turned into a compact graph representation of many possible transcriptions.
Note that <eps> in this representation is silence.
Below is an example json representation of this Confusion Network object, with time stamps and word-level confidence scores:
{
  "cnet": {
    "links": [
      {
        "duration": "1.350s",
        "arcs": [
          {"word": "<eps>", "confidence": 1.0}
        ],
        "startTime": "0s"
      },
      {
        "duration": "0.690s",
        "arcs": [
          {"word": "TOMORROW", "confidence": 1.0}
        ],
        "startTime": "1.350s"
      },
      {
        "duration": "0.080s",
        "arcs": [
          {"word": "<eps>", "confidence": 1.0}
        ],
        "startTime": "2.040s"
      },
      {
        "duration": "0.168s",
        "arcs": [
          {"word": "IS", "confidence": 0.892},
          {"word": "<eps>", "confidence": 0.108}
        ],
        "startTime": "2.120s"
      },
      {
        "duration": "0.010s",
        "arcs": [
          {"word": "<eps>", "confidence": 1.0}
        ],
        "startTime": "2.288s"
      },
      {
        "duration": "0.093s",
        "arcs": [
          {"word": "A", "confidence": 0.620},
          {"word": "<eps>", "confidence": 0.233},
          {"word": "ISN'T", "confidence": 0.108},
          {"word": "THE", "confidence": 0.039}
        ],
        "startTime": "2.298s"
      },
      {
        "duration": "0.005s",
        "arcs": [
          {"word": "<eps>", "confidence": 1.0}
        ],
        "startTime": "2.391s"
      },
      {
        "duration": "0.273s",
        "arcs": [
          {"word": "NEW", "confidence": 0.661},
          {"word": "<UNK>", "confidence": 0.129},
          {"word": "YOUR", "confidence": 0.107},
          {"word": "YOU", "confidence": 0.102}
        ],
        "startTime": "2.396s"
      },
      {
        "duration": "0s",
        "arcs": [
          {"word": "<eps>", "confidence": 1.0}
        ],
        "startTime": "2.670s"
      },
      {
        "duration": "0.420s",
        "arcs": [
          {"word": "DAY", "confidence": 0.954},
          {"word": "TODAY", "confidence": 0.044},
          {"word": "<UNK>", "confidence": 0.002}
        ],
        "startTime": "2.670s"
      },
      {
        "duration": "0.270s",
        "arcs": [
          {"word": "<eps>", "confidence": 1.0}
        ],
        "startTime": "3.090s"
      }
    ]
  }
}
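The confusion network can also be traversed programmatically: taking the highest-confidence arc in each link and dropping `<eps>` entries recovers the 1-best transcript. A sketch in plain Python over a condensed copy of the links above (time fields omitted for brevity); this is illustrative, not an SDK call:

```python
# Condensed links from the confusion network example above: each link carries
# competing arcs with confidences; <eps> marks silence (no word).
links = [
    {"arcs": [{"word": "<eps>", "confidence": 1.0}]},
    {"arcs": [{"word": "TOMORROW", "confidence": 1.0}]},
    {"arcs": [{"word": "IS", "confidence": 0.892},
              {"word": "<eps>", "confidence": 0.108}]},
    {"arcs": [{"word": "A", "confidence": 0.620},
              {"word": "<eps>", "confidence": 0.233},
              {"word": "ISN'T", "confidence": 0.108}]},
    {"arcs": [{"word": "NEW", "confidence": 0.661},
              {"word": "<UNK>", "confidence": 0.129}]},
    {"arcs": [{"word": "DAY", "confidence": 0.954},
              {"word": "TODAY", "confidence": 0.044}]},
]

def best_path(links):
    """Pick the top arc in each link and drop <eps> (silence) entries."""
    words = []
    for link in links:
        top = max(link["arcs"], key=lambda a: a["confidence"])
        if top["word"] != "<eps>":
            words.append(top["word"])
    return " ".join(words)

print(best_path(links))  # TOMORROW IS A NEW DAY
```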
Formatted output
Many speech recognition systems output raw words exactly as spoken, without any formatting. Cobalt Transcribe’s customizable formatting suite enables a variety of intelligent formatting options that improve intelligibility:
- Capitalizing the first letter of the utterance
- Numbers: “cobalt’s atomic number is twenty seven” -> “Cobalt’s atomic number is 27”
- TrueCasing: “the iphone was launched in two thousand and seven” -> “The iPhone was launched in 2007”
- Ordinals: “summer solstice is twenty first june” -> “Summer solstice is 21st June”
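As a rough illustration of what such formatting involves, the toy Python fragment below capitalizes the first letter and collapses tens-plus-units number words. This is purely illustrative: Cobalt Transcribe's formatting suite is a separate, customizable component and does not work this way.

```python
# Toy sketch of the kind of rewriting an inverse-text-normalization
# formatter performs. Illustrative only; not Cobalt's formatter.
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}

def format_transcript(text):
    """Capitalize the first letter and collapse tens+units number words."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        if w in TENS and i + 1 < len(words) and words[i + 1] in UNITS:
            out.append(str(TENS[w] + UNITS[words[i + 1]]))  # "twenty seven" -> "27"
            i += 2
        elif w in UNITS:
            out.append(str(UNITS[w]))
            i += 1
        elif w in TENS:
            out.append(str(TENS[w]))
            i += 1
        else:
            out.append(w)
            i += 1
    result = " ".join(out)
    return result[:1].upper() + result[1:]

print(format_transcript("cobalt's atomic number is twenty seven"))
# Cobalt's atomic number is 27
```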
1.1.1 - Getting Started
Using Cobalt Transcribe
- A typical Transcribe release, provided as a compressed archive, will contain a linux binary (`transcribe-server`) for the required native CPU architecture, an appropriate Dockerfile, and models.
- Cobalt Transcribe runs either locally on linux or using Docker.
- Cobalt Transcribe will serve the Transcribe gRPC API on port 2727. A web demo will be enabled on port 8080.
- To quickly try out Transcribe, first start the server as shown below and open the web demo at `http://localhost:8080` in your browser to send live microphone input or upload an audio file for transcription. You can also use the SDK to use Transcribe from within your application or just the command line.
Info
The cobalt.license.key file will be provided separately and must be copied into
the directory resulting from decompressing the archive. Please do this before
running the steps below.
Running Transcribe Server Locally on Linux
./transcribe-server
- By default, the binary assumes the presence of a configuration file, located in the same directory, named `transcribe-server.cfg.toml`. A different config file may be specified using the `--config` argument.
Running Transcribe Server as a Docker Container
To build and run the Docker image for Transcribe, run:
docker build -t cobalt-transcribe .
docker run -p 2727:2727 -p 8080:8080 cobalt-transcribe
How to Get a Copy of the Transcribe Server and Models
Please contact us to find the product release or Transcribe model best suited to your requirements.
The demo release you will receive is a compressed archive (tar.bz2), structured as follows:
release.tar.bz2
├── COPYING
├── README.md
├── transcribe-server
├── transcribe-server.cfg.toml
├── Dockerfile
├── models
│ └── en_US-16khz
├── formatters
│ └── en_US-16khz
│
└── cobalt.license.key [ provided separately, needs to be copied over ]
- The `README.md` file contains information about this release and instructions for how to start the server on your system.
- The `transcribe-server` is the server program, which is configured using the `transcribe-server.cfg.toml` file.
- The `Dockerfile` can be used to create a container that will let you run Transcribe server on non-linux systems such as MacOS and Windows.
- The `models` and `formatters` directories contain your speech recognition and text formatting models. The content of these directories will depend on the models you downloaded.
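Before starting the server, it can be useful to verify that an unpacked release contains everything listed above, including the separately provided license key. The helper below is a hypothetical convenience, not part of the release; the file names are taken from the tree above:

```python
import tempfile
from pathlib import Path

# Entries expected in an unpacked release directory (from the tree above).
REQUIRED = [
    "transcribe-server",
    "transcribe-server.cfg.toml",
    "cobalt.license.key",   # provided separately; copy it in first
    "models",
    "formatters",
]

def missing_release_files(release_dir):
    """Return the list of required entries missing from an unpacked release."""
    root = Path(release_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

# Demonstrate against a temporary directory that mimics a partial unpack:
# only the binary and config are present, so the rest is reported missing.
with tempfile.TemporaryDirectory() as d:
    for name in REQUIRED[:2]:
        (Path(d) / name).touch()
    print(missing_release_files(d))
    # ['cobalt.license.key', 'models', 'formatters']
```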
System Requirements
Cobalt Transcribe runs on Linux. You can run it directly as a linux application, or using Docker.
You can evaluate the product on Windows or Linux using Docker Desktop but we would not recommend this setup for use in a production environment.
A Cobalt Transcribe release typically includes a single Transcribe model together with binaries and config files. The general purpose Transcribe models take up to 4GB disk space, and need a minimum of 4GB RAM when evaluating locally. For production workloads, we recommend configuring containerized applications with each instance allocated with 4 CPUs and 8GB RAM.
Cobalt Transcribe runs on x86_64 CPUs. We also support Arm64 CPUs, including processors such as the Graviton (AWS c7g EC2 instances). Transcribe is significantly more cost effective to run on C7g instances compared to similarly sized Intel or AMD processors, and we can provide you an Arm64 release on request.
To integrate Cobalt Transcribe into your application, please follow the next steps to Generate the SDK in a language of your choice.
1.1.2 - Running Quick Tests
In the release package you received, we have bundled a prebuilt command-line client application that you can use to connect to the server you now have running. You can use this client to send files to the server and save the transcription, either as a stream of text or as JSON with additional information.
Obtaining the Client Application
In the release package of Cobalt Transcribe, you should find a folder called transcribe-client, which contains client binaries for multiple platforms.
transcribe-client/
└── bin
├── darwin_amd64
│ └── transcribe-client
└── linux_amd64
└── transcribe-client
This client is implemented in Go, and the code is available for reference.
If you need to build the client yourself, instead of using the pre-packaged version in the release, you can install it using:
go install github.com/cobaltspeech/examples-go/transcribe/transcribe-client@latest
Running The Client
After you follow the Getting Started instructions, you should have Cobalt Transcribe server running and listening on port 2727 on the local machine.
You can run the transcribe client in various ways:
# Transcribe a file using the default address of the Cobalt Transcribe Server (localhost:2727)
transcribe-client recognize input.wav
# Transcribe a file, pointing to some other server address
transcribe-client recognize input.wav --server host:port
# Transcribe a file using the default server address, and save output as a JSON file.
transcribe-client recognize input.wav --output-json output.json
# (Advanced use): List information about the models available on the server.
transcribe-client list
# (Advanced use): Transcribe a file, get word level timestamps and confidences.
transcribe-client recognize input.wav --output-json output.json --recognition-config '{"enable_word_details": true}'
For more details on the recognition-config struct, please see the API spec.
# Getting usage information
transcribe-client --help
1.1.3 - Generating SDKs
- APIs for all Cobalt’s services are defined as a protocol buffer specification, or simply a `proto` file, and can be found in the `cobaltspeech/proto` github repository.
- The `proto` file allows a developer to auto-generate client SDKs for a number of different programming languages. Step by step instructions for generating your own SDK can be found below.
- We provide pre-generated SDKs for a couple of languages. You can choose to use these instead of generating your own. These are listed here along with instructions on how to install / import them into your projects.
Pre-generated SDKs
Golang
- Pre-generated SDK files for Golang can be found in the `cobaltspeech/go-genproto` repo.
- To use it in your Go project, simply import it:
To use it in your Go project, simply import it:
import transcribepb "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
- An example client using the above repo can be found here.
Python
- Pre-generated SDK files for Python can be found in the `cobaltspeech/py-genproto` repo.
- The Python SDK depends on Python >= 3.5. You may use pip to perform a system-wide install, or use virtualenv for a local install. To use it in your Python project, install it:
pip install --upgrade pip
pip install "git+https://github.com/cobaltspeech/py-genproto"
Generating SDKs
Step 1. Installing buf
- To work with `proto` files, we recommend using `buf`, a user-friendly command line tool that can be configured to generate documentation, schemas, and SDK code for different languages.
# Latest version as of March 14th, 2023.
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/bin"
VERSION="1.15.1"
URL="https://github.com/bufbuild/buf/releases/download/v${VERSION}/buf-$(uname -s)-$(uname -m)"
curl -L ${URL} -o "${COBALT}/bin/buf"
# Give executable permissions and adding to $PATH.
chmod +x "${COBALT}/bin/buf"
export PATH="${PATH}:${COBALT}/bin"

Alternatively, buf can be installed with Homebrew:

brew install bufbuild/buf/buf

Step 2. Getting proto files
- Clone the `cobaltspeech/proto` repository:
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"
# Change this to where you want to clone the repo to.
PROTO_REPO="${COBALT}/git/proto"
git clone https://github.com/cobaltspeech/proto "${PROTO_REPO}"
Step 3. Generating code
- The `cobaltspeech/proto` repo provides a `buf.gen.yaml` config file to get you started with a couple of languages.
- Other plugins can be added to the `buf.gen.yaml` file to generate SDK code for more languages.
- To generate the SDKs, simply run the following (assuming the `buf` binary is in your `$PATH`):
cd "${PROTO_REPO}"
# Removing any previously generated files.
rm -rf ./gen
# Generating code for all proto files inside the `proto` directory.
buf generate proto
- You should now have a folder called `gen` inside `${PROTO_REPO}` that contains the generated code. The latest version of the transcribe API is v5. You can import / include / copy the generated files into your projects as per the conventions of different languages.
gen
├── ... other languages ...
└── py
└── cobaltspeech
├── ... other services ...
└── transcribe
└── v5
├── transcribe_pb2_grpc.py
├── transcribe_pb2.py
└── transcribe_pb2.pyi

gen
├── ... other languages ...
└── go
├── cobaltspeech
│ ├── ...
│ └── transcribe
│ └── v5
│ ├── transcribe_grpc.pb.go
│ └── transcribe.pb.go
└── gw
└── cobaltspeech
├── ...
└── transcribe
└── v5
└── transcribe.pb.gw.go

gen
├── ... other languages ...
└── cpp
└── cobaltspeech
├── ...
└── transcribe
└── v5
├── transcribe.grpc.pb.cc
├── transcribe.grpc.pb.h
├── transcribe.pb.cc
├── transcribe.pb.h
├── transcribe.pb.validate.cc
└── transcribe.pb.validate.h

Step 4. Installing gRPC and protobuf
- A couple of gRPC and protobuf dependencies are required along with the code generated above. The method of installing them depends on the programming language being used.
- These dependencies and the most common ways of installing / including them are listed below for some chosen languages.
Python:
# It is encouraged to do this inside a python virtual environment
# to avoid creating version conflicts for other scripts that may
# be using these libraries.
pip install --upgrade protobuf
pip install --upgrade grpcio
pip install --upgrade google-api-python-client

Go:
go get google.golang.org/protobuf
go get google.golang.org/grpc
go get google.golang.org/genproto

C++:
# More details on grpc installation can be found at:
# https://grpc.io/docs/languages/cpp/quickstart/
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"
# Latest version as of 14th March, 2023.
VERSION="v1.52.0"
GRPC_REPO="${COBALT}/git/grpc-${VERSION}"
git clone \
--recurse-submodules --depth 1 --shallow-submodules \
-b "${VERSION}" \
https://github.com/grpc/grpc ${GRPC_REPO}
cd "${GRPC_REPO}"
mkdir -p cmake/build
# Change this to where you want to install libprotobuf and libgrpc.
# It is encouraged to install gRPC locally as there is no easy way to
# uninstall gRPC after you’ve installed it globally.
INSTALL_DIR="${COBALT}"
cd cmake/build
cmake \
-DgRPC_INSTALL=ON \
-DgRPC_BUILD_TESTS=OFF \
-DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} \
../..
make -j
make install

1.1.4 - Connecting to the Server
- Once you have your Transcribe server up and running, and have generated the SDK for your project, you can connect to a running instance of Transcribe server by “dialing” a gRPC connection.
- First, you need to know the address where the server is running, e.g. `host:grpc_port`. By default, this is `localhost:2727` and should be logged to the terminal when you first start Transcribe server as `grpcAddr`:
2023/03/15 07:54:01 info {"license":"Copyright © 2015--present. Cobalt Speech and Language, Inc. For additional details, including information about open source components used in this software, please see the COPYING file bundled with this program."}
2023/03/15 07:54:01 info {"msg":"reading config file","path":"transcribe-server.cfg.toml"}
2023/03/15 07:54:01 info {"msg":"version","server":"v5.3.5-b70948b","built":"2023-03-14"}
2023/03/15 07:54:01 info {"msg":"server initializing"}
2023/03/15 07:54:01 info {"msg":"server started","grpcAddr":"[::]:8027","httpApiAddr":"[::]:8030","httpOpsAddr":"[::]:8031"}
- The default binding address and port for the gRPC / http server (bundled webpage demo) can be configured in the transcribe-server config file.
Info
If you are hosting your server with Transport Layer Security (TLS) enabled, then please follow the instructions under Connection With TLS. Otherwise, you can follow the instructions for the Default Connection method.

Default Connection
- The following code snippet connects to the server and queries its version. It connects to the server using an “insecure” gRPC channel. This would be the case if you have just started up a local instance of Transcribe server without TLS enabled.
import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)
# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)

package main
import (
"context"
"fmt"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
transcribepb "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := transcribepb.NewTranscribeServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &transcribepb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
}

Connect with TLS
-
In our recommended setup for deployment, TLS is enabled in the gRPC connection, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers.
- The following snippets show how to connect to a Transcribe Server that has TLS enabled. They use Cobalt’s self-hosted demo server at `demo.cobaltspeech.com:2727`, but you should use your own server instance.
Note
Commercial use of the demo server at demo.cobaltspeech.com:2727 is not permitted.
This server is for testing and demonstration purposes only and is not guaranteed to
support high availability or high volume. Data uploaded to the server may be stored
for internal purposes.
import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe
serverAddress = "demo.cobaltspeech.com:2727"
# Setup a gRPC connection with TLS. You can optionally provide your own
# root certificates and private key to grpc.ssl_channel_credentials()
# for mutually authenticated TLS.
creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(serverAddress, creds)
client = stub.TranscribeServiceStub(channel)
# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)

package main
import (
"context"
"crypto/tls"
"fmt"
"os"
"time"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials"
transcribepb "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)
func main() {
const (
serverAddress = "demo.cobaltspeech.com:2727"
connectTimeout = 10 * time.Second
)
// Setup a gRPC connection with TLS. You can optionally provide your own
// root certificates and private key through tls.Config for mutually
// authenticated TLS.
tlsCfg := tls.Config{}
creds := credentials.NewTLS(&tlsCfg)
ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(creds),
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := transcribepb.NewTranscribeServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &transcribepb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
}

Client Authentication
-
In some setups, it may be desired that the server should also validate clients connecting to it and only respond to the ones it can verify. If your Transcribe server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.
-
Please note that in the client-authentication mode, the client will still also verify the server’s certificate, and therefore this setup uses mutually authenticated TLS.
-
The following snippets show how to present client certificates when setting up the credentials. These could then be used in the same way as the examples above to connect to a TLS enabled server.
creds = grpc.ssl_channel_credentials(
root_certificates=root_certificates, # PEM certificate as byte string
private_key=private_key, # PEM client key as byte string
certificate_chain=certificate_chain, # PEM client certificate as byte string
)

package main
import (
// ...
"crypto/tls"
"crypto/x509"
"fmt"
"os"
// ..
)
func main() {
// ...
// Root PEM certificate for validating self-signed server certificate
var rootCert []byte
// Client PEM certificate and private key.
var certPem, keyPem []byte
caCertPool := x509.NewCertPool()
if ok := caCertPool.AppendCertsFromPEM(rootCert); !ok {
fmt.Printf("unable to use given caCert\n")
os.Exit(1)
}
clientCert, err := tls.X509KeyPair(certPem, keyPem)
if err != nil {
fmt.Printf("unable to use given client certificate and key: %v\n", err)
os.Exit(1)
}
tlsCfg := tls.Config{
RootCAs: caCertPool,
Certificates: []tls.Certificate{clientCert},
}
creds := credentials.NewTLS(&tlsCfg)
// ...
}

1.1.5 - Streaming Recognition
- The following example shows how to transcribe an audio stream using Transcribe’s `StreamingRecognize` request. The stream can come from a file on disk or directly from a microphone in real time.
Streaming from an audio file
- We support several headered file formats including WAV, MP3, FLAC etc. For more details, please see the protocol buffer specification here.
- The examples below use a WAV file as input to the streaming recognition. We will query the server for available models and use the first model to transcribe the speech.
import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)
# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(transcribe.ListModelsRequest())
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id
# Set the recognition config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = transcribe.RecognitionConfig(
model_id=modelID,
)
# Open audio file.
audio = open("test.wav", "rb")
# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield transcribe.StreamingRecognizeRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield transcribe.StreamingRecognizeRequest(
audio=transcribe.RecognitionAudio(data=data),
)
data = audio.read(bufferSize)
# We also define a callback function to execute for each response.
# The example below just prints the formatted transcript to stdout.
def processResponse(resp):
result = resp.result
hyp = result.alternatives[0] # 1-best hypothesis.
transcript = hyp.transcript_formatted # Formatted transcript.
start = hyp.start_time_ms / 1000.0 # Converting to seconds.
end = start + hyp.duration_ms / 1000.0 # Converting to seconds.
newLine = "\r" if result.is_partial else "\n\n" # Will not move to new line for partial results.
print(f"[{start:0.2f}:{end:0.2f}] {transcript}", end=newLine)
# Streaming requests to the server.
for resp in client.StreamingRecognize(stream(cfg, audio)):
processResponse(resp)

package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"sync"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
transcribe "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := transcribe.NewTranscribeServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &transcribe.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &transcribe.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Selecting the first model.
cfg := &transcribe.RecognitionConfig{
ModelId: modelResp.Models[0].Id,
}
// Opening audio file.
audio, err := os.Open("test.wav")
if err != nil {
fmt.Printf("failed to open audio file: %v\n", err)
os.Exit(1)
}
defer audio.Close()
// Starting recognition.
err = StreamingRecognize(ctx, client, cfg, audio, printTranscript)
if err != nil {
fmt.Printf("failed to run streaming recognition: %v\n", err)
os.Exit(1)
}
}
// StreamingRecognize wraps the bidirectional streaming API for performing
// speech recognition. It sets up recognition using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// As results are received from Transcribe server, they will be sent to the
// provided handlerFunc.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
//
// This function returns only after all results have been passed to the
// resultHandler.
func StreamingRecognize(
ctx context.Context,
client transcribe.TranscribeServiceClient,
cfg *transcribe.RecognitionConfig,
audio io.Reader,
handlerFunc func(*transcribe.StreamingRecognizeResponse),
) error {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingRecognize(ctx)
if err != nil {
return err
}
// There are two concurrent processes going on. We will create a new
// goroutine to read audio and stream it to the server. This goroutine
// will receive results from the stream. Errors could occur in both
// go routines. We therefore setup a channel, errCh, to hold these
// errors. Both go routines are designed to send up to one error, and
// return immediately. Therefore we use a buffered channel with a
// capacity of two.
errCh := make(chan error, 2)
// start streaming audio in a separate goroutine
var wg sync.WaitGroup
wg.Add(1)
go func() {
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in a subsequent Recv call, in
// the other goroutine below. We therefore only forward
// non-EOF errors.
errCh <- err
}
wg.Done()
}()
// Receive results from the stream.
for {
in, err := stream.Recv()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
errCh <- err
break
}
handlerFunc(in)
}
wg.Wait()
select {
case err := <-errCh:
// There may be more than one error in the channel, but it is
// very likely they are related (e.g. connection reset causing
// both the send and recv to fail) and we therefore return the
// first error and discard the other.
return err
default:
return nil
}
}
// printTranscript is a callback function given to StreamingRecognize method to
// print results that are returned though the gRPC stream.
func printTranscript(resp *transcribe.StreamingRecognizeResponse) {
if resp.Error != nil {
fmt.Printf("\n[ERROR] server returned an error: %v\n", resp.Error)
return
}
hyp := resp.Result.Alternatives[0]
startTime := float32(hyp.StartTimeMs) / 1000.0
endTime := startTime + float32(hyp.DurationMs)/1000.0
if resp.Result.IsPartial {
fmt.Printf("\r[%0.2f:%0.2f] %s", startTime, endTime, hyp.TranscriptFormatted)
} else {
fmt.Printf("[%0.2f:%0.2f] %s\n\n", startTime, endTime, hyp.TranscriptFormatted)
}
}
// sendAudio sends audio to a stream.
func sendAudio(
stream transcribe.TranscribeService_StreamingRecognizeClient,
cfg *transcribe.RecognitionConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the recognition config
if err := stream.Send(&transcribe.StreamingRecognizeRequest{
Request: &transcribe.StreamingRecognizeRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&transcribe.StreamingRecognizeRequest{
Request: &transcribe.StreamingRecognizeRequest_Audio{
Audio: &transcribe.RecognitionAudio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// audio. In any case, we need to CloseSend, send the
// appropriate error to errCh and return from the function
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}
Streaming from microphone

Streaming audio from microphone input requires a reader interface that can provide audio samples recorded from a microphone; typically this requires interaction with system libraries. Another option is to use an external command-line tool such as `sox` to record and pipe audio into the client.

The examples below take the latter approach, using the `rec` command provided with `sox` to record and stream the audio.
#!/usr/bin/env python3
# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.
import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe
import subprocess
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)
# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(transcribe.ListModelsRequest())
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id
# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = transcribe.RecognitionConfig(
model_id=modelID,
audio_format_raw=transcribe.AudioFormatRAW(
encoding="AUDIO_ENCODING_SIGNED",
bit_depth=16,
byte_order="BYTE_ORDER_LITTLE_ENDIAN",
sample_rate=m.attributes.sample_rate,
channels=1,
)
)
# Open microphone stream using sox's rec command and record
# audio using the config specified above.
cmd = f"rec --no-show-progress -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 -"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout
try:
    _ = audio.read(1024) # Trying to read some bytes as a sanity check.
except Exception as err:
print(f"[ERROR] failed to read audio from mic stream: {err}")
print("\n[INFO] recording from microphone ... Press ctrl + c to exit\n")
# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield transcribe.StreamingRecognizeRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield transcribe.StreamingRecognizeRequest(
audio=transcribe.RecognitionAudio(data=data),
)
data = audio.read(bufferSize)
# We also define a callback function to execute for each response.
# The example below just prints the formatted transcript to stdout.
def processResponse(resp):
result = resp.result
hyp = result.alternatives[0] # 1-best hypothesis.
transcript = hyp.transcript_formatted # Formatted transcript.
start = hyp.start_time_ms / 1000.0 # Converting to seconds.
end = start + hyp.duration_ms / 1000.0 # Converting to seconds.
newLine = "\r" if result.is_partial else "\n\n" # Will not move to new line for partial results.
print(f"[{start:0.2f}:{end:0.2f}] {transcript}", end=newLine)
# Streaming requests to the server.
try:
for resp in client.StreamingRecognize(stream(cfg, audio)):
processResponse(resp)
except KeyboardInterrupt:
# Stop streaming when ctrl + c pressed.
pass
except Exception as err:
print(f"[ERROR] failed to stream audio: {err}")
audio.close()
mic.kill()
package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"os/exec"
"os/signal"
"strings"
"sync"
"syscall"
"golang.org/x/sync/errgroup"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
transcribe "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := transcribe.NewTranscribeServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &transcribe.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &transcribe.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Selecting first model.
m := modelResp.Models[0]
// Setting audio format to be raw 16-bit signed little endian audio samples
// recorded at the sample rate expected by the model.
cfg := &transcribe.RecognitionConfig{
ModelId: m.Id,
AudioFormat: &transcribe.RecognitionConfig_AudioFormatRaw{
AudioFormatRaw: &transcribe.AudioFormatRAW{
Encoding: transcribe.AudioEncoding_AUDIO_ENCODING_SIGNED,
SampleRate: m.Attributes.SampleRate,
BitDepth: 16,
ByteOrder: transcribe.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
Channels: 1,
},
},
}
// Open microphone stream using sox's rec command and record
// audio using the config specified above.
args := fmt.Sprintf("--no-show-progress -t raw -r %d -e signed -b 16 -L -c 1 -", m.Attributes.SampleRate)
cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)
audio, err := cmd.StdoutPipe()
if err != nil {
fmt.Printf("failed to open microphone stream: %v\n", err)
os.Exit(1)
}
// Starting routines to record from microphone and stream to server
// using an errgroup.Group that returns if either one encounters an error.
eg, ctx := errgroup.WithContext(ctx)
eg.Go(func() error {
fmt.Printf("\n[INFO] recording from microphone ... Press ctrl + c to exit\n")
if err := cmd.Run(); err != nil {
return fmt.Errorf("record from microphone: %w", err)
}
return nil
})
eg.Go(func() error { return StreamingRecognize(ctx, client, cfg, audio, printTranscript) })
// Also using a routine to monitor for interrupts.
eg.Go(func() error {
const maxInterrupts = 10
interrupt := make(chan os.Signal, maxInterrupts)
signal.Notify(interrupt, os.Interrupt, syscall.SIGTERM)
<-interrupt
cancel()
return ctx.Err()
})
if err := eg.Wait(); err != nil && !errors.Is(err, ctx.Err()) {
fmt.Printf("failed to run streaming recognition: %v\n", err)
}
}
// StreamingRecognize wraps the bidirectional streaming API for performing
// speech recognition. It sets up recognition using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to cubic
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// As results are received from Transcribe server, they will be sent to the
// provided handlerFunc.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
//
// This function returns only after all results have been passed to the
// resultHandler.
func StreamingRecognize(
ctx context.Context,
client transcribe.TranscribeServiceClient,
cfg *transcribe.RecognitionConfig,
audio io.Reader,
handlerFunc func(*transcribe.StreamingRecognizeResponse),
) error {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingRecognize(ctx)
if err != nil {
return err
}
// There are two concurrent processes going on. We will create a new
// goroutine to read audio and stream it to the server. This goroutine
// will receive results from the stream. Errors could occur in both
// go routines. We therefore setup a channel, errCh, to hold these
// errors. Both go routines are designed to send up to one error, and
// return immediately. Therefore we use a buffered channel with a
// capacity of two.
errCh := make(chan error, 2)
// start streaming audio in a separate goroutine
var wg sync.WaitGroup
wg.Add(1)
go func() {
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in a subsequent Recv call, in
// the other goroutine below. We therefore only forward
// non-EOF errors.
errCh <- err
}
wg.Done()
}()
// Receive results from the stream.
for {
in, err := stream.Recv()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
errCh <- err
break
}
handlerFunc(in)
}
wg.Wait()
select {
case err := <-errCh:
// There may be more than one error in the channel, but it is
// very likely they are related (e.g. connection reset causing
// both the send and recv to fail) and we therefore return the
// first error and discard the other.
return err
default:
return nil
}
}
// printTranscript is a callback function given to StreamingRecognize method to
// print results that are returned though the gRPC stream.
func printTranscript(resp *transcribe.StreamingRecognizeResponse) {
if resp.Error != nil {
fmt.Printf("\n[ERROR] server returned an error: %v\n", resp.Error)
return
}
hyp := resp.Result.Alternatives[0]
startTime := float32(hyp.StartTimeMs) / 1000.0
endTime := startTime + float32(hyp.DurationMs)/1000.0
if resp.Result.IsPartial {
fmt.Printf("\r[%0.2f:%0.2f] %s", startTime, endTime, hyp.TranscriptFormatted)
} else {
fmt.Printf("[%0.2f:%0.2f] %s\n\n", startTime, endTime, hyp.TranscriptFormatted)
}
}
// sendAudio sends audio to a stream.
func sendAudio(
stream transcribe.TranscribeService_StreamingRecognizeClient,
cfg *transcribe.RecognitionConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the recognition config
if err := stream.Send(&transcribe.StreamingRecognizeRequest{
Request: &transcribe.StreamingRecognizeRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&transcribe.StreamingRecognizeRequest{
Request: &transcribe.StreamingRecognizeRequest_Audio{
Audio: &transcribe.RecognitionAudio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// audio. In any case, we need to CloseSend, send the
// appropriate error to errCh and return from the function
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}
1.1.6 - Recognition Configurations
An in-depth explanation of the methods, data structures and types in the auto-generated SDKs can be found in the API Reference section. The sub-section on the `RecognitionConfig` object is particularly important here. This page discusses the common combinations of values set in the `RecognitionConfig` sent to the server.

First, here’s a quick overview of the fields in `RecognitionConfig`.
| Field | Required | Default | Description |
|---|---|---|---|
| model_id | Yes | - | Unique ID of the model to use. |
| audio_format_raw | Yes for raw audio | - | Can be used to specify the details of raw audio samples recorded from a microphone stream, for example. |
| audio_format_headered | No | UNSPECIFIED | Can be used when audio has a self-describing header such as WAV, FLAC, MP3, OPUS etc. If not set, transcribe-server will try to auto-detect the audio encoding from the header. |
| selected_audio_channels | No | [0] (mono) | Specifies which channels of a multi-channel audio file should be transcribed, each as its own individual audio stream. |
| audio_time_offset_ms | No | 0 | Can be used to indicate that the audio being streamed to the recognizer is offset from the original stream by the provided duration in milliseconds. This offset will be added to all timestamps in results returned by the recognizer. |
| enable_confusion_network | No | false | Toggles the inclusion of a confusion network, consisting of multiple alternative transcriptions. The specified model must also support confusion networks for this field to be populated. |
| enable_word_details | No | false | Toggles the inclusion of word-level details, such as per-word timestamps and confidence values, in each hypothesis. |
| metadata | No | "" | Can be used to send any custom metadata associated with the audio being sent. The server may record this metadata when processing the request. The server does not use this field for any other purpose. |
| context | No | nil | Can be used to provide any context information that can aid speech recognition, such as probable phrases or words that may appear in the recognition output, or even out-of-vocabulary words for the model being used. Currently, all context information must first be pre-compiled via the CompileContext() method. |
Use cases
Transcribing Headered Files
- The most basic use case is getting a formatted transcript for a headered audio file such as `foo.wav`. This simply needs a config such as the following:
{
    "model_id": "1"
}
- Transcribe will return one or more results, depending on partial result frequency, endpoints in speech, etc., each of which would look like the following:
{
"error": null,
"result": {
"alternatives": [
{
"transcript_formatted": "Tomorrow is a new day.",
"transcript_raw": "TOMORROW IS A NEW DAY",
"start_time_ms": 180,
"duration_ms": 1425,
"confidence": 0.870,
},
{
"transcript_formatted": "Tomorrow is a you day.",
"transcript_raw": "TOMORROW IS A YOU DAY",
"start_time_ms": 180,
"duration_ms": 1425,
"confidence": 0.130,
}
// ...
// Other alternative hypotheses.
// ...
]
}
}
- If some sort of non-fatal error was encountered, Transcribe will populate the `error` field. One such case may be sending audio sampled at a lower sample rate than what the model is configured for (e.g. sending 8 kHz audio to a 16 kHz model):
{
"error": {
"message": "potential accuracy loss: input sample rate (8000) is lower than required (16000)"
},
"result": {
// ...
// Results
// ...
},
}
Transcribing Raw Audio Stream
- For transcribing raw audio streams, such as those coming in from a live microphone, the details of the audio samples, such as their sampling rate and encoding, must be specified in the `RecognitionConfig` like so:
{
    "model_id": "1",
    "audio_format_raw": {
        "encoding": "SIGNED",
        "bit_depth": 16,
        "byte_order": "LITTLE_ENDIAN",
        "sample_rate": 16000,
        "channels": 1
    }
}
- For various other encoding formats for raw samples, check AudioFormatRaw in the API specification.
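When configuring raw audio, it can be useful to sanity-check buffer sizes against the byte rate implied by the format. The helper below is not part of the SDK; it is plain arithmetic over the same parameters that appear in `audio_format_raw`:

```python
def raw_byte_rate(sample_rate, bit_depth, channels):
    """Bytes per second of a raw PCM stream with the given format."""
    return sample_rate * (bit_depth // 8) * channels

# 16 kHz, 16-bit, mono PCM.
rate = raw_byte_rate(16000, 16, 1)
print(rate)  # 32000
```

At this rate, the 1024-byte buffer used in the streaming examples above holds roughly 32 ms of audio per send.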
Getting Word-level Details
- If you need word-level details such as word timestamps (to align subtitles with a video, for example), you can use the following config to enable them:
{
"model_id": "1",
"enable_word_details": true
}
- Each alternative hypothesis in the returned results will have a `word_details` field containing details for both formatted and raw words:
{
"error": null,
"result": {
"alternatives": [
{
"transcript_formatted": "Tomorrow is a new day.",
"transcript_raw": "TOMORROW IS A NEW DAY",
"start_time_ms": 180,
"duration_ms": 1425,
"confidence": 0.870,
"word_details": {
"formatted": [
{ "word": "Tomorrow", "confidence": 1.0, "start_time_ms": 180, "duration_ms": 800 },
{ "word": "is", "confidence": 1.0, "start_time_ms": 980, "duration_ms": 120 },
{ "word": "a", "confidence": 1.0, "start_time_ms": 1100, "duration_ms": 120 },
{ "word": "new", "confidence": 0.870, "start_time_ms": 1220, "duration_ms": 210 },
{ "word": "day.", "confidence": 1.0, "start_time_ms": 1450, "duration_ms": 155 },
],
"raw": [
{ "word": "TOMORROW", "confidence": 1.0, "start_time_ms": 180, "duration_ms": 800 },
{ "word": "IS", "confidence": 1.0, "start_time_ms": 980, "duration_ms": 120 },
{ "word": "A", "confidence": 1.0, "start_time_ms": 1100, "duration_ms": 120 },
{ "word": "NEW", "confidence": 0.870, "start_time_ms": 1220, "duration_ms": 210 },
{ "word": "DAY", "confidence": 1.0, "start_time_ms": 1450, "duration_ms": 155 },
],
}
},
// ...
// Other alternative hypotheses.
// ...
]
}
}
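As an example of consuming `word_details`, the sketch below converts the formatted words of a single hypothesis into an SRT subtitle cue. It operates on plain dictionaries that mirror the JSON above rather than on the proto objects, and the one-cue-per-hypothesis grouping is only an illustration:

```python
def srt_timestamp(ms):
    """Format a duration in milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, words):
    """Build one SRT cue spanning a list of formatted word dicts."""
    start = words[0]["start_time_ms"]
    last = words[-1]
    end = last["start_time_ms"] + last["duration_ms"]
    text = " ".join(w["word"] for w in words)
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

# Word details mirroring the "formatted" list in the response above.
formatted = [
    {"word": "Tomorrow", "start_time_ms": 180, "duration_ms": 800},
    {"word": "is", "start_time_ms": 980, "duration_ms": 120},
    {"word": "a", "start_time_ms": 1100, "duration_ms": 120},
    {"word": "new", "start_time_ms": 1220, "duration_ms": 210},
    {"word": "day.", "start_time_ms": 1450, "duration_ms": 155},
]
print(srt_cue(1, formatted))
```

A real subtitle generator would typically split long hypotheses into multiple cues; the timestamp arithmetic stays the same.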
Getting Confusion Networks
- For applications that need more than the one-best transcription, the most comprehensive and detailed results are found in the confusion network. Please refer to the in-depth confusion network documentation to see what is included.
- To enable the confusion network, the following config can be used:
{
"model_id": "1",
"enable_confusion_network": true
}
- The confusion network will be accessible in the `cnet` field of the returned results:
{
"error": null,
"result": {
"alternatives": [
{
"transcript_formatted": "Tomorrow is a new day.",
"transcript_raw": "TOMORROW IS A NEW DAY",
"start_time_ms": 180,
"duration_ms": 1425,
"confidence": 0.870,
},
{
"transcript_formatted": "Tomorrow is a you day.",
"transcript_raw": "TOMORROW IS A YOU DAY",
"start_time_ms": 180,
"duration_ms": 1425,
"confidence": 0.130,
}
// ...
// Other alternative hypotheses.
// ...
],
"cnet": {
"links": [
{
"start_time_ms": 180,
"duration_ms": 800,
"arcs": [
{ "word": "TOMORROW", "confidence": 1.0 }
]
},
{
"start_time_ms": 980,
"duration_ms": 120,
"arcs": [
{ "word": "IS", "confidence": 1.0 }
]
},
{
"start_time_ms": 1100,
"duration_ms": 120,
"arcs": [
{ "word": "A", "confidence": 1.0 }
]
},
{
"start_time_ms": 1220,
"duration_ms": 210,
"arcs": [
{ "word": "NEW", "confidence": 0.870 },
{ "word": "YOU", "confidence": 0.130 }
]
},
{
"start_time_ms": 1450,
"duration_ms": 155,
"arcs": [
{ "word": "DAY", "confidence": 1.0 }
]
}
]
}
}
}
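One simple way to consume the confusion network is to keep the highest-confidence arc in each link, which recovers the one-best raw transcript. The sketch below operates on plain dictionaries mirroring the `cnet` JSON above; it is an illustration, not an SDK helper:

```python
def best_path(cnet):
    """Pick the highest-confidence word from each link of a confusion network."""
    words = []
    for link in cnet["links"]:
        best = max(link["arcs"], key=lambda arc: arc["confidence"])
        words.append(best["word"])
    return " ".join(words)

# Confusion network mirroring the "cnet" field in the response above.
cnet = {
    "links": [
        {"arcs": [{"word": "TOMORROW", "confidence": 1.0}]},
        {"arcs": [{"word": "IS", "confidence": 1.0}]},
        {"arcs": [{"word": "A", "confidence": 1.0}]},
        {"arcs": [{"word": "NEW", "confidence": 0.870},
                  {"word": "YOU", "confidence": 0.130}]},
        {"arcs": [{"word": "DAY", "confidence": 1.0}]},
    ]
}
print(best_path(cnet))  # TOMORROW IS A NEW DAY
```

Depending on the model, links may also contain silence or null arcs, which a real consumer would likely filter out before joining the words.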
1.1.7 - Recognition Context
Cobalt Transcribe allows users to send context information with a recognition request, which may aid the speech recognition. For example, if you have a list of names that you want the Transcribe model to recognize correctly, with the correct spelling, you may provide the list in the form of a `RecognitionContext` object along with the `RecognitionConfig` before streaming data.

Transcribe models allow different sets of “context tokens”, each of which can be paired with a list of words or phrases. For example, a Transcribe model may have a context token for airport names, and you can provide a list of airport names you want to be recognized correctly for this context token. Likewise, models may also be configured with tokens for “contact list names”, “menu items”, “medical jargon” etc.
To ensure that there is no added latency in processing the list of words or phrases during a recognition request, there is an API method called `CompileContext()` that allows the user to compile the list into a compact, efficient format for passing to the `StreamingRecognize()` method.
Compiling Recognition Context
- The following snippet shows an example of how to compile context data and then send it during a recognition request.
import grpc
import cobaltspeech.transcribe.v5.transcribe_pb2_grpc as stub
import cobaltspeech.transcribe.v5.transcribe_pb2 as transcribe
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.TranscribeServiceStub(channel)
# Get server version.
versionResp = client.Version(transcribe.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(transcribe.ListModelsRequest())
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example. Also printing list of allowed context tokens.
m = modelResp.models[0]
print(f"context tokens = {m.attributes.context_info.allowed_context_tokens}")
# Let's say this model has an allowed context token called "airport_names" and
# we have a list of airport names that we want to make sure the recognizer gets
# right. We compile the list of names using the CompileContext(), save the compiled
# data and send it back with subsequent recognize requests to customize and improve
# the results.
#
# More typically, general models have a "catch-all" token called "unk:default" which
# can be used to boost the probabilities of any type of word, as well as add words that
# are not in the model's vocabulary.
phrases = ["NARITA", "KUALA LUMPUR INTERNATIONAL", "ISTANBUL ATATURK", "LAGUARDIA"]
token = m.attributes.context_info.allowed_context_tokens[0] # "unk:default"
compileReq = transcribe.CompileContextRequest(
model_id=m.id,
token=token,
phrases=[ transcribe.ContextPhrase(text=t) for t in phrases ],
)
# Sending compilation request.
compiledResp = client.CompileContext(compileReq)
# Saving the compiled result for later use; note this compiled data is only
# compatible with the model whose ID was provided in the CompileContext call
compiledContexts = []
compiledContexts.append(compiledResp.context)
# Set the recognition config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = transcribe.RecognitionConfig(
model_id=m.id,
context=transcribe.RecognitionContext(compiled=compiledContexts),
)
# Open audio file.
audio = open("test.wav", "rb")
# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield transcribe.StreamingRecognizeRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield transcribe.StreamingRecognizeRequest(
audio=transcribe.RecognitionAudio(data=data),
)
data = audio.read(bufferSize)
# We also define a callback function to execute for each response.
# The example below just prints the formatted transcript to stdout.
def processResponse(resp):
result = resp.result
hyp = result.alternatives[0] # 1-best hypothesis.
transcript = hyp.transcript_formatted # Formatted transcript.
start = hyp.start_time_ms / 1000.0 # Converting to seconds.
end = start + hyp.duration_ms / 1000.0 # Converting to seconds.
newLine = "\r" if result.is_partial else "\n\n" # Will not move to new line for partial results.
print(f"[{start:0.2f}:{end:0.2f}] {transcript}", end=newLine)
# Streaming requests to the server.
for resp in client.StreamingRecognize(stream(cfg, audio)):
    processResponse(resp)
package main
import (
"context"
"errors"
"fmt"
"io"
"log"
"os"
"sync"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
transcribe "github.com/cobaltspeech/go-genproto/cobaltspeech/transcribe/v5"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := transcribe.NewTranscribeServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &transcribe.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &transcribe.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Select a model ID from the list above. Going with the first model
// in this example. Also printing list of allowed context tokens.
m := modelResp.Models[0]
fmt.Printf("context tokens = %v\n", m.Attributes.ContextInfo.AllowedContextTokens)
// Let's say this model has an allowed context token called "airport_names" and
// we have a list of airport names that we want to make sure the recognizer gets
// right. We compile the list of names using the CompileContext(), save the compiled
// data and send it back with subsequent recognize requests to customize and improve
// the results.
//
// More typically, general models have a "catch-all" token called "unk:default" which
// can be used to boost the probabilities of any type of word, as well as add words that
// are not in the model's vocabulary.
phrases := []string{"NARITA", "KUALA LUMPUR INTERNATIONAL", "ISTANBUL ATATURK", "LAGUARDIA"}
token := m.Attributes.ContextInfo.AllowedContextTokens[0] // "unk:default"
compileReq := &transcribe.CompileContextRequest{
ModelId: m.Id,
Token: token,
Phrases: make([]*transcribe.ContextPhrase, 0, len(phrases)),
}
for _, t := range phrases {
compileReq.Phrases = append(compileReq.Phrases, &transcribe.ContextPhrase{
Text: t,
})
}
// Sending compilation request.
compiledResp, err := client.CompileContext(context.Background(), compileReq)
if err != nil {
log.Fatal(err)
}
// Saving the compiled result for later use; note this compiled data is only
// compatible with the model whose ID was provided in the CompileContext call
compiledContexts := []*transcribe.CompiledContext{compiledResp.Context}
// Set the recognition config. We don't set the audio format and let the
// server auto-detect the format from the file header.
cfg := &transcribe.RecognitionConfig{
ModelId: m.Id,
Context: &transcribe.RecognitionContext{
Compiled: compiledContexts,
},
}
// Opening audio file.
audio, err := os.Open("test.wav")
if err != nil {
fmt.Printf("failed to open audio file: %v\n", err)
os.Exit(1)
}
defer audio.Close()
// Starting recognition.
err = StreamingRecognize(ctx, client, cfg, audio, printTranscript)
if err != nil {
fmt.Printf("failed to run streaming recognition: %v\n", err)
os.Exit(1)
}
}
// StreamingRecognize wraps the bidirectional streaming API for performing
// speech recognition. It sets up recognition using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to cubic
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// As results are received from Transcribe server, they will be sent to the
// provided handlerFunc.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
//
// This function returns only after all results have been passed to the
// resultHandler.
func StreamingRecognize(
ctx context.Context,
client transcribe.TranscribeServiceClient,
cfg *transcribe.RecognitionConfig,
audio io.Reader,
handlerFunc func(*transcribe.StreamingRecognizeResponse),
) error {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingRecognize(ctx)
if err != nil {
return err
}
// There are two concurrent processes going on. We will create a new
// goroutine to read audio and stream it to the server. This goroutine
// will receive results from the stream. Errors could occur in both
// go routines. We therefore setup a channel, errCh, to hold these
// errors. Both go routines are designed to send up to one error, and
// return immediately. Therefore we use a buffered channel with a
// capacity of two.
errCh := make(chan error, 2)
// start streaming audio in a separate goroutine
var wg sync.WaitGroup
wg.Add(1)
go func() {
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in a subsequent Recv call, in
// the other goroutine below. We therefore only forward
// non-EOF errors.
errCh <- err
}
wg.Done()
}()
// Receive results from the stream.
for {
in, err := stream.Recv()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
errCh <- err
break
}
handlerFunc(in)
}
wg.Wait()
select {
case err := <-errCh:
// There may be more than one error in the channel, but it is
// very likely they are related (e.g. connection reset causing
// both the send and recv to fail) and we therefore return the
// first error and discard the other.
return err
default:
return nil
}
}
// printTranscript is a callback function given to StreamingRecognize method to
// print results that are returned though the gRPC stream.
func printTranscript(resp *transcribe.StreamingRecognizeResponse) {
if resp.Error != nil {
fmt.Printf("\n[ERROR] server returned an error: %v\n", resp.Error)
return
}
hyp := resp.Result.Alternatives[0]
startTime := float32(hyp.StartTimeMs) / 1000.0
endTime := startTime + float32(hyp.DurationMs)/1000.0
if resp.Result.IsPartial {
fmt.Printf("\r[%0.2f:%0.2f] %s", startTime, endTime, hyp.TranscriptFormatted)
} else {
fmt.Printf("[%0.2f:%0.2f] %s\n\n", startTime, endTime, hyp.TranscriptFormatted)
}
}
// sendAudio sends audio to a stream.
func sendAudio(
stream transcribe.TranscribeService_StreamingRecognizeClient,
cfg *transcribe.RecognitionConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the recognition config
if err := stream.Send(&transcribe.StreamingRecognizeRequest{
Request: &transcribe.StreamingRecognizeRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&transcribe.StreamingRecognizeRequest{
Request: &transcribe.StreamingRecognizeRequest_Audio{
Audio: &transcribe.RecognitionAudio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// audio. In any case, we need to CloseSend, send the
// appropriate error to errCh and return from the function
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}
1.1.8 - Hybrid vs End-to-End Models
Cobalt’s Transcribe engine supports two types of models:
- Hybrid (gen-1) - A Hybrid model consists of a sequence of independent models that, when chained together, can convert audio to words. This type of model no longer produces state-of-the-art accuracy but remains dominant in many commercial ASR applications.
- End-to-End (gen-2) - An End-to-End (E2E) model is mostly a single large neural network that can convert audio directly to text transcripts (or something very close that requires little additional processing).
A hybrid model can be viewed as several different models glued together in a particular sequence to convert audio to text. The cascade of models will (1) convert audio to features based on the amount of energy in different frequency ranges, (2) use a neural network to predict the context-dependent sounds (phones) present in every ~10 milliseconds of audio, (3) convert the context-dependent phones to context-independent phones (e.g. the ’th’ sound in ’the’ is one phone), (4) convert the sequence of phones into a sequence of words using a lexicon model, a manually curated dictionary of words and the expected sounds/pronunciations for each word, and (5) convert the candidate sequences of words into the most likely sequence of words using a Language Model trained on a large amount of text, which helps to resolve ambiguity like “WRECK A NICE BEACH” vs “RECOGNIZE SPEECH”.
An End-to-End speech recognition model is fairly straightforward by comparison. Rather than a series of small models, the bulk of the transcription is performed by one large neural network model. Depending on the specific E2E architecture, there may be a small amount of light feature generation on the input side of the E2E neural network and a little bit of processing on the output side to put together the transcript, but the decoding process is much more straightforward than the hybrid approach and most of the work is performed in one large neural network.
Selecting between Hybrid and E2E Model Types
Advantages of E2E (gen-2) models:
- High Accuracy - Our E2E models push the state-of-the-art in speech recognition accuracy in a variety of diverse use cases, and typically produce 30-50% fewer errors than gen-1 models. You can take a look at the word error rates of both hybrid and E2E models on several industry-standard test datasets here.
- Sample Rate Flexibility - All of our E2E models can transcribe both 8 kHz telephone audio and 16 kHz audio without any loss in accuracy.
- Out-of-Vocabulary Word Support - Even words never seen during training can very often be recognized correctly.
- Parallel Processing - The transcription of a single audio file can be easily run in parallel across multiple CPUs or on a GPU.
- Easier Training - Training models is more straightforward (when adapting or fully re-training a model for a particular use case).
- Low Resource Language Support - Less information about a new language is required to train a model for it. The phones/sounds, pronunciations of words, and vocabulary are not required. Usually, much less training data is also needed to produce a suitable recognition model.
Advantages of Hybrid (gen-1) models:
- Low Latency - Our hybrid models can achieve very low latency (<100 ms). E2E models can be run with settings that reduce latency, but for those models this comes at a cost in word error rate, and the latency will still not be as low as the hybrid model's.
- Efficient Transcription - Each CPU core can transcribe several audio streams at the same time.
- Easy Customization - New words and/or pronunciations can be added at transcription time. We also offer tools that let you re-build models with your own text-only data, making the Hybrid model more accurate on a target domain.
- More Suitable for Embedded Devices - Hybrid models can be trained to have relatively light CPU/memory/storage requirements when they need to run on an embedded device.
- Constrained Use Cases - If transcription is being deployed in a use case that has a limited vocabulary and/or grammar (not general transcription), the hybrid model can be trained or adapted to target this use case and achieve extremely high accuracy. Examples would be voice command-and-control of a device, or users speaking from a list of commands. Multiple grammars can even be supported and swapped in/out of the recognizer when it is running.
- Confidence - Per-word confidence estimates are more accurate for Hybrid models than E2E models.
- Alternate Words - Hybrid models can return rich results beyond a 1-best transcript, containing potential alternate words/sentences for the transcribed audio.
- Less Compute Required for Training - Training and adapting speech models requires fewer GPU/compute resources.
End-to-End models are likely to be the best choice for customers that are primarily concerned with maximizing accuracy for general transcription. However, the hybrid models may be more appropriate, and even more accurate, under some conditions: very low-latency streaming, low compute/memory embedded transcription, highly custom/unique vocabulary, a very narrow domain (ex: speaking a small number of device-directed commands), or vocabulary and expected command sets that change often (even between each audio stream passed as input). By supporting both types of models and offering several different options for model customization, Cobalt is able to satisfy nearly any use case that a customer may require. Supporting multiple model types also future-proofs the service: if an improved type of speech recognition model becomes available in the future, users of Cobalt Transcribe will be able to start using it with minimal changes to their API usage.
Our gen-2 E2E models currently do not support word-level confidence, confusion network outputs, recognition context, or GPU decoding. However, these features will be added to the E2E models soon.
1.1.9 - API Reference
The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.
This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.
TranscribeService
Service that implements the Cobalt Transcribe Speech Recognition API.
Version
Version(VersionRequest) VersionResponse
Queries the version of the server.
ListModels
ListModels(ListModelsRequest) ListModelsResponse
Retrieves a list of available speech recognition models.
StreamingRecognize
StreamingRecognize(StreamingRecognizeRequest) StreamingRecognizeResponse
Performs bidirectional streaming speech recognition: receive results while sending audio. This method is only available via gRPC and not via HTTP+JSON. However, a web browser may use websockets to access this service.
CompileContext
CompileContext(CompileContextRequest) CompileContextResponse
Compiles recognition context information, such as a specialized list of
words or phrases, into a compact, efficient form to send with subsequent
StreamingRecognize requests to customize speech recognition. For example,
a list of contact names may be compiled in a mobile app and sent with each
recognition request so that the app user’s contact names are more likely to
be recognized than arbitrary names. This pre-compilation ensures that there
is no added latency for the recognition request. It is important to note
that in order to compile context for a model, that model has to support
context in the first place, which can be verified by checking its
ModelAttributes.ContextInfo obtained via the ListModels method. Also,
the compiled data will be model specific; that is, the data compiled for
one model will generally not be usable with a different model.
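The compile-once, reuse-per-request flow described above can be sketched as follows. This is plain Go using hypothetical stand-in types rather than the generated proto messages; it shows the intended sequence: check context support (as reported by ListModels), compile once, then attach the compiled bytes to every recognition request.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the generated proto messages; the real
// types come from the SDK code generated from the proto definition.
type ContextInfo struct {
	SupportsContext      bool
	AllowedContextTokens []string
}

type CompiledContext struct{ Data []byte }

// compileContext mimics the CompileContext RPC shape: given a model's
// context attributes, a token, and a phrase list, return an opaque
// compiled blob. The real compilation happens server-side; here it is
// faked so the validation flow can be shown.
func compileContext(info ContextInfo, token string, phrases []string) (CompiledContext, error) {
	if !info.SupportsContext {
		return CompiledContext{}, errors.New("model does not support context")
	}
	for _, t := range info.AllowedContextTokens {
		if t == token {
			return CompiledContext{Data: []byte(fmt.Sprint(token, phrases))}, nil
		}
	}
	return CompiledContext{}, fmt.Errorf("token %q not allowed for this model", token)
}

func main() {
	// Attributes as they would be reported by ListModels.
	info := ContextInfo{SupportsContext: true, AllowedContextTokens: []string{"contact"}}

	// Compile once, e.g. whenever the user's contact list changes...
	compiled, err := compileContext(info, "contact", []string{"Anil", "Svetlana"})
	if err != nil {
		panic(err)
	}
	// ...then attach the compiled bytes to the RecognitionContext of
	// every subsequent StreamingRecognize request: no added latency.
	fmt.Println(len(compiled.Data) > 0) // true
}
```

In a real client, compileContext would be the CompileContext RPC, and the compiled blob would travel in the RecognitionContext of each StreamingRecognize config.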
Messages
- If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of those fields populated.
- If a field is labeled repeated, then the generated code will accept an array (or slice, or list, depending on the language).
AudioFormatRAW
Details of audio in raw format
Fields
-
encoding (AudioEncoding ) Encoding of the samples. It must be specified explicitly; using the default value of AUDIO_ENCODING_UNSPECIFIED will result in an error. -
bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.). This is a required field.
-
byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8. -
sample_rate (uint32 ) Sampling rate in Hz. This is a required field.
-
channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc. This is a required field.
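Given these fields, the raw data rate the server will receive follows directly from sample rate, bit depth, and channel count. A small sketch (bytesPerSecond is an illustrative helper, not part of the SDK):

```go
package main

import "fmt"

// bytesPerSecond returns the raw PCM data rate implied by an
// AudioFormatRAW configuration: sampleRate samples per second, bitDepth
// bits per sample, and the given number of channels.
func bytesPerSecond(sampleRate, bitDepth, channels uint32) uint32 {
	return sampleRate * (bitDepth / 8) * channels
}

func main() {
	// 16 kHz, 16-bit, mono: a common wideband configuration.
	fmt.Println(bytesPerSecond(16000, 16, 1)) // 32000 bytes/sec
	// 8 kHz, 8-bit mu-law, mono: typical telephony audio.
	fmt.Println(bytesPerSecond(8000, 8, 1)) // 8000 bytes/sec
}
```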
CompileContextRequest
The top-level message sent by the client for the CompileContext request. It
contains a list of phrases or words, paired with a context token included in
the model being used. The token specifies a category such as “menu_item”,
“airport”, “contact”, “product_name” etc. The context token is used to
determine the places in the recognition output where the provided list of
phrases or words may appear. The allowed context tokens for a given model can
be found in its ModelAttributes.ContextInfo obtained via the ListModels
method.
Fields
-
model_id (string ) Unique identifier of the model to compile the context information for. The model chosen needs to support context, which can be verified by checking its ModelAttributes.ContextInfo obtained via ListModels. -
token (string ) The token that is associated with the provided list of phrases or words (e.g. “menu_item”, “airport” etc.). Must be one of the tokens included in the model being used, which can be retrieved by calling the ListModels method. -
phrases (ContextPhrase repeated) List of phrases and/or words to be compiled.
CompileContextResponse
The message returned to the client by the CompileContext method.
Fields
- context (CompiledContext ) Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.
CompiledContext
Context information in a compact form that is efficient for use in subsequent recognition requests. The size of the compiled form will depend on the amount of text that was sent for compilation. For 1000 words it’s generally less than 100 kilobytes.
Fields
- data (bytes ) The context information compiled by the CompileContext method.
ConfusionNetworkArc
An Arc inside a Confusion Network Link
Fields
-
word (string ) Word in the recognized transcript
-
confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.
-
features (ConfusionNetworkArcFeatures ) Features related to this arc
ConfusionNetworkArcFeatures
Features related to confusion network arcs
Fields
- confidence (map ConfusionNetworkArcFeatures.ConfidenceEntry repeated) A map of features that are used for recalculating confidence scores of this confusion network arc
ConfusionNetworkArcFeatures.ConfidenceEntry
Fields
ConfusionNetworkLink
A Link inside a confusion network
Fields
-
start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this link
-
duration_ms (uint64 ) Duration in milliseconds of the current link in the confusion network
-
arcs (ConfusionNetworkArc repeated) Arcs contained in this link
ContextInfo
Model information specific to supporting recognition context.
Fields
-
supports_context (bool ) If this is set to true, the model supports taking context information into account to aid speech recognition. The information may be sent with recognition requests via RecognitionContext inside RecognitionConfig.
-
allowed_context_tokens (string repeated) A list of tokens (e.g. “name”, “airport” etc.) that serve as placeholders in the model where a client-provided list of phrases or words may be used to aid speech recognition and produce the exact desired recognition output.
ContextPhrase
A phrase or word that is to be compiled into context information that can be
later used to improve speech recognition during a StreamingRecognize call.
Along with the phrase or word itself, there is an optional boost parameter
that can be used to boost the likelihood of the phrase or word in the
recognition output.
Fields
-
text (string ) The actual phrase or word.
-
boost (float ) This is an optional field. The boost factor is a positive number which is used to multiply the probability of the phrase or word appearing in the output. This setting can be used to differentiate between similar sounding words, with the desired word given a bigger boost factor.
By default, all phrases or words provided in the RecognitionContext are given an equal probability of occurring. Boost factors larger than 1 make the phrase or word more probable and boost factors less than 1 make it less likely. A boost factor of 2 corresponds to making the phrase or word twice as likely, while a boost factor of 0.5 means half as likely.
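Since the boost factor is a straight multiplier on a phrase's probability, the effect is easy to sketch (boosted is a hypothetical helper for illustration, not an SDK function):

```go
package main

import "fmt"

// boosted applies a ContextPhrase-style boost factor: the phrase's
// probability is multiplied by the factor, so boost > 1 makes the
// phrase more likely and boost < 1 makes it less likely.
func boosted(prob, boost float64) float64 {
	return prob * boost
}

func main() {
	base := 0.1
	fmt.Println(boosted(base, 2.0)) // twice as likely
	fmt.Println(boosted(base, 0.5)) // half as likely
}
```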
ListModelsRequest
The top-level message sent by the client for the ListModels method.
ListModelsResponse
The message returned to the client by the ListModels method.
Fields
- models (Model repeated) List of models available for use that match the request.
Model
Description of a Transcribe Model
Fields
-
id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RecognitionConfig message. -
name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their recognition task.
-
attributes (ModelAttributes ) Model attributes
ModelAttributes
Attributes of a Transcribe Model
Fields
-
sample_rate (uint32 ) Audio sample rate supported by the model
-
context_info (ContextInfo ) Attributes specific to supporting recognition context.
RecognitionAlternative
A recognition hypothesis
Fields
-
transcript_formatted (string ) Text representing the transcription of the words that the user spoke.
The transcript will be formatted according to the server's formatting configuration. If you want the raw transcript, please see the field transcript_raw. If the server is configured to not use any formatting, then this field will contain the raw transcript. As an example, if the spoken utterance was “four people”, and the server was configured to format numbers, this field would be set to “4 people”.
-
transcript_raw (string ) Text representing the transcription of the words that the user spoke, without any formatting applied. If you want the formatted transcript, please see the field transcript_formatted. As an example, if the spoken utterance was “four people”, this field would be set to “FOUR PEOPLE”. -
start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this utterance.
-
duration_ms (uint64 ) Duration in milliseconds of the current utterance in the spoken audio.
-
confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct.
-
word_details (WordDetails ) Word-level details corresponding to the transcripts. This is available only if enable_word_details was set to true in the RecognitionConfig.
RecognitionAudio
Audio to be sent to the recognizer
Fields
- data (bytes )
RecognitionConfig
Configuration for setting up a Recognizer
Fields
-
model_id (string ) Unique identifier of the model to use, as obtained from a Model message. -
oneof audio_format.audio_format_raw (AudioFormatRAW ) Audio is raw data without any headers
-
oneof audio_format.audio_format_headered (AudioFormatHeadered ) Audio has a self-describing header. Headers are expected to be sent at the beginning of the entire audio file/stream, and not in every RecognitionAudio message. The default value of this type is AUDIO_FORMAT_HEADERED_UNSPECIFIED. If this value is used, the server may attempt to detect the format of the audio. However, it is recommended that the exact format be specified.
-
selected_audio_channels (uint32 repeated) This is an optional field. If the audio has multiple channels, this field can be configured with the list of channel indices that should be considered for the recognition task. These channels are 0-indexed.
Example: [0] for a mono file, [0, 1] for a stereo file. Example: [1] to only transcribe the second channel of a stereo file. If this field is not set, all the channels in the audio will be processed.
Channels that are present in the audio may be omitted, but it is an error to include a channel index in this field that is not present in the audio. Channels may be listed in any order, but the same index may not be repeated in this list.
BAD: [0, 2] for a stereo file; BAD: [0, 0] for a mono file. -
audio_time_offset_ms (uint64 ) This is an optional field. It can be used to indicate that the audio being streamed to the recognizer is offset from the original stream by the provided duration in milliseconds. This offset will be added to all timestamps in results returned by the recognizer.
The default value of this field is 0ms, so the timestamps in the recognition result will not be modified.
Example use case where this field can be helpful: if a recognition session was interrupted and audio needs to be sent to a new session from the point where the session was previously interrupted, the offset could be set to the point where the interruption had happened.
-
enable_word_details (bool ) This is an optional field. If this is set to true, each result will include word-level details of the transcript. These details are specified in the WordDetails message. If set to false, no word-level details will be returned. The default is false. -
enable_confusion_network (bool ) This is an optional field. If this is set to true, each result will include a confusion network. If set to false, no confusion network will be returned. The default is false. If the model being used does not support returning a confusion network, this field will have no effect. Tokens in the confusion network always correspond to tokens in the transcript_raw returned. -
metadata (RecognitionMetadata ) This is an optional field. If there is any metadata associated with the audio being sent, use this field to provide it to the recognizer. The server may record this metadata when processing the request. The server does not use this field for any other purpose.
-
context (RecognitionContext ) This is an optional field for providing any additional context information that may aid speech recognition. This can also be used to add out-of-vocabulary words to the model or boost recognition of specific proper names or commands. Context information must be pre-compiled via the CompileContext() method.
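The selected_audio_channels rules given above (indices are 0-based, every index must exist in the audio, and no index may repeat) can be checked client-side before sending a config. A sketch with a hypothetical validateChannels helper (not part of the SDK):

```go
package main

import "fmt"

// validateChannels enforces the selected_audio_channels rules: every
// index must exist in the audio (0-indexed) and no index may repeat.
// Omitting channels is allowed; an empty list means "process all".
func validateChannels(selected []uint32, audioChannels uint32) error {
	seen := make(map[uint32]bool)
	for _, ch := range selected {
		if ch >= audioChannels {
			return fmt.Errorf("channel %d not present in %d-channel audio", ch, audioChannels)
		}
		if seen[ch] {
			return fmt.Errorf("channel %d listed more than once", ch)
		}
		seen[ch] = true
	}
	return nil
}

func main() {
	fmt.Println(validateChannels([]uint32{0, 1}, 2)) // <nil>: OK for a stereo file
	fmt.Println(validateChannels([]uint32{0, 2}, 2)) // error: channel 2 not present
	fmt.Println(validateChannels([]uint32{0, 0}, 1)) // error: duplicate index
}
```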
RecognitionConfusionNetwork
Confusion network in recognition output
Fields
- links (ConfusionNetworkLink repeated)
RecognitionContext
A collection of additional context information that may aid speech recognition. This can be used to add out-of-vocabulary words to the model or to boost recognition of specific proper names or commands.
Fields
- compiled (CompiledContext repeated) List of compiled context information, with each entry being compiled from a list of words or phrases using the CompileContext method.
RecognitionError
Developer-facing error message about a non-fatal recognition issue.
Fields
- message (string )
RecognitionMetadata
Metadata associated with the audio to be recognized.
Fields
- custom_metadata (string ) Any custom metadata that the client wants to associate with the recording. This could be a simple string (e.g. a tracing ID) or structured data (e.g. JSON).
RecognitionResult
A recognition result corresponding to a portion of audio.
Fields
-
alternatives (RecognitionAlternative repeated) An n-best list of recognition hypotheses (alternatives)
-
is_partial (bool ) If this is set to true, it denotes that the result is an interim partial result, and could change after more audio is processed. If unset, or set to false, it denotes that this is a final result and will not change.
Servers are not required to implement support for returning partial results, and clients should generally not depend on their availability.
-
cnet (RecognitionConfusionNetwork ) If
enable_confusion_networkwas set to true in theRecognitionConfig, and if the model supports it, a confusion network will be available in the results. -
audio_channel (uint32 ) Channel of the audio file that this result was transcribed from. Channels are 0-indexed, so for mono audio data, this value will always be 0.
StreamingRecognizeRequest
The top-level messages sent by the client for the StreamingRecognize
method. In this streaming call, multiple StreamingRecognizeRequest messages
should be sent. The first message must contain a RecognitionConfig message
only, and all subsequent messages must contain RecognitionAudio only. All
RecognitionAudio messages must contain non-empty audio. If audio content is
empty, the server may choose to interpret it as end of stream and stop
accepting any further messages.
Fields
-
oneof request.config (RecognitionConfig )
-
oneof request.audio (RecognitionAudio )
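The required message ordering, one config-only message followed by audio-only messages with non-empty data, can be sketched with hypothetical stand-in types (the real oneof types are generated from the proto, and the model ID here is a placeholder):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the generated oneof request type.
type request struct {
	config *string // non-nil only in the first message
	audio  []byte  // non-empty audio in every later message
}

// buildRequests produces the message sequence StreamingRecognize
// expects: one config-only message, followed by audio-only messages.
// Empty chunks are rejected, since the server may treat empty audio
// content as end of stream.
func buildRequests(modelID string, chunks [][]byte) ([]request, error) {
	reqs := []request{{config: &modelID}}
	for _, c := range chunks {
		if len(c) == 0 {
			return nil, errors.New("audio chunks must be non-empty")
		}
		reqs = append(reqs, request{audio: c})
	}
	return reqs, nil
}

func main() {
	reqs, _ := buildRequests("example-model", [][]byte{{1, 2}, {3, 4}})
	fmt.Println(len(reqs))             // 3: config followed by two audio messages
	fmt.Println(reqs[0].config != nil) // true: the first message carries the config
}
```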
StreamingRecognizeResponse
The messages returned by the server for the StreamingRecognize request.
Multiple messages of this type will be delivered on the stream, for multiple
results, as soon as results are available from the audio submitted so far. If
the audio has multiple channels, the results of all channels will be
interleaved. Results of each individual channel will be chronological.
However, there is no guarantee of the order of results across channels.
Clients should process both the result and error fields in each message.
At least one of these fields will be present in the message. If both result
and error are present, the result is still valid.
Fields
-
result (RecognitionResult ) A new recognition result. This field will be unset if a new result is not yet available.
-
error (RecognitionError ) A non-fatal error message. If a server encountered a non-fatal error when processing the recognition request, it will be returned in this message. The server will continue to process audio and produce further results. Clients can continue streaming audio even after receiving these messages. This error message is meant to be informational.
An example of when these errors may be produced: audio is sampled at a lower rate than expected by the model, producing possibly less accurate results.
This field will be unset if there is no error to report.
VersionRequest
The top-level message sent by the client for the Version method.
VersionResponse
The message sent by the server for the Version method.
Fields
- version (string ) Version of the server handling these requests.
WordDetails
Fields
-
formatted (WordInfo repeated) Word-level information corresponding to the transcript_formatted field. -
raw (WordInfo repeated) Word-level information corresponding to the transcript_raw field.
WordInfo
Word level details for recognized words in a transcript
Fields
-
word (string ) The actual word in the text
-
confidence (double ) Confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.
-
start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word.
-
duration_ms (uint64 ) Duration in milliseconds of the current word in the spoken audio.
Enums
AudioEncoding
The encoding of the audio data to be sent for recognition.
| Name | Number | Description |
|---|---|---|
| AUDIO_ENCODING_UNSPECIFIED | 0 | AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error. |
| AUDIO_ENCODING_SIGNED | 1 | PCM signed-integer |
| AUDIO_ENCODING_UNSIGNED | 2 | PCM unsigned-integer |
| AUDIO_ENCODING_IEEE_FLOAT | 3 | PCM IEEE-Float |
| AUDIO_ENCODING_ULAW | 4 | G.711 mu-law |
| AUDIO_ENCODING_ALAW | 5 | G.711 a-law |
AudioFormatHeadered
| Name | Number | Description |
|---|---|---|
| AUDIO_FORMAT_HEADERED_UNSPECIFIED | 0 | AUDIO_FORMAT_HEADERED_UNSPECIFIED is the default value of this type. |
| AUDIO_FORMAT_HEADERED_WAV | 1 | WAV with RIFF headers |
| AUDIO_FORMAT_HEADERED_MP3 | 2 | MP3 format with a valid frame header at the beginning of data |
| AUDIO_FORMAT_HEADERED_FLAC | 3 | FLAC format |
| AUDIO_FORMAT_HEADERED_OGG_OPUS | 4 | Opus format with OGG header |
ByteOrder
Byte order of multi-byte data
| Name | Number | Description |
|---|---|---|
| BYTE_ORDER_UNSPECIFIED | 0 | BYTE_ORDER_UNSPECIFIED is the default value of this type. |
| BYTE_ORDER_LITTLE_ENDIAN | 1 | Little Endian byte order |
| BYTE_ORDER_BIG_ENDIAN | 2 | Big Endian byte order |
Scalar Value Types
1.1.10 - FAQ
System Requirements
Does Cobalt Transcribe run on Linux?
Yes, you can run Cobalt Transcribe on Linux natively or via Docker. Check out the documentation to get started.
Does Cobalt Transcribe run on macOS?
Yes, you can run Cobalt Transcribe on macOS via Docker Desktop for evaluation purposes. Check out the documentation to get started. However, we don’t recommend running Cobalt Transcribe on macOS in production.
Does Cobalt Transcribe run on Windows?
Yes, Windows is supported via Docker Desktop for evaluation. We don’t recommend running Cobalt Transcribe on Windows in production.
Does Cobalt Transcribe run on embedded devices?
Yes, Cobalt Transcribe supports embedded devices such as Raspberry Pi, Tegra etc. However, you’ll probably want to contact us for a smaller model due to memory limitations.
Does Cobalt Transcribe run on Android or iOS?
Android and iOS require a specific implementation strategy. Please contact us for support working with Android or iOS.
What are the technical requirements for a scaled on-premise deployment?
Each containerized instance of Cobalt Transcribe should be provided with 4 cores and 8 GB of RAM when used for streaming recognition.
Product Features
Which languages does Cobalt Transcribe support?
Cobalt offers speech recognition in English (US & UK), Spanish, French, German, Russian, Brazilian Portuguese, Korean, Japanese, Swahili, Cambodian. Please contact sales@cobaltspeech.com to learn more. Cobalt is always looking for partners to develop, sell, and/or market speech technology in other languages.
Can I use Cobalt Transcribe in the field of telephony such as contact centers?
Yes, Cobalt Transcribe offers low-latency 8 kHz telephony models for transcribing telephone calls and contact center conversations. Additional insight is retrievable through high-precision timestamps and n-best transcripts. Cobalt technology provides solutions for contact centers including summarization, redaction, and sentiment analysis of conversations.
Can I redact Personally Identifiable Information (PII) from the output transcript?
PII redaction is a separate service that can be integrated with Cobalt Transcribe. Please contact us for details.
Does Cobalt Transcribe support real-time transcription?
Yes, Cobalt Transcribe can accept audio samples as they are recorded and will provide streaming output with relatively low latency. It also supports output of partial results, which are available almost immediately during decoding. This feature is useful in real-time interfaces where users can see what is being recognized nearly as soon as they speak, though some words in the preliminary output may be corrected in a final result as more audio and context become available. Cobalt Transcribe performs automatic end-pointing to determine the end of an utterance.
Recognition Accuracy
How accurate is Cobalt Transcribe?
Cobalt Transcribe is available in two different architectures: Hybrid and End-to-End. We have evaluated the word error rate (WER) of both versions of Cobalt Transcribe on several industry-standard test datasets:
| Dataset | Domain | Hybrid WER | End-to-End WER |
|---|---|---|---|
| CommonVoice-test | Read Speech, Crowd Sourced | 11.5% | 5.0% |
| Librispeech-test | Read Speech, Audiobooks, Crowd Sourced | 6.2% | 2.2% |
| Tedlium-test | Spontaneous Speech, Presentations | 7.5% | 3.9% |
| WallStreetJournal-test | Read Speech, News | 7.4% | 5.8% |
| MultilingualLibriSpeech-test | Read Speech, Audiobooks, Crowd Sourced | 8.8% | 4.0% |
| OHSU-test | Spontaneous Speech, Children’s Speech | 16.9% | 12.4% |
The WER is dependent on a number of factors such as the train-test split, formatting of the decoded transcript, accuracy vs. latency trade-offs, etc. Therefore, these numbers are not directly comparable to the WERs reported by other service providers, even on the same dataset.
How do I further improve audio transcription accuracy?
Our base models are trained on a large amount of audio and text to ensure robust accuracy on a variety of use cases. The configurable nature of Cobalt Transcribe’s models allows for updates that can improve transcription accuracy specific to your use-case:
-
Adding vocabulary and context via the
RecognitionContextAPI: This will help you to capture proper names and domain-specific terminology correctly. -
End-to-end (E2E) models typically have better accuracy and more robust recognition performance for different accents and dialects. However, E2E models are more computationally expensive, and tend to have higher latency when compared to hybrid models. If you would like to try one out please contact us at sales@cobaltspeech.com
-
Continuous adaptation of acoustic models (AMs) using Cobalt Transcribe Tuner. This continuous learning framework automatically updates the acoustic model using your production data. For more information, contact sales@cobaltspeech.com.
-
Cobalt’s speech scientists can work with you to optimize accuracy for your conditions and application: For speech recognition in a specific acoustic environment or domain-specific use case (e.g. noisy factory floor, airport, surgical lab, patient-doctor conversations, quarterly earnings calls, etc.) we can adapt the acoustic and language models using relevant audio and text data.
I am starting a new speech project. How can I get the best transcription accuracy?
The transcription accuracy depends on several factors such as:
- Appropriate sampling rate (8kHz / 16kHz) and matching model
- Audio format: lossless codecs like WAV or FLAC may be preferable to MP3 or Ogg
- Microphone selection, placement, and directionality (cardioid, omni)
- Trade-offs between latency and accuracy.
- Consider constrained grammar or providing recognition context
- AM and LM adaptation
Recognition Speed/Performance
How do I further improve audio transcription latency?
One way to improve the latency is to make the streaming buffer size smaller. We recommend setting the streaming buffer size between 512 bytes and 4096 bytes. We can also work with you to tune model parameters such as beam search width to reduce the latency. Our speech scientists can also make a smaller model for your application, or tune parameters and model size for optimal latency and accuracy trade-offs. If you’re interested in this, contact sales@cobaltspeech.com.
How long does it take to transcribe audio?
The processing speed of speech to text conversion is measured by the real time factor (RTF) which is the ratio of time taken to transcribe an audio file to the duration of the audio. Cobalt Transcribe has an RTF of 0.16 and 0.4 using our general purpose hybrid and E2E models, respectively. That means transcribing one hour of speech typically takes approximately 10 minutes for the hybrid model and about 24 minutes for the E2E model.
Are there limits on the number of jobs that can be processed concurrently?
Number of concurrent audio channels depends on the models being used and the CPU. Our general purpose models typically support 6 channels per core for realtime streams when running on a CPU such as C6i EC2 instances.
What does it cost, in terms of CPU resources, to transcribe a million minutes of speech?
With our standard models, Cobalt Transcribe can run 6 channels per core for realtime speech input, assuming a c6i EC2 processor. A 4-core processor can therefore transcribe 24 minutes of audio per minute of wall-clock time. At current AWS pricing, an EC2 instance costs $0.17 per hour. Therefore, the cost for a million minutes of speech is approximately $120.
Costs can be significantly lower when using c7g instances on EC2. Contact us for more information.
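The arithmetic behind that estimate can be reproduced directly from the figures quoted above (costUSD is an illustrative helper, not part of the SDK):

```go
package main

import "fmt"

// costUSD estimates the compute cost of transcribing audioMinutes of
// realtime speech: throughput is channelsPerCore*cores audio-minutes
// per wall-clock minute, billed at pricePerHour per instance-hour.
func costUSD(audioMinutes, channelsPerCore, cores, pricePerHour float64) float64 {
	wallMinutes := audioMinutes / (channelsPerCore * cores)
	return wallMinutes / 60 * pricePerHour
}

func main() {
	// Figures quoted above: 6 channels/core, 4 cores, $0.17 per hour,
	// one million minutes of speech.
	fmt.Printf("$%.0f\n", costUSD(1e6, 6, 4, 0.17)) // roughly $120
}
```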
How scalable is Cobalt Transcribe? How can I carry out large deployments?
Cobalt Transcribe can scale between large-scale servers and low-power embedded hardware. For large-scale deployments, you can increase the number of concurrent audio channels for faster decoding. Cobalt Transcribe has the capability to decode using separate threads. Our general purpose models typically support 6 channels per core when running on a CPU such as C6i EC2 instances. Moreover, you can deploy Cobalt Transcribe via docker and kubernetes to automatically scale up (or down) your resources according to your demand in a cost effective manner without causing a decline in performance.
Recognition accuracy vs speed/performance
How do I choose between hybrid and end-to-end models?
End-to-end models are likely to be the best choice for customers primarily concerned with maximizing accuracy for general transcription. However, hybrid models may be more appropriate, and even more accurate, under some conditions:
- very low-latency streaming
- low compute/memory embedded transcription
- a highly custom or unique vocabulary
- a very narrow domain (e.g. speaking a small number of device-directed commands)
- vocabularies or expected command sets that change often (even between audio streams passed as input)
For a detailed comparison, take a look at Hybrid vs End-to-End Models.
Supported Audio Formats
What type of media files does Cobalt Transcribe support?
Cobalt Transcribe supports common media formats such as WAV, MP3, FLAC, and Ogg, and audio encodings such as PCM, mu-law, and a-law. Raw audio is also supported.
My audio source is 48kHz/44.1kHz. Does Cobalt Transcribe support that?
Yes, we resample the audio to an appropriate sampling rate automatically. Please note that our default sampling rates are 16kHz for wideband models and 8kHz for telephony models. Accuracy improvements for higher sampling rates (than 16kHz) are minimal, and not generally worth the associated increase in data rates and data transfer requirements, or the additional overhead for resampling.
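The data-rate cost of higher sampling rates is easy to quantify for uncompressed PCM:

```python
# Bytes per second of uncompressed PCM audio.
def pcm_bytes_per_second(sample_rate_hz: int, bytes_per_sample: int = 2, channels: int = 1) -> int:
    return sample_rate_hz * bytes_per_sample * channels

wideband = pcm_bytes_per_second(16_000)  # 32,000 B/s at 16 kHz
hi_rate = pcm_bytes_per_second(44_100)   # 88,200 B/s at 44.1 kHz, ~2.8x the data
```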
API and Integration
How do I test Cobalt Transcribe?
We are happy to offer free trials under a software evaluation license of Cobalt Transcribe with all available features. Typically, our software evaluation licenses are for a period of 30 days. To get started with Cobalt Transcribe, please check the quick start.
Please try our Cobalt Transcribe Speech Recognition Demo for simple evaluation purposes. This demo server is for testing and demonstration purposes only and is not guaranteed to support high availability or high volume.
Which SDKs are available to integrate Cobalt Transcribe into my project?
Cobalt Transcribe uses gRPC to define its APIs. The API is defined as a protobuf schema, and grpc tools can be used to generate client SDKs in several languages, including Python, Go, C++, Java, C#, etc.
Can I use my own models with Cobalt Transcribe?
Cobalt Transcribe models are trained on thousands of hours of data and produce very accurate transcripts over a wide range of different conditions. We provide tools and services that allow our models to be tailored towards your particular use case if additional accuracy is desired. If customers have their own existing Kaldi or wav2vec 2.0 models, Cobalt Transcribe supports the use of those external models.
Product Comparison
What are the benefits of Cobalt Transcribe over other speech-to-text services?
Compared to other services, Cobalt Transcribe offers the following advantages:
- You can host the Cobalt Transcribe server on your system locally or in your virtual private cloud. This enables you to keep your data private and secure.
- Cobalt Transcribe has low latency. It is particularly useful for embedded devices and real-time applications.
- Cobalt Transcribe is highly customizable. Adapting the language and acoustic models to your specific terminology will improve performance.
- Cobalt’s experienced speech scientists are available to adapt the LM and AM to your target domain for the best recognition results.
- The Cobalt Transcribe API supports several outputs: 1-best results, per-word start times and durations, per-word confidences, n-best transcripts, confusion networks, and lattices.
2 - Voice Intelligence
2.1 - Privacy Screen
Cobalt’s Privacy Screen engine can redact various categories of sensitive information automatically from text and audio. Every business that collects or deals with personal data should redact sensitive information in order to protect customer privacy, comply with laws and regulations, and discover new business opportunities.
Privacy Screen makes real-time audio and text redaction possible by combining our low-latency, accurate speech recognition engine, Transcribe, with a robust redaction backend engine that identifies several types of sensitive or confidential information. There are several categories:
- Personally Identifying Information (PII) such as names, addresses, phone numbers etc.
- Protected Health Information (PHI) such as medical conditions, injuries, names of medication etc.
- Payment Card Industry (PCI) such as credit card and bank details.
A detailed list of all the categories that are identified by Privacy Screen can be found here.
How does redaction work?
Sensitive information redaction typically works as a two-step process. First, a machine learning model detects and classifies the desired entities in the text. Then, this classification is used to determine whether each entity needs to be redacted; if it does, the entity is replaced with an entity label in the redacted transcript. Currently, Cobalt uses a state-of-the-art deep neural network (DNN) model for PII, PHI, and PCI redaction.
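The two steps can be illustrated with a deliberately simplified sketch. Privacy Screen uses DNN models rather than regular expressions; the patterns below only stand in for the detection step:

```python
import re

# Step 1 (detection/classification) is mimicked with regexes here;
# step 2 replaces each flagged span with its entity label.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```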
There are three ways to use Cobalt’s redaction solution:
- Redact PII from a text transcript
- Redact PII from an audio file
- Redact PII from an audio file with a text transcript
Each of these services can be used in two operating modes:
- Streaming mode: Redaction will run utterance by utterance, and output will be streamed out as soon as the result is ready.
- Batch mode: All input audio/transcript will be processed, redacted in one batch and the output will be available at the end of the process.
Redact PII from a text transcript
In this use case, you can identify and redact sensitive PII from an input text transcript. Detected PII entities are replaced with an appropriate PII token in the redacted text transcript. Both the input and redacted transcripts are specified as JSON with a list of utterances. Each utterance contains a list of words, and each word has:
- Text
- Redaction class
- Redaction confidence score
You can specify the desired redaction classes applicable for your use case in the config file.
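A minimal transcript in this shape might look as follows; the field names mirror the JSON conventions used elsewhere in this guide, but treat the exact schema as illustrative:

```python
import json

# Hypothetical redacted-transcript payload: utterances -> words, where each
# word carries its text, redaction class, and redaction confidence score.
transcript = {
    "utterances": [
        {
            "words": [
                {"text": "[NAME]", "redaction_class": "NAME", "redaction_confidence": 0.97},
                {"text": "lives", "redaction_class": "", "redaction_confidence": 0.0},
            ]
        }
    ]
}

decoded = json.loads(json.dumps(transcript))  # round-trip through JSON
```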
Redact PII from an audio file
In this case, the input audio file is first transcribed using Cobalt’s transcribe API and then text redaction is applied on the ASR generated transcript. Detected PII entities are replaced with an appropriate PII token in the redacted text transcript. In the output, you can get:
- Redacted text transcript
- Unredacted text transcript
- Redacted audio file where the PII has been masked with a beep sound

The redacted text transcript contains a redaction confidence score, ASR confidence score, and associated starting and ending timestamps for each utterance and/or word.
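Masking a span of audio with a beep amounts to overwriting the PII samples with a tone. A simplified sketch over 16-bit PCM sample values (not part of the API, just the idea):

```python
import math

SAMPLE_RATE = 16_000  # Hz, assumed for this sketch

def beep_region(samples, start, length, freq_hz=1000.0):
    """Return a copy of `samples` with [start, start+length) replaced by a tone."""
    out = list(samples)
    for i in range(start, min(start + length, len(out))):
        out[i] = int(10_000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
    return out

# Beep over 200 samples of a 1600-sample silent clip.
masked = beep_region([0] * 1600, start=100, length=200)
```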
Redact PII from an audio file with a text transcript
In this use case, an audio file and associated transcript is given as input in order to get the redacted transcript and redacted audio file as output. The input transcript should be specified as JSON with a list of utterances:
Each utterance has:
- The audio channel in the audio file, indexed from 0
- A list of words

Each word has:
- Text
- Timestamp in the audio file where this word starts (in milliseconds)
- Duration of this word in the audio file (in milliseconds)
Output transcript has the same format as the input, except each word has extra fields such as “redaction_class”, “redaction_confidence”, and “is_redacted”.
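Given an output transcript in that shape, pulling out the redacted words is straightforward. The schema here is illustrative, built from the fields named above:

```python
import json

output = json.loads("""
{
  "utterances": [
    {"audio_channel": 0,
     "words": [
       {"text": "[NAME]", "start_time_ms": 30, "duration_ms": 390,
        "redaction_class": "NAME", "redaction_confidence": 0.97, "is_redacted": true},
       {"text": "called", "start_time_ms": 420, "duration_ms": 120,
        "redaction_class": "", "redaction_confidence": 0.0, "is_redacted": false}
     ]}
  ]
}
""")

# Collect every word the model actually redacted.
redacted_words = [w for u in output["utterances"] for w in u["words"] if w["is_redacted"]]
```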
Text Redaction
Here is an example of text redaction:
| Raw text | Redacted text |
|---|---|
| Good morning, everybody. My name is Robert, and today I am going to share some personal information with you. I live at 123 Park Ave Apt 123 New York City, NY 10002. My Social Security number is 999999999, credit card number is 6666666666666666, and CVV code is 777. I love cats. | Good morning, everybody. My name is [NAME], and today I am going to share some personal information with you. I live at [LOCATION_ADDRESS] [LOCATION_CITY], [LOCATION_ZIP]. My Social Security number is [SSN], credit card number is [CREDIT_CARD], and CVV code is [CVV]. I love cats. |
System requirements
Minimum requirements
| | Minimum | Recommended (Text only) | Recommended (All Features) | Recommended Concurrency |
|---|---|---|---|---|
| CPU | Any x86 (Intel or AMD) processor with 6GB RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 16GB RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 64GB RAM and 100GB disk volume |
| GPU | Any x86 (Intel or AMD) processor with 28GB RAM. Nvidia GPU with compute capability 7.0 or higher (Volta or newer) and at least 16GB VRAM. 100GB disk volume | Any x86 (Intel or AMD) processor with 32GB RAM and Nvidia Tesla T4 GPU. 100GB disk volume | Any x86 (Intel or AMD) processor with 64GB RAM and Nvidia Tesla T4 GPU. 100GB disk volume |
Recommended requirements for CPU container
| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
|---|---|---|
| Azure | Standard_E2_v5 (2 vCPUs, 16GB RAM) | Standard_E8_v5 (8 vCPUs, 64GB RAM) |
| AWS | M7i.large (2 vCPUs, 8GB RAM) | m7i.4xlarge (16 vCPUs, 64GB RAM) |
| GCP | N2-Standard-2 (2 vCPUs, 8GB RAM) | N2-Standard-16 (16 vCPUs, 64GB RAM) |
Recommended requirements for GPU container
| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
|---|---|---|
| Azure | Standard_NC8as_T4_v3 | Standard_NC8as_T4_v3 |
| AWS | G4dn.2xlarge | G4dn.4xlarge |
| GCP | N1-Standard-8 + Tesla T4 | N1-Standard-16 + Tesla T4 |
2.1.1 - Server Setup
Installing Cobalt Privacy Screen
Cobalt distributes a docker-compose file that orchestrates three docker images, one for each of the following services:
- Privacy Screen Server (frontend for accepting text / audio streams)
- Transcribe Server (for recognizing text in audio files)
- Redaction Backend Engine (for redacting text data)
Having these components as separate images facilitates large deployments where each image can be auto-scaled independently based on request traffic.
Installing Server
1. Contact Cobalt to get a link to the image files in AWS S3 and the docker-compose configuration file. This link will expire in two weeks, so be sure to download the file to your own server.

2. Download with the AWS CLI if you have it, or with curl:

   ```bash
   URL="the url sent by Cobalt"
   FILE_NAME="name you want to give the file (should end with the same extension as the url, usually tar.bz2)"
   curl $URL -L -o $FILE_NAME
   ```

3. Untar the file, and load the docker images. The tar file will also contain the docker-compose.yaml file.

   ```bash
   tar -xvjf $FILE_NAME -C ./
   docker load < *.bz2
   ```

4. Copy the cobalt license file into the server folder.

5. Copy the deid license file into the server folder.

6. Start the services using docker-compose:

   ```bash
   docker-compose up --build
   ```
The server will be running in the container and listening on port 2728 for gRPC requests from clients.
2.1.2 - Connecting to the Server
Once you have the Cobalt Privacy Screen server up and running, you are ready to create a client connection.
First, you need to know the address (host:port) where the server is
running. This document will assume the values 127.0.0.1:9002, but
these can be replaced with your server address in actual code.
Default Connection
The following code snippet connects to the server and queries its version. It uses our recommended default setup, expecting the server to be listening on a TLS encrypted connection.
```go
package main

import (
	"log"

	"github.com/cobaltspeech/sdk-trifid/grpc/go-trifid"
)

const serverAddr = "127.0.0.1:9002"

func main() {
	client, err := trifid.NewClient(serverAddr)
	if err != nil {
		log.Fatal(err)
	}
	// Be sure to close the client when we are done with it.
	defer client.Close()
}
```

```python
import trifid

client = trifid.Client(server_address="localhost:9002")
```

Insecure Connection
It is sometimes required to connect to Privacy Screen server without TLS enabled (during debugging, for example). Note that if the server has TLS enabled, attempting to connect with an insecure client will fail.
To create an insecure connection, do the following when creating the client:
```go
client, err := trifid.NewClient(serverAddr, trifid.WithInsecure())
```

```python
client = trifid.Client(server_address="localhost:9002", insecure=True)
```

Client Authentication
In our recommended default setup, TLS is enabled in the gRPC setup, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers.
In some setups, it may be desired that the server should also validate clients connecting to it and only respond to the ones it can verify. If your Privacy Screen server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.
Please note that in the client-authentication mode, the client will still also verify the server’s certificate, and therefore this setup uses mutually authenticated TLS. This can be done with:
```go
// certPem and keyPem are the bytes of the client certificate and key
// provided to you.
client, err := trifid.NewClient(serverAddr, trifid.WithClientCert(certPem, keyPem))
```

```python
# cert_pem and key_pem are the contents of the client certificate and key
# provided to you.
client = trifid.Client(server_address="localhost:9002",
                       client_certificate=cert_pem,
                       client_key=key_pem)
```

Server Information
The client provides two methods to get information about the server - Version and ListModels.
Version
The Version method provides information about the version of the Privacy Screen server the client is connected to, as well as information about other relevant services and packages the server uses.
```go
// Request the server version info
ver, err := client.Version(context.Background())
fmt.Printf("Server Version: %v\n", ver)
```

```python
# Request the server version info
ver = client.version()
print(f"Server Version: {ver}")
```

List Models
The ListModels method fetches a list of models available on the Privacy Screen server. On the server side, the models are specified as part of the server’s config file.
```go
// Request the list of models
modelList, err := client.ListModels(context.Background())
fmt.Printf("Available Models:\n")
for _, mdl := range modelList.Models {
	fmt.Printf("  ID: %v\n", mdl.Id)
	fmt.Printf("  Name: %v\n", mdl.Name)
	fmt.Printf("  Redaction Classes: %v\n", mdl.RedactionClasses)
}
```

```python
# Request the list of models
model_list = client.list_models()
print("Available Models:")
for mdl in model_list:
    print(f"  ID: {mdl.id}")
    print(f"  Name: {mdl.name}")
    print(f"  Redaction Classes: {mdl.redaction_classes}")
```

2.1.3 - Text Redaction
TODO
2.1.4 - Concurrency
The recommended level of concurrency, i.e. the optimal number of simultaneous requests to make to the container, is covered below for the CPU and GPU containers. The recommended concurrency level is driven primarily by the compute requirements of the neural network models, such as those used for PII detection.
CPU
For neural network inference workloads, CPUs do not require inputs to be batched together to achieve good hardware utilization. In practice, due to network overhead and pre/post-processing code, it is best to use a low level of concurrency, such as 2 per container instance. If latency isn’t a concern, a value of 32 is recommended.
GPU
Unlike CPUs, GPUs require inputs to be batched together and processed as a single large input to achieve optimal hardware utilization. This means that there is a tradeoff between latency and throughput. A concurrency level of 32 per container instance is a good tradeoff between latency and throughput, however concurrency levels as low as 8 do not significantly impact throughput. If latency isn’t a concern, a value of 128 will ensure maximum hardware utilization.
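In client code, the simplest way to hold a target concurrency level is to size a worker pool accordingly. `send_request` below is a placeholder for the real gRPC call:

```python
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 32  # e.g. the per-container GPU recommendation above

def send_request(payload):
    # Placeholder: a real client would issue a gRPC redaction request here.
    return f"redacted:{payload}"

def process_all(payloads, concurrency=CONCURRENCY):
    # At most `concurrency` requests are in flight at any time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_request, payloads))

results = process_all([f"doc-{i}" for i in range(100)], concurrency=8)
```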
2.1.5 - Privacy Screen Client
This release includes a Privacy Screen client that can be used to quickly send audio/transcripts to the server. It reads audio files in WAV (PCM16SLE) format and transcripts in JSON format; example txt and json files are shown below:
Jack and Jill went up to 224 North Hill drive to fetch a pail of water. Jack fell down broke his crown and Jill called 4125555555.

{
"utterances": [
{
"start_time_ms": 30,
"duration_ms": 4230,
"audio_channel": 0,
"words": [
{
"start_time_ms": 30,
"duration_ms": 390,
"text": "Jack"
},
{
"start_time_ms": 420,
"duration_ms": 120,
"text": "and"
},
{
"start_time_ms": 540,
"duration_ms": 240,
"text": "Jill"
},
{
"start_time_ms": 780,
"duration_ms": 240,
"text": "went"
},
{
"start_time_ms": 1020,
"duration_ms": 150,
"text": "up"
},
{
"start_time_ms": 1170,
"duration_ms": 60,
"text": "to"
},
{
"start_time_ms": 1230,
"duration_ms": 1080,
"text": "224"
},
{
"start_time_ms": 2310,
"duration_ms": 300,
"text": "North"
},
{
"start_time_ms": 2610,
"duration_ms": 150,
"text": "Hill"
},
{
"start_time_ms": 2760,
"duration_ms": 300,
"text": "drive"
},
{
"start_time_ms": 3060,
"duration_ms": 90,
"text": "to"
},
{
"start_time_ms": 3150,
"duration_ms": 270,
"text": "fetch"
},
{
"start_time_ms": 3420,
"duration_ms": 60,
"text": "a"
},
{
"start_time_ms": 3480,
"duration_ms": 270,
"text": "pail"
},
{
"start_time_ms": 3750,
"duration_ms": 120,
"text": "of"
},
{
"start_time_ms": 3870,
"duration_ms": 390,
"text": "water."
}
]
},
{
"start_time_ms": 9300,
"duration_ms": 5324,
"audio_channel": 1,
"words": [
{
"start_time_ms": 9300,
"duration_ms": 420,
"text": "Jack"
},
{
"start_time_ms": 9720,
"duration_ms": 210,
"text": "fell"
},
{
"start_time_ms": 9930,
"duration_ms": 420,
"text": "down"
},
{
"start_time_ms": 10410,
"duration_ms": 270,
"text": "broke"
},
{
"start_time_ms": 10680,
"duration_ms": 150,
"text": "his"
},
{
"start_time_ms": 10830,
"duration_ms": 450,
"text": "crown"
},
{
"start_time_ms": 11310,
"duration_ms": 180,
"text": "and"
},
{
"start_time_ms": 11490,
"duration_ms": 210,
"text": "Jill"
},
{
"start_time_ms": 11700,
"duration_ms": 330,
"text": "called"
},
{
"start_time_ms": 12030,
"duration_ms": 2594,
"text": "4125555555."
}
]
}
]
}

Examples of client calls
There are several ways the client interacts with the server. These examples are always run from the same directory as the client binary. When in doubt, run ./privacy-screen-grpc-client -h for more information about the parameters needed to run the client.
Redact Text
./privacy-screen-grpc-client redact-text \
--insecure \
--model-id general \
--input-text input.txt \
--output-result redacted_token.json
Redact Transcript
./privacy-screen-grpc-client redact-transcript \
--insecure \
--model-id general \
--input-transcript testdata/input.json \
--output-transcript redacted_output.json
Redact Transcribed Audio
./privacy-screen-grpc-client redact-transcribed-audio \
--insecure \
--model-id general \
--input-audio testdata/input.wav \
--input-transcript testdata/input.json \
--output-audio redacted_output.wav \
--output-transcript redacted_output.json \
--timeout 5m
Transcribe and Redact
./privacy-screen-grpc-client transcribe-and-redact \
--insecure \
--model-id en_US \
--input-audio testdata/input.wav \
--output-audio redacted_output.wav \
--output-transcript redacted_output.json \
--output-unredacted-transcript unredacted_output.json \
--timeout 5m
2.1.6 - Redaction Categories
Personally Identifiable Information (PII)
| Label | Description | Regulatory Compliance |
|---|---|---|
| ACCOUNT_NUMBER | Customer account or membership identification number Policy No. 10042992; Member ID: HZ-5235-001 Note: Full support for English; Multilingual support in progress | HIPAA_SAFE_HARBOR, CCI |
| AGE | Numbers associated with an individual’s age 27 years old; 18 months old More details: When given in years, only the number is flagged, but both number and time unit are flagged when given in other units like months or weeks. Also includes age ranges: 29-35 years old; 18+; A man in his forties | GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| DATE | Specific calendar dates, which can include days of the week, dates, months, or years Friday, Dec. 18, 2002; Dated: 02/03/97 See also: DATE_INTERVAL, DOB More details: If no calendar date is specified, days of the week are not flagged: Your appointment is on Monday. Indexical terms are not flagged: yesterday; tomorrow | HIPAA_SAFE_HARBOR, Quebec Privacy Act, CCI |
| DATE_INTERVAL | Broader time periods, including date ranges, months, seasons, years, and decades 2020-2021; 5-9 May; January 1984 See also: DATE, DOB | HIPAA_SAFE_HARBOR, CCI |
| DOB | Dates of birth Born: March 7, 1961 See also: DATE, DATE_INTERVAL | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| DRIVER_LICENSE | Driver's permit numbers DL# 134711-320 See also: VEHICLE_ID More details: Includes International Driving Permits (IDP) and Pilot’s licenses | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| DURATION | Periods of time, specified as a number and a unit of time 8 months; 2 years Note: Full support for English; Multilingual support in progress | |
| EMAIL_ADDRESS | Email addresses info@cobaltspeech.com | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| EVENT | Names of events or holidays Olympics; Yom Kippur | |
| FILENAME | Names of computer files, including the extension or filepath Taxes/2012/brad-tax-returns.pdf | CCI |
| GENDER | Terms indicating gender identity, including slang terms. Note that performance is stronger for terms that are more likely to occur in formal documents, such as "male", "transgender", "non-binary", "female", "M", "F", etc. Other terms, such as "woman", "gentleman", etc., may not be captured in every context. female; trans | CPRA, GDPR, GDPR Sensitive, APPI Sensitive |
| HEALTHCARE_NUMBER | Healthcare numbers and health plan beneficiary numbers Policy No.: 5584-486-674-YM More details: Includes medical record numbers, health insurance policy/account numbers, and member IDs, for example, German Sozialversicherungsnummer (also used as SSN), Philippine PhilHealth ID number, Ukrainian VHI number | CPRA, GDPR, HIPAA, Quebec Privacy Act, APPI |
| IP_ADDRESS | Internet IP addresses, including IPv4 and IPv6 formats 192.168.0.1 2001:db8:0:0:0:8a2e::7334 | CPRA, GDPR, HIPAA, Quebec Privacy Act, APPI |
| LANGUAGE | Names of natural languages Korean; French | GDPR, GDPR Sensitive, APPI Sensitive |
| LOCATION | Metaclass for any named location reference; see subclasses below Eritrea; Lake Victoria More details: May co-occur with ORGANIZATION when the context refers explicitly to the organization’s location The patient was transferred to Northwest General Hospital | GDPR, HIPAA_SAFE_HARBOR, APPI, CCI |
| LOCATION_ADDRESS | Full or partial physical mailing addresses, which can include: building name or number, street, city, county, state, country, zip code 25/300 Adelaide T., Perth WA 6000, Aus. 145 Windsor St. Mail to: Kollwitzstr 13, 10405, Berlin | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| LOCATION_ADDRESS_STREET | A subclass of LOCATION_ADDRESS, covering: a building number and street name, plus information like unit numbers, office numbers, floor numbers and building names, where applicable 25/300 Adelaide T., Perth WA 6000, Aus. 145 Windsor St. Mail to: Kollwitzstr 13, 10405, Berlin | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| LOCATION_CITY | Municipality names, including villages, towns, and cities Toronto; Berlin; Denpasar | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| LOCATION_COORDINATE | Geographic positions referred to using latitude, longitude, and/or elevation coordinates We’re at 40.748440 and -73.984559 | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| LOCATION_COUNTRY | Country names Canada; Namibia | GDPR, APPI, CCI |
| LOCATION_STATE | State, province, territory, or prefecture names Ontario; Arkansas; Ich lebe in NRW | GDPR, APPI, CCI |
| LOCATION_ZIP | Zip codes (including Zip+4), postcodes, or postal codes 90210; B2N 3E3 More details: Optimized for various English-speaking locales (Australia, Canada, United Kingdom, United States), as well as international equivalents | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| MARITAL_STATUS | Terms indicating marital status single; common-law; ex-wife; married | APPI Sensitive |
| MONEY | Names and/or amounts of currency 15 pesos; $94.50 | CCI |
| NAME | Names of individuals, not including personal titles such as ‘Mrs.’ or ‘Mr.’ Dwayne Johnson; Mr. Khanna | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| NAME_FAMILY | Names indicating a person’s family or community; often a last name in Western cultures and first name in Eastern cultures François Truffaut; Ozu Yasujirō | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| NAME_GIVEN | Names given to an individual, usually at birth; often first / middle names in Western cultures and middle / last names in Eastern cultures François Truffaut; Ozu Yasujirō | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| NAME_MEDICAL_PROFESSIONAL | Full names, including professional titles and certifications, of medical professionals, such as doctors and nurses Attending physician: Dr. Kay Martinez, MD | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| NUMERICAL_PII | Numerical PII (including alphanumeric strings) that doesn't fall under other categories. See also a section below on international variants, as some of them are mapped to this category, for example, Belgian BTW nummer or European VAT number. More details: Includes the following: numbers in the medical field, such as device serial numbers, POS codes, NPI numbers, etc.; computer numbers like MAC addresses, cookie IDs, VPNs, error codes, access codes, message IDs, etc.; business-related numbers like DUNS numbers, company registration numbers, provider IDs, etc.; numbers related to purchasing, like order IDs, transaction numbers, confirmation numbers, tracking numbers, etc.; also numbers assigned to various forms of IDs, files, documents, proceedings, invoices, claim IDs, record IDs, etc. | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| OCCUPATION | Job titles or professions professor; actors; engineer; CPA | Quebec Privacy Act, APPI, CCI |
| ORGANIZATION | Names of organizations or departments within an organization BHP; McDonald's; LAPD More details: May co-occur with LOCATION when the context refers explicitly to the organization’s location Donations can be brought to Royal Canadian Legion Branch 43 | Quebec Privacy Act, APPI, CCI |
| ORGANIZATION_MEDICAL_FACILITY | Names of medical facilities, such as hospitals, clinics, pharmacies, etc. Northwest General Hospital; Union Family Health Clinic | Quebec Privacy Act, APPI |
| ORIGIN | Terms indicating nationality, ethnicity, or provenance Canadian; Sri Lankan | CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive |
| PASSPORT_NUMBER | Passport numbers, issued by any country PA4568332; NU3C6L86S12 | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| PASSWORD | Account passwords, PINs, access keys, or verification answers 27%alfalfa; temp1234 My mother's maiden name is Smith | CPRA, APPI, CCI |
| PHONE_NUMBER | Telephone or fax numbers +4917643476050 | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| PHYSICAL_ATTRIBUTE | Distinctive bodily attributes, including terms indicating race I'm 190cm tall; He belongs to the Black students’ association | CPRA, GDPR, GDPR Sensitive, APPI Sensitive |
| POLITICAL_AFFILIATION | Terms referring to a political party, movement, or ideology liberal; Republican | CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive |
| RELIGION | Terms indicating religious affiliation Hindu; Presbyterian | CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive |
| SEXUALITY | Terms indicating sexual orientation, including slang terms bisexual; gay; straight | CPRA, GDPR, GDPR Sensitive, APPI Sensitive |
| SSN | Social Security Numbers or international equivalent government identification numbers 078-05-1120; ***-***-3256 More details: Includes, for example, Australian TFN, Belgian NISS, British NIN, Canadian SIN, Dutch BSN, German Sozialversicherungsnummer (also used as a healthcare number, see: HEALTHCARE_NUMBER), French INSEE, Indian Aadhaar, Italian TIN, Philippine SSS, Spanish NUSS, Ukrainian TIN, and Mexican NSS formats. Flags mentions of complete numbers as well as the last four digits only. | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| TIME | Expressions indicating clock times 19:37:28; 10pm EST | CCI |
| URL | Internet addresses www.cobaltspeech.com | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, CCI |
| USERNAME | Usernames, login names, or handles cobaltspeechandlanguage; @_CobaltSpeechAndLanguage | CPRA, GDPR, APPI |
| VEHICLE_ID | Vehicle identification numbers (VINs), vehicle serial numbers, and license plate numbers 5FNRL38918B111818; BIF7547 See also: DRIVER_LICENSE | CPRA, GDPR, HIPAA_SAFE_HARBOR, APPI, CCI |
| ZODIAC_SIGN | Names of Zodiac signs Aries; Taurus | |
Protected Health Information (PHI)
| Label | Description | Regulatory Compliance |
|---|---|---|
| BLOOD_TYPE | Blood types She's type AB positive | CPRA, GDPR, Quebec Privacy Act |
| CONDITION | Names of medical conditions, diseases, syndromes, deficits, disorders chronic fatigue syndrome; arrhythmia; depression | CPRA, GDPR, Quebec Privacy Act, APPI Sensitive |
| DOSE | Medically prescribed quantity of a medication limit intake to 700 mg/day | |
| DRUG | Medications, vitamins, and supplements advil; Acetaminophen; Panadol | CPRA, GDPR, Quebec Privacy Act, APPI Sensitive, CCI |
| INJURY | Bodily injuries, including mutations, miscarriages, and dislocations I broke my arm; I have a sprained wrist | CPRA, GDPR, Quebec Privacy Act, APPI Sensitive |
| MEDICAL_PROCESS | Medical processes, including treatments, procedures, and tests heart surgery; CT scan | CPRA, GDPR, Quebec Privacy Act, APPI Sensitive, CCI |
| STATISTICS | Medical statistics 18% of patients | Quebec Privacy Act |
Payment Card Industry (PCI) Information
| Label | Description | Policy & Regulatory Compliance |
|---|---|---|
| BANK_ACCOUNT | Bank account numbers and international equivalents, such as IBAN Acct. No.: 012345-67 | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| CREDIT_CARD | Credit card numbers 0123 0123 0123 0123 **** **** ****4252 More details: Includes debit, ATM, Direct Debit, PrePay, Charge Cards, and support for cards that do not have 16 digits such as American Express or China UnionPay cards. Flags mentions of complete numbers as well as the last four digits only. | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| CREDIT_CARD_EXPIRATION | Expiration date of a credit card Expires: July 2023; Exp: 02/28 | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI |
| CVV | 3- or 4-digit card verification codes and equivalents CVV: 080 More details: Includes institution-specific variants: American Express: CID (card ID), CVD (card verification data), CSC / 3CSC (card security code) China UnionPay: CVN (card validation number) CIBC Mastercard: SPC (signature panel code) Discover: CID (card ID), CVD (card verification data) ELO (Brazil): CVE (Elo verification code) JCB (Japan Credit Bureau): CAV (card authentication value) Mastercard: CVC (card validation code) VISA: CVV (card verification value) | CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI |
| ROUTING_NUMBER | Routing number associated with a bank or financial institution 012345678 More details: Includes international equivalents: Canadian & British sort codes, Australian BSB numbers, Indian Financial System Codes, Branch/transit numbers, Institution numbers, and Swift codes | CCI |
Beta Entity Types
Note that Beta support for the following entity types is currently only available with our English models.
| Label | Description | Regulatory Compliance |
|---|---|---|
| CORPORATE_ACTION | Any action a company takes that could affect its stock value or its shareholders Bridge Investment Group LLC (later renamed Bridge Investment Group Holdings LLC); We’ve merged two neighboring retail locations | CCI |
| FINANCIAL_METRIC | Financial metrics or financial ratios are quantitative indicators of a company’s financial health adjusted earnings per share declined year-over-year; Online sales slow as UK shoppers rein in Christmas spending | CCI |
| MEDICAL_CODE | Codes belonging to medical classification systems such as SNOMED, ICD-10, NDC, etc. 1981-03-11T04:11:32-03:00 Forearm sprain SNOMED-CT 70704007; <medcode type="string"> R74.8 <desc type="string">Abnormal levels of other serum enzymes | CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive |
| PRODUCT | Names or model numbers of items made by an organization; includes intangible products like software and games, as well as other services iPhone; Toyota Camry | CCI |
| TREND | A description of the “quality” or the direction in which a financial measurement is going reflecting the accelerating shift of off-line to online; amid rising costs and shrinking profits | CCI |
International Entity Mapping
In the tables below, you can find localized variants of our entity types. For each entity type, there is a description, an example, and the label under which the entity falls. This section does not include entity types that may vary regionally but still directly correspond to one of the entities listed above (e.g., PHONE_NUMBER, PASSPORT_NUMBER, DRIVER_LICENSE, LOCATION_ADDRESS, CREDIT_CARD_EXPIRATION). The following numbers are commonly used across many countries and are therefore not included in each country's table: GST (Goods and Services Tax), HST (Harmonized Sales Tax). These numbers are redacted as NUMERICAL_PII.
Asia Pacific
Australia
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Australian business number (ABN) | NUMERICAL_PII | A unique 11-digit identifier that every registered business in Australia is required to have | 12345678901 |
| Australian Company Number (ACN) | NUMERICAL_PII | A 9-digit number that must be displayed on all company documents | 123 456 789 |
| Bank-State-Branch (BSB) | ROUTING_NUMBER | A 6-digit number that identifies banks and branches across Australia | 123-456 |
| Tax File Number (TFN) | SSN | A 9-digit personal reference number used for tax and superannuation | 456 789 123 |
China
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| 的医保卡号 | HEALTHCARE_NUMBER | Healthcare number | Format varies by provider |
| 纳税人识别号码 | SSN | Taxpayer identification number consisting of 18 digits for individuals | 463728374657483746 |
India
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Aadhaar | SSN | A 12-digit individual identification number used as a proof of identity and address | 1234 5678 9123 |
| Financial System Code | ROUTING_NUMBER | A unique 11-digit alphanumeric code that is used for online fund transfer transactions | IDIB000T131 |
| Goods and Services Tax Identification Number (GSTIN) | NUMERICAL_PII | A unique 15-digit identification number assigned to every taxpayer in India | 56HNJCA5424K1DM |
| Permanent Account Number (PAN) | SSN | A unique 10-digit tax identification number issued by the Income Tax Department | ABCJF54312D |
Japan
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| 健康保険番号 | HEALTHCARE_NUMBER | Health insurance number | Format varies by provider |
| マイナンバー (個人番号) | SSN | My Number (also known as "personal number"), a unique 12-digit number assigned to every resident of Japan, whether Japanese or foreign | 123456789888 |
Korea
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| 건강보험증번호 | HEALTHCARE_NUMBER | Health insurance card number | Format varies by provider |
| 주민등록번호 | SSN | Resident Registration Number used for tax purposes, consists of 13 digits | 1236547898745 |
New Zealand
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Inland Revenue Department number (IRD) | SSN | A nine-digit individual identification number issued to each person by the New Zealand Inland Revenue Department, also known as a ‘tax file number’ | 099-999-999 |
Philippines
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| PhilHealth ID number | HEALTHCARE_NUMBER | 12-digit healthcare identification number | 11-455678912-3 |
| Social Security System number (SSS) | SSN | 10-digit number used for tax purposes | 12-3456789-1 |
| Tax Identification Number (TIN) | SSN | 12-digit number identifying a taxpayer | 123 456 789 002 |
Europe
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Value-Added Tax (VAT) | NUMERICAL_PII | A tax applied to all goods and services that are bought and sold for use or consumption in the European Union, formatted as 2 letters (country code) followed by 8-10 digits. Localized names include French "numéro TVA" | DK99999999 |
| International Bank Account Number (IBAN) | BANK_ACCOUNT | An international system of identifying bank accounts across national borders, consists of up to 34 alphanumeric characters including country codes | IE12BOFI90000112345678 |
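The IBAN row above describes a format whose two check digits can be verified with the standard ISO 7064 mod-97 procedure. The sketch below is purely illustrative (it is not how Privacy Screen detects IBANs, and the helper name is ours):

```python
def iban_is_valid(iban: str) -> bool:
    """Check an IBAN's mod-97 checksum (ISO 7064).

    Moves the first four characters (country code + check digits) to the
    end, maps letters to numbers (A=10 ... Z=35), and verifies that the
    resulting integer is congruent to 1 modulo 97.
    """
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not s.isalnum():
        return False
    rearranged = s[4:] + s[:4]
    # int(c, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(c, 36)) for c in rearranged)
    return int(digits) % 97 == 1

# The widely published example IBAN is valid:
print(iban_is_valid("GB82 WEST 1234 5698 7654 32"))  # True
```

Note this only validates the checksum; a full validator would also check the country-specific length and structure.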
Belgium
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Identificatienummer van de Sociale Zekerheid (INSZ) / Numéro d'identification à la sécurité sociale (NISS) | SSN | National identification number for social security, an 11-digit national registration number, the first 6 digits indicating date of birth | 99013187654 |
| Ondernemingsnummer | NUMERICAL_PII | A unique 10-digit identification number for a business | 1987654323 |
| Belasting Toegevoegde Waarde Nummer (BTW Nummer) | NUMERICAL_PII | An identification number for businesses used for VAT (Value Added Tax) purposes, formatted as 2 letters followed by 10 digits | BE0784732737 |
Germany
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Bankkontonummer | BANK_ACCOUNT | Bank account number, 10 digits | 0532013000 |
| Krankenversicherungsnummer (KVNR) | HEALTHCARE_NUMBER | An alphanumeric code used for personal identification in Germany's national health insurance system (Krankenversicherung) | A123456789 |
| Sozialversicherungsnummer | SSN, HEALTHCARE_NUMBER | A 12-digit number used to track a person's social security contributions, doubles as a healthcare number | 12 123456 A 123 |
| Steuer-Identnummer (St-Nr) | SSN | A unique 11-digit number assigned to every taxpayer in Germany by the Federal Central Tax Office | 12345678909 |
France
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Numéro d'Inscription au Répertoire (NIR) | SSN | A 15-digit ID number commonly known as a numéro de sécurité sociale, also referred to as an Insee number, used for employment and French health benefits | 1790223354367-97 |
| Simplification des procedures d’Imposition (SPI) | SSN | French numéro de fiscal or a numéro SPI, a unique 13-digit tax number issued by the French tax authorities to all residents and non-residents with an obligation to pay tax | 12 34 567 891 234 |
| Système d'identification du répertoire des entreprises (SIREN) | NUMERICAL_PII | A 9-digit identifier assigned to every registered business in France by the National Institute of Statistics and Economic Studies (INSEE) | 732 829 320 |
Italy
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Numero di identificazione fiscale or codice fiscale | SSN | Tax Identification Number (TIN), a 9-12 digit numeric code | 000–123–456–789 |
Netherlands
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Burgerservicenummer (BSN) | SSN | A 9-digit citizen service number | 123456789 |
Portugal
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Número de Identificação da Segurança Social (NISS) | SSN | An 11-digit number used to identify individuals in the Portuguese social security system | 12354687985 |
Russia
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Идентификационный номер налогоплательщика (ИНН) | SSN | Taxpayer Personal Identification Number, 10-12 digits, used as a social security number | 12 34567891 23 |
Spain
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Número de la Seguridad Social (NUSS) | SSN | 11-12 digit social security number | 12 34567891 23 |
| Número de Identificación Fiscal (NIF) | SSN | A 10-character number that is used to interact with the Spanish tax agency | X12345678A |
Ukraine
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Ідентифікаційний номер платника податків (ІНПП) | SSN | A 10-digit Taxpayer Identification Number (TIN) | 1234567891 |
United Kingdom
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| National Insurance Number (NIN) | SSN | Used in the UK's social security system and tax system, formatted as 2 prefix letters, 6 digits, and 1 suffix letter | QQ123456C |
| Sort code | ROUTING_NUMBER | Identifies both the bank (in the first digit or first two digits) and the branch where the account is held, usually formatted as 3 pairs of numbers | 12-34-56 |
| U.K. Unique Taxpayer Reference Number (UTR) | SSN | A 10-digit number, also called "tax reference," used in the U.K. when submitting a tax return | 12345 67890 |
North and South America
Brazil
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Cadastro de Pessoas Físicas (CPF number) | SSN | Natural Persons Register, an 11-digit number in the format: 000.000.000-00 | 657.454.244-54 |
Canada
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Healthcare number | HEALTHCARE_NUMBER | Canadian Health Service Numbers, such as Care card number, OHIP, etc., required for access to healthcare benefits | Format varies by province |
| Numéro d'assurance sociale (NAS) | SSN | A 9-digit number that citizens and permanent residents need to work and be paid in Québec; French equivalent of SIN (see below) | 365 789 654 |
| Régie de l'assurance maladie du Québec (RAMQ) | HEALTHCARE_NUMBER | The Québec Health Insurance Number | BOUF 9401 1419 |
| Social Insurance Number (SIN) | SSN | A 9-digit number that citizens and permanent residents need to work and be paid in Canada | 321 654 987 |
| Sort code | ROUTING_NUMBER | A unique 9-digit code that identifies the financial institution (4 digits) and branch of account (5-digit Transit Code) | 123456789 |
Mexico
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Número de Seguridad Social (NSS) | SSN | Social Security Number, an 11-digit code | 12345678912 |
United States
| Identifier | PAI Label | Description | Example |
|---|---|---|---|
| Healthcare number | HEALTHCARE_NUMBER | A unique number assigned by a health insurance provider (includes private and government) | Format varies by provider |
| Social Security Number (SSN) | SSN | A 9-digit number issued to U.S. citizens, permanent residents, and (working) temporary residents | 453-65-4543 |
| U.S. Individual Taxpayer Identification Number (ITIN) | SSN | A 9-digit tax processing number that begins with "9", issued in place of an SSN to individuals who are not eligible for one | 923-45-6789 |
2.1.7 - Redaction Languages
Cobalt features core support for 14 languages and extended support for 39 additional languages, with core languages offering the highest level of performance. The complete list of supported languages below details which languages have core support, which have extended or beta support, and which are upcoming additions. New languages are continually being added; please contact us if you require a language not in the list below.
In addition to supporting 50+ languages, Cobalt offers support for regional language varieties in recognition of the large differences in vocabulary and grammar that can exist in the same language when spoken in different regions. So far, this includes support for varieties of English (US, UK, Canada and Australia), Spanish (Spain and Mexico), French (France and Canada), and Portuguese (Portugal and Brazil). Cobalt also supports code-switching, or the mixing of different languages. This means that in a phrase such as J’ai payé 76,88RM por ein Haarschnitt da 范玉菲 habang ko ay nasa Україна, multilingual PII is accurately de-identified. The selection of supported regional language varieties is continually being expanded; please let us know if you have a specific request.
Cobalt’s supported entity types function across each supported language, with multilingual equivalents of different PII (Personally Identifiable Information) entities, PHI (Protected Health Information) entities, and PCI (Payment Card Industry) entities being detected in each language. Our Supported Entity Types page provides a more detailed look at our coverage of language and region-specific entity equivalents. The solution is also sensitive to cross-linguistic differences in how names are structured, how place names are referred to, and how monetary units are described in different languages, among other differences.
Core Support
| Language | ISO Code | Supported Regional Varieties | Support Level | Text Support | Audio Support | File Support | Labels |
|---|---|---|---|---|---|---|---|
| Dutch | nl | The Netherlands | Core | ✓ | ✓ | ✓ | ✓ |
| English | en | Australia, Canada, United Kingdom, United States | Core | ✓ | ✓ | ✓ | ✓ |
| French | fr | Canada (Quebec), France, Switzerland | Core | ✓ | ✓ | ✓ | ✓ |
| German | de | Germany, Belgium, Austria, Switzerland | Core | ✓ | ✓ | ✓ | ✓ |
| Hindi | hi | India | Core | ✓ | ✓ | ✓ | |
| Italian | it | Italy, Switzerland | Core | ✓ | ✓ | ✓ | ✓ |
| Japanese | ja | Japan | Core | ✓ | ✓ | ✓ | ✓ |
| Korean | ko | Korea | Core | ✓ | ✓ | ✓ | |
| Mandarin (simplified) | zh-Hans | China, Singapore | Core | ✓ | ✓ | ✓ | ✓ |
| Portuguese | pt | Brazil, Portugal | Core | ✓ | ✓ | ✓ | ✓ |
| Russian | ru | Russia | Core | ✓ | ✓ | ✓ | |
| Spanish | es | Mexico, Spain | Core | ✓ | ✓ | ✓ | ✓ |
| Tagalog | tl | Philippines | Core | ✓ | ✓ | ||
| Ukrainian | uk | Ukraine | Core | ✓ | ✓ | ✓ |
Extended Support
| Language | ISO Code | Support Level | Text Support | Audio Support | File Support | Labels |
|---|---|---|---|---|---|---|
| Afrikaans | af | Extended | ✓ | |||
| Arabic | ar | Extended | ✓ | |||
| Bambara | bm | Extended | ✓ | |||
| Bengali | bn | Extended | ✓ | |||
| Belarusian | be | Extended | ✓ | |||
| Bulgarian | bg | Extended | ✓ | |||
| Burmese | my | Extended | ✓ | |||
| Cantonese (traditional) | zh-Hant | Extended | ✓ | |||
| Catalan | ca | Extended | ✓ | |||
| Croatian | hr | Extended | ✓ | |||
| Czech | cs | Extended | ✓ | |||
| Danish | da | Extended | ✓ | |||
| Estonian | et | Extended | ✓ | |||
| Finnish | fi | Extended | ✓ | |||
| Georgian | ka | Extended | ✓ | |||
| Greek | el | Extended | ✓ | |||
| Hebrew | he | Extended | ✓ | |||
| Hungarian | hu | Extended | ✓ | |||
| Icelandic | is | Extended | ✓ | |||
| Indonesian | id | Extended | ✓ | ✓ | ||
| Khmer | km | Extended | ✓ | |||
| Latvian | lv | Extended | ✓ | |||
| Lithuanian | lt | Extended | ✓ | |||
| Luxembourgish | lb | Extended | ✓ | |||
| Malay | ms | Extended | ✓ | |||
| Moldovan | ro | Extended | ✓ | |||
| Norwegian (Bokmål) | nb | Extended | ✓ | ✓ | ||
| Persian (Farsi) | fa | Extended | ✓ | |||
| Polish | pl | Extended | ✓ | ✓ | ✓ | |
| Punjabi | pa | Extended | ✓ | |||
| Romanian | ro | Extended | ✓ | |||
| Slovak | sk | Extended | ✓ | |||
| Slovenian | sl | Extended | ✓ | |||
| Swahili | sw | Extended | ✓ | |||
| Swedish | sv | Extended | ✓ | ✓ | ||
| Tamil | ta | Extended | ✓ | |||
| Thai | th | Extended | ✓ | |||
| Turkish | tr | Extended | ✓ | ✓ | ||
| Vietnamese | vi | Extended | ✓ |
2.1.8 - Prerequisites and System Requirements
Prerequisites
Info
Please run only one container instance per machine. Running multiple containers results in vastly reduced performance.
The following prerequisites are required to run the container:
- Container engine, such as Docker (can be installed using the official instructions)
- (GPU only) Nvidia Container Toolkit with Nvidia driver version 515 or higher (can be installed using the following installation guide)
All other dependencies, such as CUDA, are included with the container and don't need to be installed separately.
System Requirements
- Docker and docker-compose installed.
The image comes in two different build flavours:
- A compact, CPU-only container that runs on any Intel or AMD CPU. The CPU container is highly optimised for the majority of use cases: it uses hand-coded AMX/AVX2/AVX512/AVX512 VNNI instructions in conjunction with neural-network compression techniques to deliver a ~25X speedup over a reference transformer-based system.
- A GPU-accelerated container designed for large-scale deployments making billions of API calls or processing terabytes of data per month.
Minimum Requirements
The minimum system requirements for the container image are as follows:
| | Minimum | Recommended (Text only) | Recommended (All Features) | Recommended Concurrency |
|---|---|---|---|---|
| CPU | Any x86 (Intel or AMD) processor with 7.5GB free RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 16GB RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 64GB RAM and 100GB disk volume | 1 |
| GPU | Any x86 (Intel or AMD) processor with 28GB free RAM. Nvidia GPU with compute capability 7.0 or higher (Volta or newer) and at least 16GB VRAM. 100GB disk volume | Any x86 (Intel or AMD) processor with 32GB RAM and Nvidia Tesla T4 GPU. 100GB disk volume | Any x86 (Intel or AMD) processor with 64GB RAM and Nvidia Tesla T4 GPU. 100GB disk volume | 32 |
Recommended Requirements
While the CPU-based container will run on any x86-compatible instance, the cloud instance types below give optimal throughput and latency per dollar:
| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
|---|---|---|
| Azure | Standard_E2_v5 (2 vCPUs, 16GB RAM) | Standard_E8_v5 (8 vCPUs, 64GB RAM) |
| AWS | m7i.large (2 vCPUs, 8GB RAM) | m7i.4xlarge (16 vCPUs, 64GB RAM) |
| GCP | N2-Standard-2 (2 vCPUs, 8GB RAM) | N2-Standard-16 (16 vCPUs, 64GB RAM) |
- If lower latency is required, the instance type should be scaled up, e.g. using an m7i.xlarge in place of an m7i.large. While the Cobalt Docker solution can make use of all available CPU cores, it delivers the best throughput per dollar on a single-core machine; scaling up CPU cores does not result in a linear increase in performance.
Similarly, for the GPU-based image, the following Nvidia T4 GPU-equipped instance types are recommended:
| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
|---|---|---|
| Azure | Standard_NC8as_T4_v3 | Standard_NC8as_T4_v3 |
| AWS | G4dn.2xlarge | G4dn.4xlarge |
| GCP | N1-Standard-8 + Tesla T4 | N1-Standard-16 + Tesla T4 |
2.1.9 - Proto API Reference
The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.
This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.
TrifidService
Service that implements the Cobalt Trifid Redaction Engine API.
Version
Version(VersionRequest) VersionResponse
Returns version information from the server.
ListModels
ListModels(ListModelsRequest) ListModelsResponse
ListModels returns information about the models the server can access.
RedactText
RedactText(RedactTextRequest) RedactTextResponse
Redacts text using a redaction engine that is configured with the provided redaction configuration.
RedactTranscript
RedactTranscript(RedactTranscriptRequest) RedactTranscriptResponse
Redacts a transcript using a redaction engine that is configured with the provided redaction configuration.
StreamingRedactTranscribedAudio
StreamingRedactTranscribedAudio(StreamingRedactTranscribedAudioRequest) StreamingRedactTranscribedAudioResponse
Performs bidirectional streaming redaction on transcribed audio. The client receives redacted audio while sending audio. The transcription of the audio data must be ready before the audio is sent.
StreamingTranscribeAndRedact
StreamingTranscribeAndRedact(StreamingTranscribeAndRedactRequest) StreamingTranscribeAndRedactResponse
Performs bidirectional streaming speech recognition and redaction. The client receives redacted audio and transcriptions while sending audio.
Messages
- If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of those fields populated.
- If a field is labeled repeated, then the generated code will accept an array (or list, or slice, depending on the language).
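The oneof rule above can be sketched with a small validation helper. This is a hypothetical illustration using plain dicts, not part of the generated code; real gRPC stubs enforce this constraint for you:

```python
def check_oneof(message: dict, oneof_fields: tuple) -> None:
    """Raise if the message does not populate exactly one oneof field.

    `message` is a plain dict standing in for a request object.
    """
    populated = [f for f in oneof_fields if message.get(f) is not None]
    if len(populated) != 1:
        raise ValueError(
            f"exactly one of {oneof_fields} must be set, got {populated}")

# A streaming request carries either the config or an audio chunk, never both:
check_oneof({"config": {"model_id": "en_v1"}}, ("config", "audio"))  # ok
check_oneof({"audio": b"\x00\x01"}, ("config", "audio"))             # ok
```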
ListModelsRequest
The top-level message sent by the client for the ListModels method.
ListModelsResponse
The message returned to the client by the ListModels method.
Fields
- models (ModelInfo repeated) List of models available for use on Trifid server.
ModelInfo
Description of a Trifid Model
Fields
- id (string) Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RedactionConfig message.
- name (string) Model name. This is a concise name describing the model, and may be presented to the end user, for example, to help choose which model to use for their recognition task.
- redaction_classes (string repeated) List of supported redaction classes.
RedactTextRequest
The top-level message sent by the client for the RedactText method.
Fields
- redaction_config (RedactionConfig) Configuration for the redaction engine to use.
- text (string) The text to be redacted.
RedactionConfig
Configuration for setting up a redaction engine.
Fields
- model_id (string) Unique identifier of the model to use, as obtained from a ModelInfo message.
- redaction_classes (string repeated) List of whitelisted redaction classes. If the list is empty, the server's default redaction class list will be used.
- disable_streaming (bool) This is an optional field. If set to true, Cobalt Privacy Screen will redact the entire transcript at once; doing so increases redaction accuracy at the cost of higher latency. If set to false, Cobalt Privacy Screen will redact one utterance at a time and return each result as soon as possible. The default is false.
- custom_classes (CustomClasses repeated) This is an optional field. If set, the provided list will be used to extend the list of redaction classes.
CustomClasses
CustomClasses allows the client to define a new redaction class. Patterns defined here will override the default redaction class for matching tokens.
Fields
- redaction_class (string) The name of the new redaction class. For example, this could be “COMPANY_NAME”.
- pattern (string) A Python regular expression used to identify tokens in text that should be redacted to this new redaction class. For example, “COBALT|GOOGLE|MICROSOFT”, or more complex patterns such as “^COMPANY-[\d]{4}$”.
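The documented example patterns behave as standard Python regexes. The snippet below only demonstrates the Python regex semantics of those examples; how the server applies the patterns internally is not shown here:

```python
import re

# Patterns from the CustomClasses examples: a simple alternation and an
# anchored pattern requiring exactly four digits.
company_name = re.compile(r"COBALT|GOOGLE|MICROSOFT")
company_id = re.compile(r"^COMPANY-[\d]{4}$")

print(bool(company_name.search("Call COBALT today")))  # True
print(bool(company_id.match("COMPANY-1234")))          # True
print(bool(company_id.match("COMPANY-12345")))         # False (5 digits)
```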
RedactTranscriptRequest
The top-level message sent by the client for the RedactTranscript method. Contains the redaction config and a transcript to redact.
Fields
- config (RedactionConfig) The redaction config.
- transcript (Transcript) The transcript to redact.
RedactTranscriptResponse
The top-level message sent by the server for the RedactTranscript method. Contains the redacted transcript.
Fields
- transcript (Transcript )
Transcript
A Transcript contains the utterances of the audio.
Fields
- utterances (Utterance repeated)
Utterance
A single utterance of the audio.
Fields
-
text (string ) Text representing the utterance of the audio.
-
audio_channel (uint32) Channel of the audio file associated with this utterance. Channels are 0-indexed, so for mono audio data this value will always be 0.
-
start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio corresponding to the start of this utterance.
-
duration_ms (uint64 ) Duration in milliseconds of the current utterance in the audio.
-
asr_confidence (double ) ASR confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct.
-
words_info (WordInfo repeated) Word-level information corresponding to the utterance. This field contains word-level timestamps, which are essential as input for audio redaction. This field is only available in an output utterance if enable_word_info was set to true in the RedactionConfig.
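Because words_info carries per-word timestamps and redaction flags, a client can derive the audio spans that need to be masked. The sketch below is a hypothetical illustration using plain dicts that mirror WordInfo fields, not part of the SDK:

```python
def redacted_spans(words):
    """Collect (start_ms, end_ms) spans for redacted words, merging
    adjacent or overlapping spans so contiguous PII is masked once.

    `words` is a list of dicts mirroring WordInfo fields:
    start_time_ms, duration_ms, is_redacted.
    """
    spans = []
    for w in words:
        if not w["is_redacted"]:
            continue
        start = w["start_time_ms"]
        end = start + w["duration_ms"]
        if spans and start <= spans[-1][1]:
            # Touches or overlaps the previous span: extend it.
            spans[-1] = (spans[-1][0], max(end, spans[-1][1]))
        else:
            spans.append((start, end))
    return spans

words = [
    {"start_time_ms": 0,   "duration_ms": 300, "is_redacted": False},
    {"start_time_ms": 320, "duration_ms": 200, "is_redacted": True},
    {"start_time_ms": 520, "duration_ms": 180, "is_redacted": True},
    {"start_time_ms": 900, "duration_ms": 250, "is_redacted": False},
]
print(redacted_spans(words))  # [(320, 700)]
```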
StreamingRedactTranscribedAudioRequest
The top-level messages sent by the client for the
StreamingRedactTranscribedAudio method. In this streaming call, multiple
StreamingRedactTranscribedAudioRequest messages should be sent. The first
message must contain a RedactTranscribedAudioConfig message only and all
subsequent messages must contain audio data only.
Fields
- oneof request.config (RedactTranscribedAudioConfig)
- oneof request.audio (bytes)
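The config-first message ordering described above can be sketched as a request generator. Plain dicts stand in for the generated request classes, and the model id is a made-up example:

```python
def request_stream(config, audio_chunks):
    """Yield streaming requests in the order the API requires:
    one config-only message first, then audio-only messages."""
    yield {"config": config}
    for chunk in audio_chunks:
        yield {"audio": chunk}

msgs = list(request_stream({"redaction_config": {"model_id": "en_v1"}},
                           [b"\x00" * 4, b"\x01" * 4]))
print(len(msgs))                            # 3
print("config" in msgs[0])                  # True
print(all("audio" in m for m in msgs[1:]))  # True
```

With a real stub, each yielded item would be a StreamingRedactTranscribedAudioRequest, and the generator would be passed to the streaming call.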
RedactTranscribedAudioConfig
Configuration for setting up a StreamingRedactTranscribedAudio method.
Fields
- redaction_config (RedactionConfig) Text redaction config.
- transcript (Transcript) Transcription of the entire audio. This must be ready before sending the audio.
StreamingRedactTranscribedAudioResponse
The top-level message sent by the server for the StreamingRedactTranscribedAudio method. In this streaming call, multiple StreamingRedactTranscribedAudioResponse messages, each containing either an Utterance or redacted audio data, will be returned.
Fields
StreamingTranscribeAndRedactRequest
The top-level messages sent by the client for the
StreamingTranscribeAndRedact method. In this streaming call, multiple
StreamingTranscribeAndRedactRequest messages should be sent. The first
message must contain a TranscribeAndRedactConfig message only and all
subsequent messages must contain audio data only.
Fields
- oneof request.config (TranscribeAndRedactConfig)
- oneof request.audio (bytes)
TranscribeAndRedactConfig
Configuration for setting up a StreamingTranscribeAndRedact method.
Fields
- redaction_config (RedactionConfig) Text redaction config.
- enable_unredacted_transcript (bool) This is an optional field. If set to true, each utterance result will include the unredacted utterance. If set to false, no unredacted utterance will be returned. The default is false.
StreamingTranscribeAndRedactResponse
The top-level message sent by the server for the StreamingTranscribeAndRedact method. In this streaming call, multiple StreamingTranscribeAndRedactResponse messages, each containing either a TranscribeAndRedactUtterance or redacted audio data, will be returned.
Fields
VersionRequest
The top-level message sent by the client for the Version method.
VersionResponse
The top-level message sent by the server for the Version method.
Fields
- version (string ) Version of the server handling these requests.
WordInfo
Word-level details for words in an utterance.
Fields
-
text (string ) The actual word corresponding to the utterance.
-
asr_confidence (double ) ASR confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.
-
start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word.
-
duration_ms (uint64 ) Duration in milliseconds of the current word in the spoken audio.
-
is_redacted (bool) If this is set to true, it denotes that the current word is a redacted word (or the original word corresponding to a redacted word).
-
redaction_class (string ) Recognized redaction class. This is available only if the current word is a redacted word.
-
redaction_confidence (double) Redaction confidence estimate between 0 and 1. A higher number represents a higher likelihood that the redaction is correct. This is available only if the current word is a redacted word.
Enums
Scalar Value Types
2.2 - VoiceBio
2.2.1 - Getting Started
Using Cobalt VoiceBio
- A typical VoiceBio release, provided as a compressed archive, will contain a Linux binary (voicebio-server) for the required native CPU architecture, an appropriate Dockerfile, and models.
- Cobalt VoiceBio runs either locally on Linux or in Docker.
- Cobalt VoiceBio serves the VoiceBio gRPC API on port 2727.
- To quickly try out VoiceBio, first start the server as shown below, then use the SDK in your preferred language to call VoiceBio from the command line or within your application.
Info
The cobalt.license.key file will be provided separately and must be copied into the directory resulting from decompressing the archive. Please do this before running the steps below.
Running VoiceBio Server Locally on Linux
./voicebio-server
- By default, the binary assumes the presence of a configuration file, located in the same directory, named:
voicebio-server.cfg.toml. A different config file may be specified using the --config argument.
Running VoiceBio Server as a Docker Container
To build and run the Docker image for VoiceBio, run:
docker build -t cobalt-voicebio .
docker run -p 2727:2727 -p 8080:8080 cobalt-voicebio
How to Get a Copy of the VoiceBio Server and Models
Contact us for getting a release best suited to your requirements.
The release you will receive is a compressed archive (tar.bz2), generally structured as follows:
release.tar.bz2
├── COPYING
├── README.md
├── voicebio-server
├── voicebio-server.cfg.toml
├── Dockerfile
├── models
│ └── en_US-16khz
│
└── cobalt.license.key [ provided separately, needs to be copied over ]
- The README.md file contains information about this release and instructions for how to start the server on your system.
- The voicebio-server is the server program, which is configured using the voicebio-server.cfg.toml file.
- The Dockerfile can be used to create a container that will let you run the VoiceBio server on non-Linux systems such as macOS and Windows.
- The models directory contains the speaker ID models. The contents of this directory will depend on the models you are provided.
System Requirements
Cobalt VoiceBio runs on Linux. You can run it directly as a Linux application.
You can evaluate the product on Windows or Linux using Docker Desktop, but we would not recommend this setup for use in a production environment.
A Cobalt VoiceBio release typically includes a single VoiceBio model together with binaries and config files. The general purpose VoiceBio models take up to 100MB of disk space, and need a minimum of 2GB RAM when evaluating locally. For production workloads, we recommend configuring containerized applications with each instance allocated with 4 CPUs and 4GB RAM.
Cobalt VoiceBio runs on x86_64 CPUs. We also support Arm64 CPUs, including processors such as the Graviton (AWS c7g EC2 instances). VoiceBio is significantly more cost-effective to run on C7g instances than on similarly sized Intel or AMD processors, and we can provide an Arm64 release on request.
To integrate Cobalt VoiceBio into your application, please follow the next steps to install or generate the SDK in a language of your choice.
2.2.2 - Generating SDKs
- APIs for all of Cobalt’s services are defined as a protocol buffer specification, or simply a proto file, which can be found in the cobaltspeech/proto GitHub repository.
- The proto file allows a developer to auto-generate client SDKs for a number of different programming languages. Step-by-step instructions for generating your own SDK can be found below.
We provide pre-generated SDKs for a couple of languages. You can choose to use these instead of generating your own. These are listed here along with instructions on how to install / import them into your projects.
Pre-generated SDKs
Golang
- Pre-generated SDK files for Golang can be found in the cobaltspeech/go-genproto repo.
- To use it in your Go project, simply import it:
import voicebiopb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
- An example client using the above repo can be found here.
Python
- Pre-generated SDK files for Python can be found in the cobaltspeech/py-genproto repo.
- The Python SDK requires Python >= 3.5. You may use pip to perform a system-wide install, or use virtualenv for a local install. To use it in your Python project, install it:
pip install --upgrade pip
pip install "git+https://github.com/cobaltspeech/py-genproto"
Generating SDKs
Step 1. Installing buf
- To work with proto files, we recommend using buf, a user-friendly command line tool that can be configured to generate documentation, schemas, and SDK code for different languages.
# Latest version as of March 14th, 2023.
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/bin"
VERSION="1.15.1"
URL="https://github.com/bufbuild/buf/releases/download/v${VERSION}/buf-$(uname -s)-$(uname -m)"
curl -L ${URL} -o "${COBALT}/bin/buf"
# Give executable permissions and add to $PATH.
chmod +x "${COBALT}/bin/buf"
export PATH="${PATH}:${COBALT}/bin"

# Alternatively, install buf via Homebrew:
brew install bufbuild/buf/buf

Step 2. Getting proto files
- Clone the cobaltspeech/proto repository:
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"
# Change this to where you want to clone the repo to.
PROTO_REPO="${COBALT}/git/proto"
git clone https://github.com/cobaltspeech/proto "${PROTO_REPO}"
Step 3. Generating code
- The cobaltspeech/proto repo provides a buf.gen.yaml config file to get you started with a couple of languages.
- Other plugins can be added to the buf.gen.yaml file to generate SDK code for more languages.
- To generate the SDKs, simply run the following (assuming the buf binary is in your $PATH):
cd "${PROTO_REPO}"
# Removing any previously generated files.
rm -rf ./gen
# Generating code for all proto files inside the `proto` directory.
buf generate proto
- You should now have a folder called gen inside ${PROTO_REPO} that contains the generated code. The latest version of the VoiceBio API is v1. You can import / include / copy the generated files into your projects as per the conventions of different languages.
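For instance, one way to use the generated Python files without packaging them is to put the gen/py directory on the import path (the repository path below is illustrative; adjust it to wherever you cloned cobaltspeech/proto in Step 2):

```python
import os
import sys

# Location of the cloned cobaltspeech/proto repository (illustrative).
PROTO_REPO = os.path.expanduser("~/cobalt/git/proto")

# Prepend the generated Python package directory to the import path.
sys.path.insert(0, os.path.join(PROTO_REPO, "gen", "py"))

# The generated modules can then be imported, as in the later examples:
# import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
```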
gen
├── ... other languages ...
└── py
└── cobaltspeech
├── ... other services ...
└── voicebio
└── v1
├── voicebio_pb2_grpc.py
├── voicebio_pb2.py
└── voicebio_pb2.pyi

gen
├── ... other languages ...
└── go
├── cobaltspeech
│ ├── ...
│ └── voicebio
│ └── v1
│ ├── voicebio_grpc.pb.go
│ └── voicebio.pb.go
└── gw
└── cobaltspeech
├── ...
└── voicebio
└── v1
└── voicebio.pb.gw.go

Step 4. Installing gRPC and protobuf
- A couple of gRPC and protobuf dependencies are required along with the code generated above. The method of installing them depends on the programming language being used.
- These dependencies and the most common ways of installing / including them are listed below for a few languages.
# It is encouraged to do this inside a Python virtual environment
# to avoid creating version conflicts for other scripts that may
# be using these libraries.
pip install --upgrade protobuf
pip install --upgrade grpcio
pip install --upgrade google-api-python-client

go get google.golang.org/protobuf
go get google.golang.org/grpc
go get google.golang.org/genproto

# More details on gRPC installation can be found at:
# https://grpc.io/docs/languages/cpp/quickstart/
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"
# Latest version as of 14th March, 2023.
VERSION="v1.52.0"
GRPC_REPO="${COBALT}/git/grpc-${VERSION}"
git clone \
--recurse-submodules --depth 1 --shallow-submodules \
-b "${VERSION}" \
https://github.com/grpc/grpc ${GRPC_REPO}
cd "${GRPC_REPO}"
mkdir -p cmake/build
# Change this to where you want to install libprotobuf and libgrpc.
# It is encouraged to install gRPC locally as there is no easy way to
# uninstall gRPC after you’ve installed it globally.
INSTALL_DIR="${COBALT}"
cd cmake/build
cmake \
-DgRPC_INSTALL=ON \
-DgRPC_BUILD_TESTS=OFF \
-DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} \
../..
make -j
make install

2.2.3 - Connecting to the Server
- Once you have your VoiceBio server up and running, and have installed or generated the SDK for your project, you can connect to a running instance of VoiceBio server by “dialing” a gRPC connection.
- First, you need to know the address where the server is running, e.g. host:grpc_port. By default, this is localhost:2727 and is logged to the terminal as grpcAddr when you first start VoiceBio server:
2023/08/14 10:49:38 info {"license":"Copyright © 2023--present. Cobalt Speech and Language, Inc. For additional details, including information about open source components used in this software, please see the COPYING file bundled with this program."}
2023/08/14 10:49:38 info {"msg":"reading config file","path":"configs/voicebio-server.config.toml"}
2023/08/14 10:49:38 info {"msg":"server initializing"}
2023/08/14 10:49:38 info {"msg":"license verified"}
2023/08/14 10:49:41 info {"msg":"runtime initialized","model_count":"2","init_time_taken":"2.512935646s"}
2023/08/14 10:49:41 info {"msg":"server started","grpcAddr":"[::]:2727","httpApiAddr":"[::]:8080","httpOpsAddr":"[::]:8081"}
Info
If you are hosting your server with Transport Layer Security (TLS) enabled, then please follow the instructions under Connect with TLS. Otherwise, you can follow the instructions for the Default Connection method.

Default Connection
The following code snippet connects to the server and queries its version. It connects to the server using an “insecure” gRPC channel. This would be the case if you have just started up a local instance of VoiceBio server without TLS enabled.
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

package main
import (
"context"
"fmt"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebiopb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebiopb.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebiopb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
}

Connect with TLS
- In our recommended setup for deployment, TLS is enabled in the gRPC connection, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers.
- The following snippets show how to connect to a VoiceBio Server that has TLS enabled. They use Cobalt’s self-hosted demo server at demo.cobaltspeech.com:2727, but you would of course use your own server instance.
Note
Commercial use of the demo server at demo.cobaltspeech.com:2727 is not permitted.
This server is for testing and demonstration purposes only and is not guaranteed to
support high availability or high volume. Data uploaded to the server may be stored
for internal purposes.
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "demo.cobaltspeech.com:2727"
# Setup a gRPC connection with TLS. You can optionally provide your own
# root certificates and private key to grpc.ssl_channel_credentials()
# for mutually authenticated TLS.
creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(serverAddress, creds)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

package main
import (
"context"
"crypto/tls"
"fmt"
"os"
"time"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials"
voicebiopb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "demo.cobaltspeech.com:2727"
connectTimeout = 10 * time.Second
)
// Setup a gRPC connection with TLS. You can optionally provide your own
// root certificates and private key through tls.Config for mutually
// authenticated TLS.
tlsCfg := tls.Config{}
creds := credentials.NewTLS(&tlsCfg)
ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(creds),
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebiopb.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebiopb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
}

Client Authentication
- In some setups, it may be desirable for the server to also validate clients connecting to it and respond only to the ones it can verify. If your VoiceBio server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.
- Please note that in the client-authentication mode, the client will still also verify the server’s certificate; this setup therefore uses mutually authenticated TLS.
- The following snippets show how to present client certificates when setting up the credentials. These can then be used in the same way as the examples above to connect to a TLS-enabled server.
creds = grpc.ssl_channel_credentials(
root_certificates=root_certificates, # PEM certificate as byte string
private_key=private_key, # PEM client key as byte string
certificate_chain=certificate_chain, # PEM client certificate as byte string
)

package main
import (
// ...
"crypto/tls"
"crypto/x509"
"fmt"
"os"
// ..
)
func main() {
// ...
// Root PEM certificate for validating self-signed server certificate
var rootCert []byte
// Client PEM certificate and private key.
var certPem, keyPem []byte
caCertPool := x509.NewCertPool()
if ok := caCertPool.AppendCertsFromPEM(rootCert); !ok {
fmt.Printf("unable to use given caCert\n")
os.Exit(1)
}
clientCert, err := tls.X509KeyPair(certPem, keyPem)
if err != nil {
fmt.Printf("unable to use given client certificate and key: %v\n", err)
os.Exit(1)
}
tlsCfg := tls.Config{
RootCAs: caCertPool,
Certificates: []tls.Certificate{clientCert},
}
creds := credentials.NewTLS(&tlsCfg)
// ...
}

2.2.4 - Streaming Enrollment
- The following example shows how to stream audio using VoiceBio’s StreamingEnroll request and generate a voiceprint. The stream can come from a file on disk or directly from a microphone in real time.
Streaming from an audio file
- We support several headered file formats, including WAV, MP3, and FLAC. For more details, please see the protocol buffer specification here. For best accuracy, it is recommended to use an uncompressed or losslessly compressed audio format like WAV or FLAC.
- The examples below use a WAV file as input. We will query the server for available models and use the first model to generate the voiceprint.
- Generated voiceprints can be updated and made more robust by re-enrolling them with additional audio. Please see the re-enrollment section.
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id
# Set the enrollment config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = voicebio.EnrollmentConfig(
model_id=modelID,
previous_voiceprint=None,
)
# The first request to the server should only contain the
# configuration. Subsequent requests should contain audio
# bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield voicebio.StreamingEnrollRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield voicebio.StreamingEnrollRequest(audio=voicebio.Audio(data=data))
data = audio.read(bufferSize)
# Streaming audio to the server.
with open("test.wav", "rb") as audio:
result = client.StreamingEnroll(stream(cfg, audio))
# A certain minimum duration of speech is required for completing enrollment.
# The enrollment status contains information on whether that has been met or
# whether additional audio is required.
print(f"Enrollment Status:\n{result.enrollment_status}\n")
# Saving the voiceprint data to a file. This can be provided again
# in another StreamingEnroll request (for continuing enrollment) or
# submitted for verification / identification requests.
with open("voiceprint.bin", 'w') as f:
f.write(result.voiceprint.data)

package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Selecting the first model.
cfg := &voicebio.EnrollmentConfig{
ModelId: modelResp.Models[0].Id,
PreviousVoiceprint: nil,
}
// Opening audio file.
audio, err := os.Open("test.wav")
if err != nil {
fmt.Printf("failed to open audio file: %v\n", err)
os.Exit(1)
}
defer audio.Close()
// Starting enrollment.
result, err := StreamingEnroll(ctx, client, cfg, audio)
if err != nil {
fmt.Printf("failed to run streaming enrollment: %v\n", err)
os.Exit(1)
}
// A certain minimum duration of speech is required for completing enrollment.
// The enrollment status contains information on whether that has been met or
// whether additional audio is required.
fmt.Printf("Enrollment Status: %v\n", result.EnrollmentStatus)
// Saving the voiceprint data to a file. This can be provided again
// in another StreamingEnroll request (for continuing enrollment) or
// submitted for verification / identification requests.
if err := os.WriteFile("voiceprint.bin", []byte(result.Voiceprint.Data), os.ModePerm); err != nil {
fmt.Printf("failed to write voiceprint data: %v\n", err)
os.Exit(1)
}
}
// StreamingEnroll wraps the streaming API for performing speaker enrollment
// (i.e. voiceprint generation) using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingEnroll(
ctx context.Context,
client voicebio.VoiceBioServiceClient,
cfg *voicebio.EnrollmentConfig,
audio io.Reader,
) (*voicebio.StreamingEnrollResponse, error) {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingEnroll(ctx)
if err != nil {
return nil, err
}
// Sending audio.
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in the CloseAndRecv call. We
// therefore return on non-EOF errors here.
return nil, err
}
// Returning result.
return stream.CloseAndRecv()
}
// sendAudio sends the config and audio to a stream.
func sendAudio(
stream voicebio.VoiceBioService_StreamingEnrollClient,
cfg *voicebio.EnrollmentConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the config.
if err := stream.Send(&voicebio.StreamingEnrollRequest{
Request: &voicebio.StreamingEnrollRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&voicebio.StreamingEnrollRequest{
Request: &voicebio.StreamingEnrollRequest_Audio{
Audio: &voicebio.Audio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// audio. In any case, we need to CloseSend and return
// the appropriate error from the function.
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}

Streaming from microphone
- Streaming audio from microphone input requires a reader interface that can provide audio samples recorded from a microphone; typically this requires interaction with system libraries. Another option is to use an external command line tool like sox to record and pipe audio into the client.
- The examples below use the latter approach, using the rec command provided with sox to record and stream the audio.
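As a quick sanity check when configuring raw audio, note that the data rate is sample_rate × (bit_depth / 8) × channels bytes per second. The small helper below (illustrative, not part of the SDK) estimates how many bytes a recording of a given duration should produce:

```python
def raw_audio_bytes(seconds, sample_rate, bit_depth=16, channels=1):
    """Number of bytes produced by `seconds` of raw PCM audio."""
    return int(seconds * sample_rate * (bit_depth // 8) * channels)

# For the format used below (16-bit signed, mono), a 10 second
# recording at a 16 kHz model sample rate yields 320000 bytes.
print(raw_audio_bytes(10, 16000))
```

Comparing this figure against the amount of data actually read from the rec pipe is a simple way to confirm the microphone capture is working before debugging anything server-side.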
#!/usr/bin/env python3
# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.
import subprocess
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id
# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = voicebio.EnrollmentConfig(
model_id=modelID,
previous_voiceprint=None,
audio_format=voicebio.AudioFormat(
audio_format_raw=voicebio.AudioFormatRAW(
encoding="AUDIO_ENCODING_SIGNED",
bit_depth=16,
byte_order="BYTE_ORDER_LITTLE_ENDIAN",
sample_rate=m.attributes.sample_rate,
channels=1,
)
),
)
# Open microphone stream using sox's rec command and record
# audio using the config specified above for *10 seconds*.
maxDuration = 10
cmd = f"rec -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 - trim 0 {maxDuration}"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout
try:
_ = audio.read(1024) # Trying to read some bytes as sanity check.
except Exception as err:
print(f"[ERROR] failed to read audio from mic stream: {err}")
print(f"\n[INFO] recording {maxDuration} seconds of audio from the microphone ... \n")
# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield voicebio.StreamingEnrollRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield voicebio.StreamingEnrollRequest(audio=voicebio.Audio(data=data))
data = audio.read(bufferSize)
# Streaming audio to the server.
result = client.StreamingEnroll(stream(cfg, audio))
# A certain minimum duration of speech is required for completing enrollment.
# The enrollment status contains information on whether that has been met or
# whether additional audio is required.
print(f"Enrollment Status:\n{result.enrollment_status}\n")
# Saving the voiceprint data to a file. This can be provided again
# in another StreamingEnroll request (for continuing enrollment) or
# submitted for verification / identification requests.
with open("voiceprint.bin", 'w') as f:
f.write(result.voiceprint.data)
audio.close()
mic.kill()

package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"os/exec"
"strings"
"golang.org/x/sync/errgroup"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Selecting first model.
m := modelResp.Models[0]
// Setting audio format to be raw 16-bit signed little endian audio samples
// recorded at the sample rate expected by the model.
cfg := &voicebio.EnrollmentConfig{
ModelId: m.Id,
PreviousVoiceprint: nil,
AudioFormat: &voicebio.AudioFormat{AudioFormat: &voicebio.AudioFormat_AudioFormatRaw{
AudioFormatRaw: &voicebio.AudioFormatRAW{
Encoding: voicebio.AudioEncoding_AUDIO_ENCODING_SIGNED,
SampleRate: m.Attributes.SampleRate,
BitDepth: 16,
ByteOrder: voicebio.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
Channels: 1,
},
},
},
}
// Open microphone stream using sox's rec command and record
// audio using the config specified above for *10 seconds*.
maxDuration := 10
args := fmt.Sprintf("-t raw -r %d -e signed -b 16 -L -c 1 - trim 0 %d", m.Attributes.SampleRate, maxDuration)
cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)
cmd.Stderr = os.Stderr
audio, err := cmd.StdoutPipe()
if err != nil {
fmt.Printf("failed to open microphone stream: %v\n", err)
os.Exit(1)
}
// Starting routines to record from microphone and stream to server
// using an errgroup.Group that returns if either one encounters an error.
eg, ctx := errgroup.WithContext(ctx)
eg.Go(func() error {
fmt.Printf("\n[INFO] recording %d seconds from microphone \n", maxDuration)
if err := cmd.Run(); err != nil {
return fmt.Errorf("record from microphone: %w", err)
}
return nil
})
// Starting enrollment.
result, err := StreamingEnroll(ctx, client, cfg, audio)
if err != nil {
fmt.Printf("failed to run streaming enrollment: %v\n", err)
os.Exit(1)
}
if err := eg.Wait(); err != nil {
fmt.Printf("%v\n", err)
os.Exit(1)
}
// A certain minimum duration of speech is required for completing enrollment.
// The enrollment status contains information on whether that has been met or
// whether additional audio is required.
fmt.Printf("Enrollment Status: %v\n", result.EnrollmentStatus)
// Saving the voiceprint data to a file. This can be provided again
// in another StreamingEnroll request (for continuing enrollment) or
// submitted for verification / identification requests.
if err := os.WriteFile("voiceprint.bin", []byte(result.Voiceprint.Data), os.ModePerm); err != nil {
fmt.Printf("failed to write voiceprint data: %v\n", err)
os.Exit(1)
}
}
// StreamingEnroll wraps the streaming API for performing speaker enrollment
// (i.e. voiceprint generation) using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingEnroll(
ctx context.Context,
client voicebio.VoiceBioServiceClient,
cfg *voicebio.EnrollmentConfig,
audio io.Reader,
) (*voicebio.StreamingEnrollResponse, error) {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingEnroll(ctx)
if err != nil {
return nil, err
}
// Sending audio.
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in the CloseAndRecv call. We
// therefore return on non-EOF errors here.
return nil, err
}
// Returning result.
return stream.CloseAndRecv()
}
// sendAudio sends audio to a stream.
func sendAudio(
stream voicebio.VoiceBioService_StreamingEnrollClient,
cfg *voicebio.EnrollmentConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the config.
if err := stream.Send(&voicebio.StreamingEnrollRequest{
Request: &voicebio.StreamingEnrollRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&voicebio.StreamingEnrollRequest{
Request: &voicebio.StreamingEnrollRequest_Audio{
Audio: &voicebio.Audio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// audio. In any case, we need to CloseSend and return
// the appropriate error from the function.
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}

Re-enrollment
- Voiceprints can be updated and made more robust by re-enrolling them with additional audio. This can be easily done by providing previous voiceprint data in the EnrollmentConfig along with additional audio in a new StreamingEnroll request.
# Connect to server ...
with open("voiceprint.bin", 'r') as f:
voiceprint = f.read().strip()
cfg = voicebio.EnrollmentConfig(
model_id=modelID,
previous_voiceprint=voicebio.Voiceprint(data=voiceprint),
)
# Send audio to server ...

package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
// Connect to server ...
// Reading old voiceprint data.
data, err := os.ReadFile("voiceprint.bin")
if err != nil {
fmt.Printf("\nfailed to read voiceprint data: %v\n", err)
os.Exit(1)
}
cfg := &voicebio.EnrollmentConfig{
ModelId: modelResp.Models[0].Id,
PreviousVoiceprint: &voicebio.Voiceprint{Data: string(data)},
}
// Send audio to server ...
}

2.2.5 - Streaming Verification
- The following example shows how to stream audio using VoiceBio’s StreamingVerify request and verify whether the audio matches the provided voiceprint. The stream can come from a file on disk or directly from a microphone in real time.
Streaming from an audio file
- We support several headered file formats, including WAV, MP3, and FLAC. For more details, please see the protocol buffer specification here. For best accuracy, it is recommended to use an uncompressed or losslessly compressed audio format like WAV or FLAC.
- The examples below use a WAV file as input. We will query the server for available models and use the first model to score and verify given audio against a given voiceprint.
Info
Voiceprints provided in StreamingVerify requests must be generated using the
same or a compatible model via StreamingEnroll.
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id
# Loading reference voiceprint.
with open("voiceprint.bin", 'r') as f:
voiceprint = voicebio.Voiceprint(data=f.read().strip())
# Set the verification config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = voicebio.VerificationConfig(
model_id=modelID,
voiceprint=voiceprint,
)
# The first request to the server should only contain the
# configuration. Subsequent requests should contain audio
# bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield voicebio.StreamingVerifyRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield voicebio.StreamingVerifyRequest(audio=voicebio.Audio(data=data))
data = audio.read(bufferSize)
# Streaming audio to the server.
with open("test.wav", "rb") as audio:
resp = client.StreamingVerify(stream(cfg, audio))
# Server returns a similarity score along with whether the score
# exceeded the server-configured threshold for being a match.
print(f"Verification Score: {resp.result.similarity_score:1.3f}, Match: {resp.result.is_match}")
package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Reading voiceprint data.
data, err := os.ReadFile("voiceprint.bin")
if err != nil {
fmt.Printf("\nfailed to read voiceprint data: %v\n", err)
os.Exit(1)
}
// Selecting the first model.
cfg := &voicebio.VerificationConfig{
ModelId: modelResp.Models[0].Id,
Voiceprint: &voicebio.Voiceprint{Data: string(data)},
}
// Opening audio file.
audio, err := os.Open("test.wav")
if err != nil {
fmt.Printf("failed to open audio file: %v\n", err)
os.Exit(1)
}
defer audio.Close()
// Starting verification.
resp, err := StreamingVerify(ctx, client, cfg, audio)
if err != nil {
fmt.Printf("failed to run streaming verification: %v\n", err)
os.Exit(1)
}
// Server returns a similarity score along with whether the score
// exceeded the server-configured threshold for being a match.
fmt.Printf("Verification Score: %1.3f, Match: %v\n", resp.Result.SimilarityScore, resp.Result.IsMatch)
}
// StreamingVerify wraps the streaming API for performing speaker verification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingVerify(
ctx context.Context,
client voicebio.VoiceBioServiceClient,
cfg *voicebio.VerificationConfig,
audio io.Reader,
) (*voicebio.StreamingVerifyResponse, error) {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingVerify(ctx)
if err != nil {
return nil, err
}
// Sending audio.
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in the CloseAndRecv call. We
// therefore return on non-EOF errors here.
return nil, err
}
// Returning result.
return stream.CloseAndRecv()
}
// sendAudio sends the config and audio to a stream.
func sendAudio(
stream voicebio.VoiceBioService_StreamingVerifyClient,
cfg *voicebio.VerificationConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the config.
if err := stream.Send(&voicebio.StreamingVerifyRequest{
Request: &voicebio.StreamingVerifyRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&voicebio.StreamingVerifyRequest{
Request: &voicebio.StreamingVerifyRequest_Audio{
Audio: &voicebio.Audio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// the audio source. In either case, we need to CloseSend and
// then return the appropriate error (nil for io.EOF).
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}
Streaming from microphone
-
Streaming audio from a microphone requires a reader interface that can provide audio samples recorded from the microphone; typically this means interacting with system libraries. Another option is to use an external command-line tool like
sox to record and pipe audio into the client. -
The examples below take the latter approach, using the
rec command provided with sox to record and stream the audio.
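Whether the source is a file or a pipe from sox, the client code follows the same pattern: read fixed-size chunks from a reader until it is exhausted. That pattern can be sketched independently of gRPC; here `fake_mic` is an in-memory stand-in for the sox stdout pipe:

```python
import io

def chunks(reader, buffer_size=1024):
    """Yield successive chunks read from `reader` until EOF,
    mirroring the streaming generators in the examples below."""
    while True:
        data = reader.read(buffer_size)
        if not data:
            return
        yield data

# A 2500-byte in-memory buffer standing in for the sox stdout pipe.
fake_mic = io.BytesIO(b"\x00" * 2500)
print([len(c) for c in chunks(fake_mic)])  # chunk sizes: 1024, 1024, 452
```

In the real examples each chunk is wrapped in a streaming request message; the final, shorter chunk is sent as-is and the stream is then closed.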
#!/usr/bin/env python3
# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.
import subprocess
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id
# Loading reference voiceprint.
with open("voiceprint.bin", 'r') as f:
voiceprint = voicebio.Voiceprint(data=f.read().strip())
# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = voicebio.VerificationConfig(
model_id=modelID,
voiceprint=voiceprint,
audio_format=voicebio.AudioFormat(
audio_format_raw=voicebio.AudioFormatRAW(
encoding="AUDIO_ENCODING_SIGNED",
bit_depth=16,
byte_order="BYTE_ORDER_LITTLE_ENDIAN",
sample_rate=m.attributes.sample_rate,
channels=1,
)
),
)
# Open microphone stream using sox's rec command and record
# audio using the config specified above for *10 seconds*.
maxDuration = 10
cmd = f"rec -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 - trim 0 {maxDuration}"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout
try:
_ = audio.read(1024) # Trying to read some bytes as sanity check.
except Exception as err:
print(f"[ERROR] failed to read audio from mic stream: {err}")
print(f"\n[INFO] recording {maxDuration} seconds of audio from the microphone ... \n")
# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield voicebio.StreamingVerifyRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield voicebio.StreamingVerifyRequest(audio=voicebio.Audio(data=data))
data = audio.read(bufferSize)
# Streaming audio to the server.
resp = client.StreamingVerify(stream(cfg, audio))
# Server returns a similarity score along with whether the score
# exceeded the server-configured threshold for being a match.
print(f"Verification Score: {resp.result.similarity_score:1.3f}, Match: {resp.result.is_match}")
audio.close()
mic.kill()
package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"os/exec"
"strings"
"golang.org/x/sync/errgroup"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Selecting first model.
m := modelResp.Models[0]
// Reading voiceprint data.
data, err := os.ReadFile("voiceprint.bin")
if err != nil {
fmt.Printf("\nfailed to read voiceprint data: %v\n", err)
os.Exit(1)
}
// Setting audio format to be raw 16-bit signed little endian audio samples
// recorded at the sample rate expected by the model.
cfg := &voicebio.VerificationConfig{
ModelId: m.Id,
Voiceprint: &voicebio.Voiceprint{Data: string(data)},
AudioFormat: &voicebio.AudioFormat{AudioFormat: &voicebio.AudioFormat_AudioFormatRaw{
AudioFormatRaw: &voicebio.AudioFormatRAW{
Encoding: voicebio.AudioEncoding_AUDIO_ENCODING_SIGNED,
SampleRate: m.Attributes.SampleRate,
BitDepth: 16,
ByteOrder: voicebio.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
Channels: 1,
},
},
},
}
// Open microphone stream using sox's rec command and record
// audio using the config specified above for *10 seconds*.
maxDuration := 10
args := fmt.Sprintf("-t raw -r %d -e signed -b 16 -L -c 1 - trim 0 %d", m.Attributes.SampleRate, maxDuration)
cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)
cmd.Stderr = os.Stderr
audio, err := cmd.StdoutPipe()
if err != nil {
fmt.Printf("failed to open microphone stream: %v\n", err)
os.Exit(1)
}
// Starting routines to record from microphone and stream to server
// using an errgroup.Group that returns if either one encounters an error.
eg, ctx := errgroup.WithContext(ctx)
eg.Go(func() error {
fmt.Printf("\n[INFO] recording %d seconds from microphone \n", maxDuration)
if err := cmd.Run(); err != nil {
return fmt.Errorf("record from microphone: %w", err)
}
return nil
})
// Starting verification.
resp, err := StreamingVerify(ctx, client, cfg, audio)
if err != nil {
fmt.Printf("failed to run streaming verification: %v\n", err)
os.Exit(1)
}
// Waiting for the recording goroutine to finish and surfacing any
// error from the rec command.
if err := eg.Wait(); err != nil {
fmt.Printf("microphone recording failed: %v\n", err)
}
// Server returns a similarity score along with whether the score
// exceeded the server-configured threshold for being a match.
fmt.Printf("Verification Score: %1.3f, Match: %v\n", resp.Result.SimilarityScore, resp.Result.IsMatch)
}
// StreamingVerify wraps the streaming API for performing speaker verification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingVerify(
ctx context.Context,
client voicebio.VoiceBioServiceClient,
cfg *voicebio.VerificationConfig,
audio io.Reader,
) (*voicebio.StreamingVerifyResponse, error) {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingVerify(ctx)
if err != nil {
return nil, err
}
// Sending audio.
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in the CloseAndRecv call. We
// therefore return on non-EOF errors here.
return nil, err
}
// Returning result.
return stream.CloseAndRecv()
}
// sendAudio sends the config and audio to a stream.
func sendAudio(
stream voicebio.VoiceBioService_StreamingVerifyClient,
cfg *voicebio.VerificationConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the config.
if err := stream.Send(&voicebio.StreamingVerifyRequest{
Request: &voicebio.StreamingVerifyRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&voicebio.StreamingVerifyRequest{
Request: &voicebio.StreamingVerifyRequest_Audio{
Audio: &voicebio.Audio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// the audio source. In either case, we need to CloseSend and
// then return the appropriate error (nil for io.EOF).
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}
2.2.6 - Streaming Identification
- The following example shows how to stream audio using VoiceBio’s
StreamingIdentify request and identify the speaker in the audio using the provided voiceprints. The stream can come from a file on disk or directly from a microphone in real time.
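The identification result reports a per-voiceprint similarity score and a best match index, which is negative when no voiceprint clears the match threshold. The selection semantics can be illustrated with a small stand-alone sketch (the threshold here is made up; the real one is configured on the server):

```python
def best_match(scores, threshold):
    """Return the index of the highest score if it clears the
    threshold, else -1 -- illustrating best_match_index semantics."""
    if not scores:
        return -1
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else -1

print(best_match([0.42, 0.91, 0.10], 0.7))  # index 1 clears the threshold
print(best_match([0.42, 0.51, 0.10], 0.7))  # nothing clears it: -1
```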
Info
If you want to compare against a large number of voiceprints in multiple batches, it will be more efficient to extract the voiceprint from the audio once using the StreamingEnroll request, and then compare
voiceprints directly without audio via the
CompareVoiceprints request.
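Voiceprints are fixed-length embeddings, which is why comparing two of them directly is far cheaper than re-streaming audio for every comparison. Purely as an illustration of that idea (the server's actual scoring function is internal and may differ), here is a cosine-style similarity over toy vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; used here
    only to illustrate why voiceprint comparisons are cheap."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

enrolled = [0.1, 0.8, 0.3]      # toy stand-in for an enrolled voiceprint
probe_same = [0.1, 0.8, 0.3]
probe_other = [0.9, -0.2, 0.1]
threshold = 0.7                 # made-up; the real threshold is server-configured
for name, v in [("same", probe_same), ("other", probe_other)]:
    score = cosine_similarity(enrolled, v)
    print(name, round(score, 3), score >= threshold)
```

Each comparison is a few vector operations, so batching thousands of them in CompareVoiceprints requests costs far less than streaming audio each time.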
Streaming from an audio file
-
We support several headered file formats, including WAV, MP3, and FLAC. For more details, please see the protocol buffer specification here. For best accuracy, we recommend an uncompressed or losslessly compressed audio format such as WAV or FLAC.
-
The examples below use a WAV file as input. We will query the server for available models and use the first model to score and identify the given audio against a given set of voiceprints.
Info
Voiceprints provided in StreamingIdentify requests must be generated using the
same or compatible model via StreamingEnroll.
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id
# Loading reference voiceprints.
voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
with open(p, 'r') as f:
voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))
# Set the identification config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = voicebio.IdentificationConfig(
model_id=modelID,
voiceprints=voiceprints,
)
# The first request to the server should only contain the
# configuration. Subsequent requests should contain audio
# bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield voicebio.StreamingIdentifyRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield voicebio.StreamingIdentifyRequest(audio=voicebio.Audio(data=data))
data = audio.read(bufferSize)
# Streaming audio to the server.
with open("test.wav", "rb") as audio:
result = client.StreamingIdentify(stream(cfg, audio))
# Server returns the index of the voiceprint that matches the best, a similarity
# score for each voiceprint along with whether the score exceeded the server-configured
# threshold for being a match.
#
# If none of the voiceprints were a good match, the best match index will be negative.
matched = "❌ No Match found"
if result.best_match_index >= 0:
best_score = result.voiceprint_comparison_results[result.best_match_index].similarity_score
matched = f"✅ Match found: Index: {result.best_match_index}, Score: {best_score:1.3f}"
print(f"\nIdentification Result:\n")
print("Scores:")
for i, r in enumerate(result.voiceprint_comparison_results):
print(f"Index: {i}, Score: {r.similarity_score:1.3f}, IsMatch: {r.is_match}")
print(f"\n{matched}")
package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Reading voiceprint data.
voiceprints := make([]*voicebio.Voiceprint, 0)
for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
data, err := os.ReadFile(p)
if err != nil {
fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
os.Exit(1)
}
voiceprints = append(voiceprints, &voicebio.Voiceprint{Data: string(data)})
}
// Selecting the first model.
cfg := &voicebio.IdentificationConfig{
ModelId: modelResp.Models[0].Id,
Voiceprints: voiceprints,
}
// Opening audio file.
audio, err := os.Open("test.wav")
if err != nil {
fmt.Printf("failed to open audio file: %v\n", err)
os.Exit(1)
}
defer audio.Close()
// Starting identification.
result, err := StreamingIdentify(ctx, client, cfg, audio)
if err != nil {
fmt.Printf("failed to run streaming identification: %v\n", err)
os.Exit(1)
}
// Server returns the index of the voiceprint that matches the best, a similarity
// score for each voiceprint along with whether the score exceeded the server-configured
// threshold for being a match.
//
// If none of the voiceprints were a good match, the best match index will be negative.
matched := "❌ No Match found"
if result.BestMatchIndex >= 0 {
bestScore := result.VoiceprintComparisonResults[result.BestMatchIndex].SimilarityScore
matched = fmt.Sprintf("✅ Match found: Index: %d, Score: %1.3f", result.BestMatchIndex, bestScore)
}
fmt.Printf("\nIdentification Result:\n")
fmt.Printf("Scores:\n")
for i, r := range result.VoiceprintComparisonResults {
fmt.Printf("Index: %d, Score: %1.3f, IsMatch: %v\n", i, r.SimilarityScore, r.IsMatch)
}
fmt.Printf("\n%s\n", matched)
}
// StreamingIdentify wraps the streaming API for performing speaker identification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingIdentify(
ctx context.Context,
client voicebio.VoiceBioServiceClient,
cfg *voicebio.IdentificationConfig,
audio io.Reader,
) (*voicebio.StreamingIdentifyResponse, error) {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingIdentify(ctx)
if err != nil {
return nil, err
}
// Sending audio.
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in the CloseAndRecv call. We
// therefore return on non-EOF errors here.
return nil, err
}
// Returning result.
return stream.CloseAndRecv()
}
// sendAudio sends the config and audio to a stream.
func sendAudio(
stream voicebio.VoiceBioService_StreamingIdentifyClient,
cfg *voicebio.IdentificationConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the config.
if err := stream.Send(&voicebio.StreamingIdentifyRequest{
Request: &voicebio.StreamingIdentifyRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&voicebio.StreamingIdentifyRequest{
Request: &voicebio.StreamingIdentifyRequest_Audio{
Audio: &voicebio.Audio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// the audio source. In either case, we need to CloseSend and
// then return the appropriate error (nil for io.EOF).
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}
Streaming from microphone
-
Streaming audio from a microphone requires a reader interface that can provide audio samples recorded from the microphone; typically this means interacting with system libraries. Another option is to use an external command-line tool like
sox to record and pipe audio into the client. -
The examples below take the latter approach, using the
rec command provided with sox to record and stream the audio.
#!/usr/bin/env python3
# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.
import subprocess
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id
# Loading reference voiceprints.
voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
with open(p, 'r') as f:
voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))
# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = voicebio.IdentificationConfig(
model_id=modelID,
voiceprints=voiceprints,
audio_format=voicebio.AudioFormat(
audio_format_raw=voicebio.AudioFormatRAW(
encoding="AUDIO_ENCODING_SIGNED",
bit_depth=16,
byte_order="BYTE_ORDER_LITTLE_ENDIAN",
sample_rate=m.attributes.sample_rate,
channels=1,
)
),
)
# Open microphone stream using sox's rec command and record
# audio using the config specified above for *10 seconds*.
maxDuration = 10
cmd = f"rec -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 - trim 0 {maxDuration}"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout
try:
_ = audio.read(1024) # Trying to read some bytes as sanity check.
except Exception as err:
print(f"[ERROR] failed to read audio from mic stream: {err}")
print(f"\n[INFO] recording {maxDuration} seconds of audio from the microphone ... \n")
# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
yield voicebio.StreamingIdentifyRequest(config=cfg)
data = audio.read(bufferSize)
while len(data) > 0:
yield voicebio.StreamingIdentifyRequest(audio=voicebio.Audio(data=data))
data = audio.read(bufferSize)
# Streaming audio to the server.
result = client.StreamingIdentify(stream(cfg, audio))
# Server returns the index of the voiceprint that matches the best, a similarity
# score for each voiceprint along with whether the score exceeded the server-configured
# threshold for being a match.
#
# If none of the voiceprints were a good match, the best match index will be negative.
matched = "❌ No Match found"
if result.best_match_index >= 0:
best_score = result.voiceprint_comparison_results[result.best_match_index].similarity_score
matched = f"✅ Match found: Index: {result.best_match_index}, Score: {best_score:1.3f}"
print(f"\nIdentification Result:\n")
print("Scores:")
for i, r in enumerate(result.voiceprint_comparison_results):
print(f"Index: {i}, Score: {r.similarity_score:1.3f}, IsMatch: {r.is_match}")
print(f"\n{matched}")
audio.close()
mic.kill()
package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"os/exec"
"strings"
"golang.org/x/sync/errgroup"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Selecting first model.
m := modelResp.Models[0]
// Reading voiceprint data.
voiceprints := make([]*voicebio.Voiceprint, 0)
for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
data, err := os.ReadFile(p)
if err != nil {
fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
os.Exit(1)
}
voiceprints = append(voiceprints, &voicebio.Voiceprint{Data: string(data)})
}
// Setting audio format to be raw 16-bit signed little endian audio samples
// recorded at the sample rate expected by the model.
cfg := &voicebio.IdentificationConfig{
ModelId: m.Id,
Voiceprints: voiceprints,
AudioFormat: &voicebio.AudioFormat{AudioFormat: &voicebio.AudioFormat_AudioFormatRaw{
AudioFormatRaw: &voicebio.AudioFormatRAW{
Encoding: voicebio.AudioEncoding_AUDIO_ENCODING_SIGNED,
SampleRate: m.Attributes.SampleRate,
BitDepth: 16,
ByteOrder: voicebio.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
Channels: 1,
},
},
},
}
// Open microphone stream using sox's rec command and record
// audio using the config specified above for *10 seconds*.
maxDuration := 10
args := fmt.Sprintf("-t raw -r %d -e signed -b 16 -L -c 1 - trim 0 %d", m.Attributes.SampleRate, maxDuration)
cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)
cmd.Stderr = os.Stderr
audio, err := cmd.StdoutPipe()
if err != nil {
fmt.Printf("failed to open microphone stream: %v\n", err)
os.Exit(1)
}
// Starting routines to record from microphone and stream to server
// using an errgroup.Group that returns if either one encounters an error.
eg, ctx := errgroup.WithContext(ctx)
eg.Go(func() error {
fmt.Printf("\n[INFO] recording %d seconds from microphone \n", maxDuration)
if err := cmd.Run(); err != nil {
return fmt.Errorf("record from microphone: %w", err)
}
return nil
})
// Starting identification.
result, err := StreamingIdentify(ctx, client, cfg, audio)
if err != nil {
fmt.Printf("failed to run streaming identification: %v\n", err)
os.Exit(1)
}
// Waiting for the recording goroutine to finish and surfacing any
// error from the rec command.
if err := eg.Wait(); err != nil {
fmt.Printf("microphone recording failed: %v\n", err)
}
// Server returns the index of the voiceprint that matches the best, a similarity
// score for each voiceprint along with whether the score exceeded the server-configured
// threshold for being a match.
//
// If none of the voiceprints were a good match, the best match index will be negative.
matched := "❌ No Match found"
if result.BestMatchIndex >= 0 {
bestScore := result.VoiceprintComparisonResults[result.BestMatchIndex].SimilarityScore
matched = fmt.Sprintf("✅ Match found: Index: %d, Score: %1.3f", result.BestMatchIndex, bestScore)
}
fmt.Printf("\nIdentification Result:\n")
fmt.Printf("Scores:\n")
for i, r := range result.VoiceprintComparisonResults {
fmt.Printf("Index: %d, Score: %1.3f, IsMatch: %v\n", i, r.SimilarityScore, r.IsMatch)
}
fmt.Printf("\n%s\n", matched)
}
// StreamingIdentify wraps the streaming API for performing speaker identification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingIdentify(
ctx context.Context,
client voicebio.VoiceBioServiceClient,
cfg *voicebio.IdentificationConfig,
audio io.Reader,
) (*voicebio.StreamingIdentifyResponse, error) {
const (
streamingBufSize = 1024
)
// Creating stream.
stream, err := client.StreamingIdentify(ctx)
if err != nil {
return nil, err
}
// Sending audio.
if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
// if sendAudio encountered io.EOF, it's only a
// notification that the stream has closed. The actual
// status will be obtained in the CloseAndRecv call. We
// therefore return on non-EOF errors here.
return nil, err
}
// Returning result.
return stream.CloseAndRecv()
}
// sendAudio sends the config and audio to a stream.
func sendAudio(
stream voicebio.VoiceBioService_StreamingIdentifyClient,
cfg *voicebio.IdentificationConfig,
audio io.Reader,
bufSize uint32,
) error {
// The first message needs to be a config message, and all subsequent
// messages must be audio messages.
// Send the config.
if err := stream.Send(&voicebio.StreamingIdentifyRequest{
Request: &voicebio.StreamingIdentifyRequest_Config{Config: cfg},
}); err != nil {
// if this failed, we don't need to CloseSend
return err
}
// Stream the audio.
buf := make([]byte, bufSize)
for {
n, err := audio.Read(buf)
if n > 0 {
if err2 := stream.Send(&voicebio.StreamingIdentifyRequest{
Request: &voicebio.StreamingIdentifyRequest_Audio{
Audio: &voicebio.Audio{Data: buf[:n]},
},
}); err2 != nil {
// if we couldn't Send, the stream has
// encountered an error and we don't need to
// CloseSend.
return err2
}
}
if err != nil {
// err could be io.EOF, or some other error reading from
// audio. In either case, we need to CloseSend and then
// return the appropriate error from the function.
if err2 := stream.CloseSend(); err2 != nil {
return err2
}
if err != io.EOF {
return err
}
return nil
}
}
}
2.2.7 - Comparing Voiceprints
The CompareVoiceprints API.
- The CompareVoiceprints endpoint allows the user to compare pre-extracted voiceprints and get similarity scores and match results without needing to send audio data.
- This is useful in cases where the user wants to compare a given voiceprint against a large number of other voiceprints, and sending audio data for each comparison would be inefficient. The client can enroll the voiceprint once using the StreamingEnroll method, and then use this method to compare it against a large number of other voiceprints in batches.
- The following example shows how to compare pre-extracted voiceprints using VoiceBio’s CompareVoiceprints API, without streaming audio. The voiceprints can be loaded from files on disk or obtained from previous enrollment sessions.
Info
Voiceprints provided in CompareVoiceprints requests must be generated using the
same or a compatible model via StreamingEnroll.
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example. The model ID should be the same as the one used to
# generate the voiceprints being compared.
modelID = modelResp.models[0].id
# Loading reference voiceprints.
reference_voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
with open(p, 'r') as f:
reference_voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))
# Load the target voiceprint that we want to compare against the reference voiceprints.
with open("unknown.bin", 'r') as f:
target_voiceprint = voicebio.Voiceprint(data=f.read().strip())
# Set the comparison config.
req = voicebio.CompareVoiceprintsRequest(
model_id=modelID,
target_voiceprint=target_voiceprint,
reference_voiceprints=reference_voiceprints,
)
# Compare voiceprints.
result = client.CompareVoiceprints(req)
# Server returns the index of the voiceprint that matches the best, a similarity
# score for each voiceprint along with whether the score exceeded the server-configured
# threshold for being a match.
#
# If none of the voiceprints were a good match, the best match index will be negative.
matched = "❌ No Match found"
if result.best_match_index >= 0:
best_score = result.voiceprint_comparison_results[result.best_match_index].similarity_score
matched = f"✅ Match found: Index: {result.best_match_index}, Score: {best_score:1.3f}"
print(f"\nComparison Result:\n")
print("Scores:")
for i, r in enumerate(result.voiceprint_comparison_results):
print(f"Index: {i}, Score: {r.similarity_score:1.3f}, IsMatch: {r.is_match}")
print(f"\n{matched}")
package main
import (
"context"
"fmt"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Reading voiceprint data.
reference_voiceprints := make([]*voicebio.Voiceprint, 0)
for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
data, err := os.ReadFile(p)
if err != nil {
fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
os.Exit(1)
}
reference_voiceprints = append(reference_voiceprints, &voicebio.Voiceprint{Data: string(data)})
}
// Load the target voiceprint that we want to compare against the reference voiceprints.
data, err := os.ReadFile("unknown.bin")
if err != nil {
fmt.Printf("failed to read target voiceprint data: %v\n", err)
os.Exit(1)
}
target_voiceprint := &voicebio.Voiceprint{Data: string(data)}
// Selecting the first model. The model ID should be the same as the one used to generate the
// voiceprints being compared.
req := &voicebio.CompareVoiceprintsRequest{
ModelId: modelResp.Models[0].Id,
TargetVoiceprint: target_voiceprint,
ReferenceVoiceprints: reference_voiceprints,
}
// Compare voiceprints.
result, err := client.CompareVoiceprints(ctx, req)
if err != nil {
fmt.Printf("failed to compare voiceprints: %v\n", err)
os.Exit(1)
}
// Server returns the index of the voiceprint that matches the best, a similarity
// score for each voiceprint along with whether the score exceeded the server-configured
// threshold for being a match.
//
// If none of the voiceprints were a good match, the best match index will be negative.
matched := "❌ No Match found"
if result.BestMatchIndex >= 0 {
bestScore := result.VoiceprintComparisonResults[result.BestMatchIndex].SimilarityScore
matched = fmt.Sprintf("✅ Match found: Index: %d, Score: %1.3f", result.BestMatchIndex, bestScore)
}
fmt.Printf("\nComparison Result:\n")
fmt.Printf("Scores:\n")
for i, r := range result.VoiceprintComparisonResults {
fmt.Printf("Index: %d, Score: %1.3f, IsMatch: %v\n", i, r.SimilarityScore, r.IsMatch)
}
fmt.Printf("\n%s\n", matched)
}
2.2.8 - Vectorizing Voiceprints
The VectorizeVoiceprints API.
- Voiceprints can also be vectorized using the VectorizeVoiceprints API, which returns a vector representation of each voiceprint that can be used for downstream tasks such as clustering, custom scoring, other machine learning models, or even semantic search in vector databases.
- See the API reference for more details.
- The following example shows how to use the VectorizeVoiceprints API to vectorize voiceprints. The voiceprints can be loaded from files on disk or obtained from previous enrollment sessions.
Info
Voiceprints provided in VectorizeVoiceprints requests must be generated using the
same or a compatible model via StreamingEnroll.
import numpy as np
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)
# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())
print("Models:")
for model in modelResp.models:
print(model)
# Select a model ID from the list above. Going with the first model
# in this example. The model ID should be the same as the one used to
# generate the voiceprints being vectorized.
modelID = modelResp.models[0].id
# Loading voiceprints.
voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
with open(p, 'r') as f:
voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))
# Set the vectorization config.
req = voicebio.VectorizeVoiceprintsRequest(
model_id=modelID,
voiceprints=voiceprints,
)
# Vectorize voiceprints.
result = client.VectorizeVoiceprints(req)
# The server returns a list of vectorized voiceprints in the same order as the input voiceprints.
#
# In most cases, the vectorized voiceprints can be compared using simple distance metrics such as
# cosine similarity or euclidean distance. This is not guaranteed, however, and depends on the model
# used to generate the voiceprints and vectorize them.
# Example using cosine similarity.
n = len(result.voiceprints)
similarity = np.zeros((n, n), dtype=np.float32)
for i, vi in enumerate(result.voiceprints):
for j, vj in enumerate(result.voiceprints):
similarity[i, j] = np.dot(vi.data, vj.data) / (np.linalg.norm(vi.data) * np.linalg.norm(vj.data))
print("Cosine Similarity Matrix:")
print(similarity)
package main
import (
"context"
"fmt"
"math"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicebio.NewVoiceBioServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Reading voiceprint data.
voiceprints := make([]*voicebio.Voiceprint, 0)
for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
data, err := os.ReadFile(p)
if err != nil {
fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
os.Exit(1)
}
voiceprints = append(voiceprints, &voicebio.Voiceprint{Data: string(data)})
}
// Selecting the first model. The model ID should be the same as the one used to generate the
// voiceprints being vectorized.
req := &voicebio.VectorizeVoiceprintsRequest{
ModelId: modelResp.Models[0].Id,
Voiceprints: voiceprints,
}
// Vectorize voiceprints.
result, err := client.VectorizeVoiceprints(ctx, req)
if err != nil {
fmt.Printf("failed to vectorize voiceprints: %v\n", err)
os.Exit(1)
}
// The server returns a list of vectorized voiceprints in the same order as the input voiceprints.
//
// In most cases, the vectorized voiceprints can be compared using simple distance metrics such as
// cosine similarity or euclidean distance. This is not guaranteed, however, and depends on the model
// used to generate the voiceprints and vectorize them.
// Example using cosine similarity.
n := len(result.Voiceprints)
similarity := make([][]float32, n)
for i := range similarity {
similarity[i] = make([]float32, n)
}
for i, vi := range result.Voiceprints {
for j, vj := range result.Voiceprints {
dotProduct := float32(0.0)
normVi := float32(0.0)
normVj := float32(0.0)
for k := range vi.Data {
dotProduct += vi.Data[k] * vj.Data[k]
normVi += vi.Data[k] * vi.Data[k]
normVj += vj.Data[k] * vj.Data[k]
}
denom := float32(math.Sqrt(float64(normVi)) * math.Sqrt(float64(normVj)))
similarity[i][j] = dotProduct / denom
}
}
fmt.Printf("Cosine Similarity Matrix:\n")
for i := range similarity {
for j := range similarity[i] {
fmt.Printf("%1.3f ", similarity[i][j])
}
fmt.Println()
}
}
2.2.9 - API Reference
The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.
This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.
VoiceBioService
Service that implements the Cobalt VoiceBio API.
Version
Version(VersionRequest) VersionResponse
Returns version information from the server.
ListModels
ListModels(ListModelsRequest) ListModelsResponse
Returns information about the models available on the server.
StreamingEnroll
StreamingEnroll(StreamingEnrollRequest) StreamingEnrollResponse
Uses new audio data to perform enrollment of new users, or to update enrollment of existing users. Returns a new or updated voiceprint.
Clients should store the returned voiceprint against the ID of the user that provided the audio. This voiceprint can be provided later, with the Verify or Identify requests to match new audio against known speakers.
If this call is used to update an existing user’s voiceprint, the old voiceprint can be discarded and only the new one needs to be stored for that user.
StreamingVerify
StreamingVerify(StreamingVerifyRequest) StreamingVerifyResponse
Compares audio data against the provided voiceprint and verifies whether or not the audio matches against the voiceprint.
StreamingIdentify
StreamingIdentify(StreamingIdentifyRequest) StreamingIdentifyResponse
Compares audio data against the provided list of voiceprints and identifies which (or none) of the voiceprints is a match for the given audio.
VectorizeVoiceprints
VectorizeVoiceprints(VectorizeVoiceprintsRequest) VectorizeVoiceprintsResponse
Converts the given voiceprints into numerical vector representations that can be used for various downstream tasks such as clustering, visualization, or as input features for other machine learning models. The specific format and dimensionality of these vectors may vary depending on the model used.
CompareVoiceprints
CompareVoiceprints(CompareVoiceprintsRequest) CompareVoiceprintsResponse
Compares pre-extracted voiceprints and returns similarity scores and match
results without needing to send audio data. This is useful in cases where
the user wants to compare a given voiceprint against a large number of
other voiceprints, and sending audio data for each comparison would be
inefficient. The client can enroll the voiceprint once using the
StreamingEnroll method, and then use this method to compare it against a
large number of other voiceprints in batches.
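The batching workflow described here can be sketched on the client side. The batches helper below is purely illustrative (it is not part of the VoiceBio SDK); each slice it yields would populate the reference_voiceprints field of one CompareVoiceprintsRequest.

```python
def batches(references, batch_size):
    """Split a large gallery of reference voiceprints into slices of at
    most batch_size items. Each slice would become the
    reference_voiceprints field of one CompareVoiceprints call."""
    for i in range(0, len(references), batch_size):
        yield references[i:i + batch_size]

# A hypothetical gallery of 2500 stored voiceprints, compared in
# three calls of at most 1000 references each.
gallery = ["vp%04d" % i for i in range(2500)]
sizes = [len(b) for b in batches(gallery, 1000)]
print(sizes)  # [1000, 1000, 500]
```

The batch size is a client-side choice; smaller batches keep individual requests small, at the cost of more round trips.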
Messages
- If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of those fields populated.
- If a field is labeled repeated, then the generated code will accept an array (or slice, or list, depending on the language).
Audio
Audio to be sent to VoiceBio.
Fields
- data (bytes )
AudioFormat
Format of the audio to be sent for recognition.
Depending on how they are configured, server instances of this service may not support all the formats provided in the API. One format that is guaranteed to be supported is the RAW format with little-endian 16-bit signed samples with the sample rate matching that of the model being requested.
Fields
-
oneof audio_format.audio_format_raw (AudioFormatRAW ) Audio is raw data without any headers
-
oneof audio_format.audio_format_headered (AudioFormatHeadered ) Audio has a self-describing header. Headers are expected to be sent at the beginning of the entire audio file/stream, and not in every Audio message. The default value of this type is AUDIO_FORMAT_HEADERED_UNSPECIFIED. If this value is used, the server may attempt to detect the format of the audio. However, it is recommended that the exact format be specified.
AudioFormatRAW
Details of audio in raw format
Fields
-
encoding (AudioEncoding ) Encoding of the samples. It must be specified explicitly and using the default value of
AUDIO_ENCODING_UNSPECIFIED will result in an error. -
bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.). This is a required field.
-
byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than
BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8. -
sample_rate (uint32 ) Sampling rate in Hz. This is a required field.
-
channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc. This is a required field.
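As a concrete illustration of the one raw format every server is guaranteed to accept (little-endian 16-bit signed samples at the model's sample rate), the sketch below packs float samples into raw PCM bytes. The pcm16le_bytes helper is our own, not part of the SDK; the matching AudioFormatRAW config would set encoding to AUDIO_ENCODING_SIGNED, bit_depth to 16, byte_order to BYTE_ORDER_LITTLE_ENDIAN, and channels to 1.

```python
import struct

def pcm16le_bytes(samples):
    """Pack floats in [-1.0, 1.0] into raw little-endian 16-bit signed
    PCM bytes, clamping values outside the valid sample range."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

# One second of mono silence at an assumed 16 kHz model rate:
# 16000 samples x 2 bytes per sample = 32000 bytes.
chunk = pcm16le_bytes([0.0] * 16000)
print(len(chunk))  # 32000
```

The resulting bytes would go into the data field of Audio messages, sent after the config message in a streaming call.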
CompareVoiceprintsRequest
The top level message sent by the client for the CompareVoiceprints method.
This is similar to StreamingIdentifyRequest, but operates on pre-extracted
voiceprints without sending any audio data.
Fields
-
model_id (string ) ID of the model to use for comparison. The model used for comparison must match with the model used for enrollment of the voiceprints. A list of supported IDs can be found using the
ListModels call. -
target_voiceprint (Voiceprint ) The voiceprint to compare against the reference voiceprints.
-
reference_voiceprints (Voiceprint repeated) Voiceprints that should be compared against the target voiceprint.
CompareVoiceprintsResponse
The message returned by the server for the CompareVoiceprints method. This
contains the similarity scores and match results for comparing the target
voiceprint against each of the reference voiceprints, as well as the index of
the best matching voiceprint in the reference list, if any of them is a
match. This is similar to StreamingIdentifyResponse, but operates on
pre-extracted voiceprints without sending any audio data.
Fields
-
best_match_index (int32 ) Index (0-based) of the best matching voiceprint in the list of reference voiceprints provided in the
CompareVoiceprintsRequest message. If none of the voiceprints was a match, a negative value is returned. -
voiceprint_comparison_results (VoiceprintComparisonResult repeated) Result of comparing the target voiceprint against each of the reference voiceprints. The order of this list is the same as the reference voiceprint list provided in the
CompareVoiceprintsRequest message.
EnrollmentConfig
Configuration for Enrollment of speakers.
Fields
-
model_id (string ) ID of the model to use for enrollment. A list of supported IDs can be found using the
ListModels call. -
audio_format (AudioFormat ) Format of the audio to be sent for enrollment.
-
previous_voiceprint (Voiceprint ) Empty string for new users. For re-enrolling existing users with new audio data, set this to that user’s previous voiceprint. The previous voiceprint needs to have been generated using the same model as specified in this config.
EnrollmentStatus
The message returned as part of StreamingEnrollResponse, to provide information about whether the voiceprint is sufficiently trained.
Fields
-
enrollment_complete (bool ) Whether sufficient data has been provided as part of this user’s enrollment. If this is false, more audio should be collected from the user and re-enrollment should be done. If this is true, it is still OK to enroll more data for the same user to update the voiceprint.
-
additional_audio_required_seconds (uint32 ) If enrollment is not yet complete, how many more seconds of user’s speech are required to complete the enrollment. If enrollment is completed successfully, this value will be set to 0.
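A typical client loops on these two fields until enrollment is complete. The sketch below shows that loop under an assumption: enroll_round is a hypothetical callback standing in for recording that much speech and making one StreamingEnroll round trip; it is not part of the SDK.

```python
def enroll_until_complete(enroll_round, initial_seconds=10, max_rounds=5):
    """Collect audio until the server reports enrollment_complete.
    `enroll_round(seconds)` is a hypothetical stand-in for one
    StreamingEnroll round trip and returns
    (enrollment_complete, additional_audio_required_seconds)."""
    seconds = initial_seconds
    for _ in range(max_rounds):
        complete, remaining = enroll_round(seconds)
        if complete:
            return True
        # Next round, record at least the amount the server asked for.
        seconds = max(remaining, 1)
    return False

# Fake server for illustration: enrollment completes once 15 total
# seconds of audio have been provided.
state = {"total": 0}
def fake_round(seconds):
    state["total"] += seconds
    remaining = max(15 - state["total"], 0)
    return remaining == 0, remaining

print(enroll_until_complete(fake_round))  # True
```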
IdentificationConfig
Configuration for Identification of a speaker.
Fields
-
model_id (string ) ID of the model to use for identification. A list of supported IDs can be found using the
ListModels call. The model used for identification must match with the model used for enrollment. -
audio_format (AudioFormat ) Format of the audio to be sent for identification.
-
voiceprints (Voiceprint repeated) Voiceprints of potential speakers that need to be identified in the given audio.
ListModelsRequest
The top-level message sent by the client for the ListModels method.
ListModelsResponse
The message returned to the client by the ListModels method.
Fields
- models (Model repeated) List of models available for use that match the request.
Model
Description of a VoiceBio model.
Fields
-
id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for enrollment, verification or identification requests. This ID needs to be specified in the respective config messages for these requests.
-
name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their voicebio task.
-
attributes (ModelAttributes ) Model Attributes
ModelAttributes
Attributes of a VoiceBio model
Fields
- sample_rate (uint32 ) Audio sample rate (native) supported by the model
StreamingEnrollRequest
The top level messages sent by the client for the StreamingEnroll method.
In this streaming call, multiple StreamingEnrollRequest messages should be
sent. The first message must contain a EnrollmentConfig message, and all
subsequent messages must contain Audio only. All Audio messages must
contain non-empty audio. If audio content is empty, the server may choose to
interpret it as end of stream and stop accepting any further messages.
Fields
-
oneof request.config (EnrollmentConfig )
-
oneof request.audio (Audio )
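The config-first ordering can be expressed as a generator, the usual shape of a client-streaming gRPC call in Python. This is a sketch, not SDK code: messages are modeled as plain dicts, where real code would build StreamingEnrollRequest protobufs.

```python
def request_stream(config, audio_chunks):
    """Yield one config message followed by audio-only messages, the
    ordering StreamingEnroll (and the other streaming calls) require.
    Dicts stand in for StreamingEnrollRequest protobufs."""
    yield {"config": config}
    for chunk in audio_chunks:
        # Never send empty audio: the server may treat it as end of stream.
        if chunk:
            yield {"audio": chunk}

msgs = list(request_stream({"model_id": "1"}, [b"abc", b"", b"def"]))
print([list(m.keys())[0] for m in msgs])  # ['config', 'audio', 'audio']
```

With the real SDK, a generator like this would be passed directly to the stub's StreamingEnroll method.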
StreamingEnrollResponse
The message returned by the server for the StreamingEnroll method.
Fields
-
voiceprint (Voiceprint )
-
enrollment_status (EnrollmentStatus )
StreamingIdentifyRequest
The top level messages sent by the client for the StreamingIdentify method.
In this streaming call, multiple StreamingIdentifyRequest messages should
be sent. The first message must contain a IdentificationConfig message, and
all subsequent messages must contain Audio only. All Audio messages must
contain non-empty audio. If audio content is empty, the server may choose to
interpret it as end of stream and stop accepting any further messages.
Fields
-
oneof request.config (IdentificationConfig )
-
oneof request.audio (Audio )
StreamingIdentifyResponse
The message returned by the server for the StreamingIdentify method.
Fields
-
best_match_index (int32 ) Index (0-based) of the best matching voiceprint in the list of input voiceprints provided in the
IdentificationConfig message. If none of the voiceprints was a match, a negative value is returned. -
voiceprint_comparison_results (VoiceprintComparisonResult repeated) Result of comparing the given audio against each of the input voiceprints. The order of this list is the same as the input voiceprint list provided in the
IdentificationConfig message.
StreamingVerifyRequest
The top level messages sent by the client for the StreamingVerify method.
In this streaming call, multiple StreamingVerifyRequest messages should be
sent. The first message must contain a VerificationConfig message, and all
subsequent messages must contain Audio only. All Audio messages must
contain non-empty audio. If audio content is empty, the server may choose to
interpret it as end of stream and stop accepting any further messages.
Fields
-
oneof request.config (VerificationConfig )
-
oneof request.audio (Audio )
StreamingVerifyResponse
The message returned by the server for the StreamingVerify method.
Fields
- result (VoiceprintComparisonResult )
VectorVoiceprint
Voiceprint represented in vector form. The specific format and dimensionality
of this vector may vary depending on the model used. The VectorizeVoiceprints
method can be used to convert a Voiceprint into a VectorVoiceprint
representation.
Fields
- data (float repeated) List of floating point values representing the voiceprint in vector form.
VectorizeVoiceprintsRequest
The top level message sent by the client for the VectorizeVoiceprints method.
Fields
-
model_id (string ) ID of the model to use for vectorization. The model used for vectorization must match with the model used for enrollment of the voiceprints. A list of supported IDs can be found using the
ListModels call. -
voiceprints (Voiceprint repeated) Voiceprints to be vectorized.
VectorizeVoiceprintsResponse
The message returned by the server for the VectorizeVoiceprints method.
Fields
-
voiceprints (VectorVoiceprint repeated) Voiceprint data converted into a vector representation, which can be used for various downstream tasks such as clustering, visualization, or as input features for other machine learning models. The specific format and dimensionality of these vectors may vary depending on the model used.
The order of this list is the same as the input voiceprint list provided in the
VectorizeVoiceprintsRequest message.
VerificationConfig
Configuration for Verification of a speaker.
Fields
-
model_id (string ) ID of the model to use for verification. A list of supported IDs can be found using the
ListModels call. The model used for verification must match with the model used for enrollment. -
audio_format (AudioFormat ) Format of the audio to be sent for verification.
-
voiceprint (Voiceprint ) Voiceprint with which audio should be compared.
VersionRequest
The top-level message sent by the client for the Version method.
VersionResponse
The message sent by the server for the Version method.
Fields
- version (string ) Version of the server handling these requests.
Voiceprint
Voiceprint extracted from user’s audio.
Fields
- data (string ) Voiceprint data serialized to a string.
VoiceprintComparisonResult
Message describing the result of comparing a voiceprint against given audio.
Fields
-
is_match (bool ) Whether or not the audio successfully matches with the provided voiceprint.
-
similarity_score (float ) Similarity score representing how closely the audio matched against the voiceprint. This score could be any negative or positive number. Lower value suggests that the audio and voiceprints are less similar, whereas a higher value indicates more similarity. The
is_match field can be used to actually decide if the result should be considered a valid match.
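To make the best_match_index semantics concrete, the sketch below reimplements them client-side: the highest-scoring result whose is_match flag is set wins, and a negative value means no match. This is purely illustrative; the server already returns best_match_index for you.

```python
def best_match_index(results):
    """Return the 0-based index of the highest-scoring result whose
    is_match flag is set, or -1 when no result qualifies.
    `results` holds (similarity_score, is_match) pairs, one per
    reference voiceprint."""
    best = -1
    for i, (score, is_match) in enumerate(results):
        if is_match and (best < 0 or score > results[best][0]):
            best = i
    return best

print(best_match_index([(0.21, False), (0.83, True), (0.54, True)]))  # 1
print(best_match_index([(0.95, False)]))  # -1
```

Note that a high score alone is not enough: the second call returns -1 because is_match was false despite the 0.95 score, matching the guidance that is_match decides validity.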
Enums
AudioEncoding
The encoding of the audio data to be sent for recognition.
| Name | Number | Description |
|---|---|---|
| AUDIO_ENCODING_UNSPECIFIED | 0 | AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error. |
| AUDIO_ENCODING_SIGNED | 1 | PCM signed-integer |
| AUDIO_ENCODING_UNSIGNED | 2 | PCM unsigned-integer |
| AUDIO_ENCODING_IEEE_FLOAT | 3 | PCM IEEE-Float |
| AUDIO_ENCODING_ULAW | 4 | G.711 mu-law |
| AUDIO_ENCODING_ALAW | 5 | G.711 a-law |
AudioFormatHeadered
| Name | Number | Description |
|---|---|---|
| AUDIO_FORMAT_HEADERED_UNSPECIFIED | 0 | AUDIO_FORMAT_HEADERED_UNSPECIFIED is the default value of this type. |
| AUDIO_FORMAT_HEADERED_WAV | 1 | WAV with RIFF headers |
| AUDIO_FORMAT_HEADERED_MP3 | 2 | MP3 format with a valid frame header at the beginning of data |
| AUDIO_FORMAT_HEADERED_FLAC | 3 | FLAC format |
| AUDIO_FORMAT_HEADERED_OGG_OPUS | 4 | Opus format with OGG header |
ByteOrder
Byte order of multi-byte data
| Name | Number | Description |
|---|---|---|
| BYTE_ORDER_UNSPECIFIED | 0 | BYTE_ORDER_UNSPECIFIED is the default value of this type. |
| BYTE_ORDER_LITTLE_ENDIAN | 1 | Little Endian byte order |
| BYTE_ORDER_BIG_ENDIAN | 2 | Big Endian byte order |
Scalar Value Types
2.2.10 -

Cobalt VoiceBio SDK – Cobalt
3.1 - VoiceGen
3.1.1 - Getting Started
Using Cobalt VoiceGen
- A typical VoiceGen release, provided as a compressed archive, will contain a Linux binary (voicegen-server) for the required native CPU architecture, an appropriate Dockerfile, and models.
- Cobalt VoiceGen runs either locally on Linux or in Docker.
- Cobalt VoiceGen will serve the gRPC API on port 2727. A web demo will be enabled on port 8080.
- To quickly try out VoiceGen, first start the server as shown below and open the web demo at http://localhost:8080 in your browser to input text and play / download synthesized audio. You can also use the SDK in your preferred language to use VoiceGen from the command line or within your application.
Info
The cobalt.license.key file is provided separately and must be copied into
the directory resulting from decompressing the archive. Please do this before
running the steps below.
Running VoiceGen Server Locally on Linux
./voicegen-server
- By default, the binary assumes the presence of a configuration file, located
in the same directory, named voicegen-server.cfg.toml. A different config file may be specified using the --config argument.
Running VoiceGen Server as a Docker Container
To build and run the Docker image for VoiceGen, run:
docker build -t cobalt-voicegen .
docker run -p 2727:2727 -p 8080:8080 cobalt-voicegen
How to Get a Copy of the VoiceGen Server and Models
Contact us for getting a release best suited to your requirements.
The release you will receive is a compressed archive (tar.bz2) and is generally structured as follows:
release.tar.bz2
├── COPYING
├── README.md
├── voicegen-server
├── voicegen-server.cfg.toml
├── Dockerfile
├── models
│ └── en_US-multispeaker-22050hz
│
└── cobalt.license.key [ provided separately, needs to be copied over ]
- The README.md file contains information about this release and instructions for how to start the server on your system.
- The voicegen-server is the server program, which is configured using the voicegen-server.cfg.toml file.
- The Dockerfile can be used to create a container that will let you run VoiceGen server on non-Linux systems such as macOS and Windows.
- The models directory contains the speech synthesis models. The content of this directory will depend on the models you are provided.
System Requirements
Cobalt VoiceGen runs on Linux. You can run it directly as a Linux application.
You can evaluate the product on Windows or Linux using Docker Desktop but we would not recommend this setup for use in a production environment.
A Cobalt VoiceGen release typically includes a single model together with binaries and config files. VoiceGen models may take up to 250MB of disk space, and need a minimum of 2GB RAM when evaluating locally. For production workloads, we recommend configuring containerized applications with each instance allocated with 4 CPUs and 4GB RAM.
Cobalt VoiceGen runs on x86_64 CPUs. We also support Arm64 CPUs, including processors such as the Graviton (AWS c7g EC2 instances). VoiceGen is significantly more cost-effective to run on C7g instances compared to similarly sized Intel or AMD processors, and we can provide you with an Arm64 release on request.
To integrate Cobalt VoiceGen into your application, please follow the next steps to install or generate the SDK in a language of your choice.
3.1.2 - Generating SDKs
- APIs for all Cobalt’s services are defined as a protocol buffer specification, or simply a proto file, and can be found in the cobaltspeech/proto GitHub repository.
- The proto file allows a developer to auto-generate client SDKs for a number of different programming languages. Step-by-step instructions for generating your own SDK can be found below.
- We provide pre-generated SDKs for a couple of languages. You can choose to use these instead of generating your own. These are listed here along with instructions on how to install / import them into your projects.
Pre-generated SDKs
Golang
- Pre-generated SDK files for Golang can be found in the cobaltspeech/go-genproto repo.
- To use it in your Go project, simply import it:
import voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
- An example client using the above repo can be found here.
Python
- Pre-generated SDK files for Python can be found in the cobaltspeech/py-genproto repo.
- The Python SDK depends on Python >= 3.5. You may use pip to perform a system-wide install, or use virtualenv for a local install. To use it in your Python project, install it:
pip install --upgrade pip
pip install "git+https://github.com/cobaltspeech/py-genproto"
Generating SDKs
Step 1. Installing buf
- To work with proto files, we recommend using buf, a user-friendly command line tool that can be configured to generate documentation, schemas, and SDK code for different languages.
# Latest version as of March 14th, 2023.
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/bin"
VERSION="1.15.1"
URL="https://github.com/bufbuild/buf/releases/download/v${VERSION}/buf-$(uname -s)-$(uname -m)"
curl -L ${URL} -o "${COBALT}/bin/buf"
# Give executable permissions and add to $PATH.
chmod +x "${COBALT}/bin/buf"
export PATH="${PATH}:${COBALT}/bin"

# Alternatively, on macOS, install buf via Homebrew:
brew install bufbuild/buf/buf
Step 2. Getting proto files
- Clone the cobaltspeech/proto repository:
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"
# Change this to where you want to clone the repo to.
PROTO_REPO="${COBALT}/git/proto"
git clone https://github.com/cobaltspeech/proto "${PROTO_REPO}"
Step 3. Generating code
-
The cobaltspeech/proto repo provides a buf.gen.yaml config file to get you started with a couple of languages. -
Other plugins can be added to the buf.gen.yaml file to generate SDK code for more languages. -
To generate the SDKs, simply run the following (assuming the buf binary is in your $PATH):
cd "${PROTO_REPO}"
# Removing any previously generated files.
rm -rf ./gen
# Generating code for all proto files inside the `proto` directory.
buf generate proto
- You should now have a folder called gen inside ${PROTO_REPO} that contains the generated code. The latest version of the VoiceGen API is v1. You can import / include / copy the generated files into your projects as per the conventions of different languages.
gen
├── ... other languages ...
└── py
└── cobaltspeech
├── ... other services ...
└── voicegen
└── v1
├── voicegen_pb2_grpc.py
├── voicegen_pb2.py
└── voicegen_pb2.pyi
gen
├── ... other languages ...
└── go
├── cobaltspeech
│ ├── ...
│ └── voicegen
│ └── v1
│ ├── voicegen_grpc.pb.go
│ └── voicegen.pb.go
└── gw
└── cobaltspeech
├── ...
└── voicegen
└── v1
└── voicegen.pb.gw.go
Step 4. Installing gRPC and protobuf
- A couple of gRPC and protobuf dependencies are required along with the code generated above. The method of installing them depends on the programming language being used.
- These dependencies and the most common ways of installing / including them are listed below for some chosen languages.
# It is encouraged to do this inside a Python virtual environment
# to avoid creating version conflicts for other scripts that may
# be using these libraries.
pip install --upgrade protobuf
pip install --upgrade grpcio
pip install --upgrade google-api-python-client

# For Go projects:
go get google.golang.org/protobuf
go get google.golang.org/grpc
go get google.golang.org/genproto

# For C++ projects: more details on grpc installation can be found at:
# https://grpc.io/docs/languages/cpp/quickstart/
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"
# Latest version as of 14th March, 2023.
VERSION="v1.52.0"
GRPC_REPO="${COBALT}/git/grpc-${VERSION}"
git clone \
--recurse-submodules --depth 1 --shallow-submodules \
-b "${VERSION}" \
https://github.com/grpc/grpc ${GRPC_REPO}
cd "${GRPC_REPO}"
mkdir -p cmake/build
# Change this to where you want to install libprotobuf and libgrpc.
# It is encouraged to install gRPC locally as there is no easy way to
# uninstall gRPC after you’ve installed it globally.
INSTALL_DIR="${COBALT}"
cd cmake/build
cmake \
-DgRPC_INSTALL=ON \
-DgRPC_BUILD_TESTS=OFF \
-DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} \
../..
make -j
make install
3.1.3 - Connecting to the Server
-
Once you have your VoiceGen server up and running, and have installed or generated the SDK for your project, you can connect to a running instance of VoiceGen server by “dialing” a gRPC connection. -
First, you need to know the address where the server is running, e.g. host:grpc_port. By default, this is localhost:2727 and should be logged to the terminal when you first start VoiceGen server as grpcAddr:
2023/08/14 10:49:38 info {"license":"Copyright © 2015--present. Cobalt Speech and Language, Inc. For additional details, including information about open source components used in this software, please see the COPYING file bundled with this program."}
2023/08/14 10:49:38 info {"msg":"reading config file","path":"configs/voicegen-server.config.toml"}
2023/08/14 10:49:38 info {"msg":"server initializing"}
2023/08/14 10:49:41 info {"msg":"server started","grpcAddr":"[::]:2727","httpApiAddr":"[::]:8080","httpOpsAddr":"[::]:8081"}
Info
If you are hosting your server with Transport Layer Security (TLS) enabled, then please follow the instructions under Connect with TLS. Otherwise, you can follow the instructions for the Default Connection method.
Default Connection
The following code snippet connects to the server and queries its version. It connects to the server using an “insecure” gRPC channel. This would be the case if you have just started up a local instance of VoiceGen server without TLS enabled.
import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceGenServiceStub(channel)
# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)
package main
import (
"context"
"fmt"
"os"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicegenpb.NewVoiceGenServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
}
Connect with TLS
-
In our recommended setup for deployment, TLS is enabled in the gRPC connection, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers. -
The following snippets show how to connect to a VoiceGen Server that has TLS enabled. They use Cobalt’s self-hosted demo server at demo.cobaltspeech.com:2727, but you should use your own server instance.
Note
Commercial use of the demo server at demo.cobaltspeech.com:2727 is not permitted.
This server is for testing and demonstration purposes only and is not guaranteed to
support high availability or high volume. Data uploaded to the server may be stored
for internal purposes.
import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen
serverAddress = "demo.cobaltspeech.com:2727"
# Setup a gRPC connection with TLS. You can optionally provide your own
# root certificates and private key to grpc.ssl_channel_credentials()
# for mutually authenticated TLS.
creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(serverAddress, creds)
client = stub.VoiceGenServiceStub(channel)
# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)
package main
import (
"context"
"crypto/tls"
"fmt"
"os"
"time"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials"
voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)
func main() {
const (
serverAddress = "demo.cobaltspeech.com:2727"
connectTimeout = 10 * time.Second
)
// Setup a gRPC connection with TLS. You can optionally provide your own
// root certificates and private key through tls.Config for mutually
// authenticated TLS.
tlsCfg := tls.Config{}
creds := credentials.NewTLS(&tlsCfg)
ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(creds),
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicegenpb.NewVoiceGenServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
}
Client Authentication
-
In some setups, the server may also need to validate clients connecting to it and respond only to the ones it can verify. If your VoiceGen server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.
-
Please note that in the client-authentication mode, the client will still also verify the server’s certificate, and therefore this setup uses mutually authenticated TLS.
-
The following snippets show how to present client certificates when setting up the credentials. These could then be used in the same way as the examples above to connect to a TLS enabled server.
creds = grpc.ssl_channel_credentials(
root_certificates=root_certificates, # PEM certificate as byte string
private_key=private_key, # PEM client key as byte string
certificate_chain=certificate_chain, # PEM client certificate as byte string
)
package main
import (
// ...
"crypto/tls"
"crypto/x509"
"fmt"
"os"
// ...
)
func main() {
// ...
// Root PEM certificate for validating self-signed server certificate
var rootCert []byte
// Client PEM certificate and private key.
var certPem, keyPem []byte
caCertPool := x509.NewCertPool()
if ok := caCertPool.AppendCertsFromPEM(rootCert); !ok {
fmt.Printf("unable to use given caCert\n")
os.Exit(1)
}
clientCert, err := tls.X509KeyPair(certPem, keyPem)
if err != nil {
fmt.Printf("unable to use given client certificate and key: %v\n", err)
os.Exit(1)
}
tlsCfg := tls.Config{
RootCAs: caCertPool,
Certificates: []tls.Certificate{clientCert},
}
creds := credentials.NewTLS(&tlsCfg)
// ...
}
3.1.4 - Streaming Synthesis
- The following example shows how to synthesize streaming audio from text using VoiceGen’s StreamingSynthesize request. The audio can be played back as it is being streamed as well as being saved to a file or buffer.
Synthesizing streaming audio and writing to a file
-
We support streaming several headered file formats including WAV, MP3, FLAC, etc., as well as streaming raw audio samples. For more details, please see the protocol buffer specification here.
-
The examples below show how to submit a chunk of text and receive streaming audio which is written to a file. We will query the server for available models and use the first model for synthesis.
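As a quick sanity check on what the examples write to disk, the header of a WAV output can be inspected with Python's built-in wave module. This is an illustrative sketch, not part of the VoiceGen SDK: it builds a short silent WAV in memory to stand in for the saved output, and the 22050 Hz rate is an assumed stand-in for a model's native sample rate.

```python
import io
import wave

# Assumed stand-in for a model's native sample rate; real code would use
# the value reported by ListModels in native_audio_format.
SAMPLE_RATE = 22050

# Build one second of silent 16-bit mono WAV in memory, standing in for
# the "output.wav" file written by the streaming example.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # channels=1
    w.setsampwidth(2)       # bit_depth=16 -> 2 bytes per sample
    w.setframerate(SAMPLE_RATE)
    w.writeframes(b"\x00\x00" * SAMPLE_RATE)

# Read the header back and confirm it matches the requested format.
buf.seek(0)
with wave.open(buf, "rb") as w:
    print(w.getnchannels(), w.getsampwidth() * 8, w.getframerate())
    # -> 1 16 22050
```

The same read-back check works on a real `output.wav` by passing the filename to `wave.open` instead of an in-memory buffer.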
import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceGenServiceStub(channel)
# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicegen.ListModelsRequest())
# A model may be a single-speaker model or a multi-speaker model.
# The speakers available for a model will be printed in the model
# attributes below.
print("Models:")
for model in modelResp.models:
print(model)
# Going with the first model in this example. Also using the first
# speaker available in the model (in case of single-speaker models,
# it is the *only* speaker).
model = modelResp.models[0]
spk = model.attributes.speakers[0]
# Set the synthesis config.
#
# - We could set speaker_id to None to let the server use the default
# speaker configured on the server side.
#
# - We are specifying the output audio format to be WAV with 16 bit signed
# samples, at the model's native sampling rate.
cfg = voicegen.SynthesisConfig(
model_id=model.id,
speaker_id=spk.id,
audio_format=voicegen.AudioFormat(
codec=voicegen.AUDIO_CODEC_WAV,
sample_rate=model.attributes.native_audio_format.sample_rate,
encoding=voicegen.AUDIO_ENCODING_SIGNED,
bit_depth=16,
channels=1,
byte_order=voicegen.BYTE_ORDER_LITTLE_ENDIAN,
),
)
# Specifying text to synthesize, which could be a single line or multiple paragraphs.
# VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
# any line breaks specified in the input text. We intentionally put line breaks here
# to make it look a bit nicer in the code, which are replaced with spaces.
text = voicegen.SynthesisText(text='''
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.
The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.
Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
'''.replace("\n", " "))
# Submitting request to the server and writing streamed audio chunks to file.
print("Synthesizing ...")
with open("output.wav", 'wb') as f:
for resp in client.StreamingSynthesize(voicegen.StreamingSynthesizeRequest(config=cfg, text=text)):
f.write(resp.audio.data)
package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"strings"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicegenpb.NewVoiceGenServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicegenpb.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
// A model may be a single-speaker model or a multi-speaker model.
// The speakers available for a model will be printed in the model
// attributes below.
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Going with the first model in this example. Also using the first
// speaker available in the model (in case of single-speaker models,
// it is the *only* speaker).
model := modelResp.Models[0]
spk := model.Attributes.Speakers[0]
// Set the synthesis config.
//
// - We could set speaker_id to None to let the server use the default
// speaker configured on the server side.
//
// - We are specifying the output audio format to be WAV with 16 bit signed
// samples, at the model's native sampling rate.
cfg := &voicegenpb.SynthesisConfig{
ModelId: model.Id,
SpeakerId: spk.Id,
AudioFormat: &voicegenpb.AudioFormat{
Codec: voicegenpb.AudioCodec_AUDIO_CODEC_WAV,
SampleRate: model.Attributes.NativeAudioFormat.SampleRate,
Encoding: voicegenpb.AudioEncoding_AUDIO_ENCODING_SIGNED,
BitDepth: 16,
Channels: 1,
ByteOrder: voicegenpb.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
},
}
// Specifying text to synthesize, which could be a single line or multiple paragraphs.
// VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
// any line breaks specified in the input text. We intentionally put line breaks here
// to make it look a bit nicer in the code, which are replaced with spaces.
text := &voicegenpb.SynthesisText{Text: strings.ReplaceAll(`
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.
The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.
Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
`, "\n", " ")}
// Submitting request to the server and writing streamed audio chunks to file.
fmt.Println("Synthesizing ...")
stream, err := client.StreamingSynthesize(context.Background(), &voicegenpb.StreamingSynthesizeRequest{Config: cfg, Text: text})
if err != nil {
fmt.Printf("failed to start synthesis stream: %v\n", err)
os.Exit(1)
}
// Opening output audio file.
outF, err := os.Create("output.wav")
if err != nil {
fmt.Printf("failed to open output audio file: %v\n", err)
os.Exit(1)
}
defer outF.Close()
// Receiving audio and writing to file.
for {
resp, err := stream.Recv()
if errors.Is(err, io.EOF) {
return
}
if err != nil {
fmt.Printf("error encountered while synthesizing: %v\n", err)
os.Exit(1)
}
audio := resp.GetAudio()
if audio == nil {
fmt.Printf("error encountered while synthesizing: server returned nil audio")
os.Exit(1)
}
outF.Write(audio.Data)
}
}
Synthesizing streaming audio with live playback
-
The synthesized audio stream can be played back live instead of saving it to a file by writing the data to an appropriate interface that can do the playback; typically this requires interaction with system libraries. Another option is to pipe the audio out to an external command line tool like sox. -
The examples below use the latter approach, using the play command provided with sox to play the synthesized audio stream live.
import subprocess
import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen
serverAddress = "localhost:2727"
# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceGenServiceStub(channel)
# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)
# Get list of models on the server.
modelResp = client.ListModels(voicegen.ListModelsRequest())
# A model may be a single-speaker model or a multi-speaker model.
# The speakers available for a model will be printed in the model
# attributes below.
print("Models:")
for model in modelResp.models:
print(model)
# Going with the first model in this example. Also using the first
# speaker available in the model (in case of single-speaker models,
# it is the *only* speaker).
model = modelResp.models[0]
spk = model.attributes.speakers[0]
# Set the synthesis config.
#
# - We could set speaker_id to None to let the server use the default
# speaker configured on the server side.
#
# - We are specifying the output audio format to be WAV with 16 bit signed
# samples, at the model's native sampling rate.
cfg = voicegen.SynthesisConfig(
model_id=model.id,
speaker_id=spk.id,
audio_format=voicegen.AudioFormat(
codec=voicegen.AUDIO_CODEC_WAV,
sample_rate=model.attributes.native_audio_format.sample_rate,
encoding=voicegen.AUDIO_ENCODING_SIGNED,
bit_depth=16,
channels=1,
byte_order=voicegen.BYTE_ORDER_LITTLE_ENDIAN,
),
)
# Specifying text to synthesize, which could be a single line or multiple paragraphs.
# VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
# any line breaks specified in the input text. We intentionally put line breaks here
# to make it look a bit nicer in the code, which are replaced with spaces.
text = voicegen.SynthesisText(text='''
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.
The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.
Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
'''.replace("\n", " "))
# Open playback stream using sox's play command as subprocess.
cmd = "play -t wav -"
play = subprocess.Popen(cmd.split(), stdin=subprocess.PIPE)
out = play.stdin
# Submitting request to the server and writing streamed audio chunks to playback stream.
print("Synthesizing ...")
for resp in client.StreamingSynthesize(voicegen.StreamingSynthesizeRequest(config=cfg, text=text)):
out.write(resp.audio.data)
out.close()
play.wait()
play.kill()
package main
import (
"context"
"errors"
"fmt"
"io"
"os"
"os/exec"
"strings"
"golang.org/x/sync/errgroup"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)
func main() {
const (
serverAddress = "localhost:2727"
)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
opts := []grpc.DialOption{
grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
grpc.WithBlock(),
grpc.WithReturnConnectionError(),
grpc.FailOnNonTempDialError(true),
}
conn, err := grpc.DialContext(ctx, serverAddress, opts...)
if err != nil {
fmt.Printf("failed to dial gRPC connection: %v\n", err)
os.Exit(1)
}
client := voicegenpb.NewVoiceGenServiceClient(conn)
// Get server version.
versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
if err != nil {
fmt.Printf("failed to get server version: %v\n", err)
os.Exit(1)
}
fmt.Printf("%v\n", versionResp)
// Get list of models on the server.
modelResp, err := client.ListModels(ctx, &voicegenpb.ListModelsRequest{})
if err != nil {
fmt.Printf("failed to get model list: %v\n", err)
os.Exit(1)
}
// A model may be a single-speaker model or a multi-speaker model.
// The speakers available for a model will be printed in the model
// attributes below.
fmt.Println("Models:")
for _, m := range modelResp.Models {
fmt.Println(m)
}
fmt.Println()
// Going with the first model in this example. Also using the first
// speaker available in the model (in case of single-speaker models,
// it is the *only* speaker).
model := modelResp.Models[0]
spk := model.Attributes.Speakers[0]
// Set the synthesis config.
//
// - We could set speaker_id to None to let the server use the default
// speaker configured on the server side.
//
// - We are specifying the output audio format to be WAV with 16 bit signed
// samples, at the model's native sampling rate.
cfg := &voicegenpb.SynthesisConfig{
ModelId: model.Id,
SpeakerId: spk.Id,
AudioFormat: &voicegenpb.AudioFormat{
Codec: voicegenpb.AudioCodec_AUDIO_CODEC_WAV,
SampleRate: model.Attributes.NativeAudioFormat.SampleRate,
Encoding: voicegenpb.AudioEncoding_AUDIO_ENCODING_SIGNED,
BitDepth: 16,
Channels: 1,
ByteOrder: voicegenpb.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
},
}
// Specifying text to synthesize, which could be a single line or multiple paragraphs.
// VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
// any line breaks specified in the input text. We intentionally put line breaks here
// to make it look a bit nicer in the code, which are replaced with spaces.
text := &voicegenpb.SynthesisText{Text: strings.ReplaceAll(`
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.
The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.
Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
`, "\n", " ")}
// Starting routines to receive audio from server and write to playback stream;
// using an errgroup.Group that returns if either one encounters an error.
eg, ctx := errgroup.WithContext(context.Background())
// Submitting request to the server and writing streamed audio chunks to file.
fmt.Println("Synthesizing ...")
stream, err := client.StreamingSynthesize(ctx, &voicegenpb.StreamingSynthesizeRequest{Config: cfg, Text: text})
if err != nil {
fmt.Printf("failed to start synthesis stream: %v\n", err)
os.Exit(1)
}
// Open playback stream using sox's play command as a subprocess.
cmd := exec.CommandContext(ctx, "play", "-t", "wav", "-")
cmd.Stderr = os.Stderr
outW, err := cmd.StdinPipe()
if err != nil {
fmt.Printf("failed to open playback stream: %v\n", err)
os.Exit(1)
}
eg.Go(func() error {
if err := cmd.Run(); err != nil {
return fmt.Errorf("error encountered in audio playback: %w", err)
}
return nil
})
eg.Go(func() error {
defer outW.Close()
// Receiving audio and writing to playback stream.
for {
resp, err := stream.Recv()
if errors.Is(err, io.EOF) {
return nil
}
if err != nil {
return fmt.Errorf("error encountered while synthesizing: %w", err)
}
audio := resp.GetAudio()
if audio == nil {
return fmt.Errorf("error encountered while synthesizing: server returned nil audio")
}
outW.Write(audio.Data)
}
})
if err := eg.Wait(); err != nil {
fmt.Println(err)
os.Exit(1)
}
}
3.1.5 - API Reference
The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.
This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.
Table of Contents
- Table of Contents
- VoiceGenService
- Messages
- Enums
- Scalar Value Types
VoiceGenService
Service that implements the Cobalt VoiceGen API.
Version
Version(VersionRequest) VersionResponse
Returns version information from the server.
ListModels
ListModels(ListModelsRequest) ListModelsResponse
ListModels returns information about the models the server can access.
StreamingSynthesize
StreamingSynthesize(StreamingSynthesizeRequest) StreamingSynthesizeResponse
Performs text-to-speech synthesis and streams synthesized audio. This method is only available via gRPC and not via HTTP+JSON. However, a web browser may use websockets to access this service.
Messages
- If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of those fields populated.
- If a field is labeled repeated, then the generated code will accept an array (or slice, or list, depending on the language).
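To illustrate the oneof rule above, here is a small language-agnostic sketch; the field names text and ssml are hypothetical and not part of the VoiceGen API. A call using a message with a oneof group is valid only when exactly one of the grouped fields is populated.

```python
# Hypothetical oneof group with two fields; exactly one may be populated.
def oneof_valid(**fields):
    """Return True when exactly one field in the group is populated."""
    return sum(value is not None for value in fields.values()) == 1

print(oneof_valid(text="hello", ssml=None))    # exactly one set -> True
print(oneof_valid(text="hello", ssml="<s/>"))  # both set -> False
print(oneof_valid(text=None, ssml=None))       # none set -> False
```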
AudioFormat
Details of the audio format
Fields
-
sample_rate (uint32 ) Sampling rate in Hz.
-
channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc.
-
bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.).
-
codec (AudioCodec ) Codec of the samples.
-
encoding (AudioEncoding ) Encoding of the samples.
-
byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.
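For example, a single 16-bit signed sample serializes to different byte sequences depending on the byte order. This standalone sketch with Python's standard struct module shows the two layouts the enum distinguishes:

```python
import struct

sample = 1000  # a single 16-bit signed PCM sample (0x03E8)

little = struct.pack("<h", sample)  # BYTE_ORDER_LITTLE_ENDIAN
big = struct.pack(">h", sample)     # BYTE_ORDER_BIG_ENDIAN

print(little.hex())  # -> e803  (least significant byte first)
print(big.hex())     # -> 03e8  (most significant byte first)
```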
ListModelsRequest
The top-level message sent by the client for the ListModels method.
ListModelsResponse
The message returned to the client by the ListModels method.
Fields
- models (ModelInfo repeated) List of models available for use on the VoiceGen server.
ModelAttributes
Attributes of a VoiceGen Model
Fields
-
language (string ) Language of the model.
-
phone_set (PhoneSet ) The set of phonemes this model uses to represent how words should be pronounced.
-
native_audio_format (AudioFormat ) Native audio format of the model. This will be used as the default value if the audio format in SynthesisConfig is not specified. -
supported_features (ModelFeatures ) Supported model features.
-
speakers (SpeakerInfo repeated) List of speakers available for use in this model.
ModelFeatures
Fields
-
speech_rate (bool ) This is set to true if the model can be configured to synthesize audio at different talking speeds.
-
variation_scale (bool ) This is set to true if the model can be configured to synthesize audio for a given text input differently than usual by varying stresses and emphasis on different parts of the audio. This feature is useful for making the audio sound slightly different each time to avoid making it feel monotonous.
ModelInfo
Description of a Cobalt VoiceGen Model
Fields
-
id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for synthesis, and is specified in the
SynthesisConfig message. -
name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their synthesis task.
-
attributes (ModelAttributes ) Model attributes.
SpeakerAttributes
Attributes of a speaker
Fields
- language (string ) Language of the speaker. This can be different from the model language, e.g. an English model with different accents: en-US, en-GB, en-IN, etc.
SpeakerInfo
Description of a speaker
Fields
-
id (string ) Unique identifier of the speaker. This identifier is used to choose the speaker that should be used for synthesis, and is specified in the
SynthesisConfig message. -
name (string ) Speaker name. This is a concise name describing the speaker, and may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
-
description (string ) Speaker description. This may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
-
attributes (SpeakerAttributes ) Speaker attributes.
StreamingSynthesizeRequest
The top-level messages sent by the client for the StreamingSynthesize
method.
Fields
-
config (SynthesisConfig )
-
text (SynthesisText )
StreamingSynthesizeResponse
The top-level message sent by the server for the StreamingSynthesize
method. In this streaming call, multiple StreamingSynthesizeResponse
messages contain SynthesizedAudio.
Fields
- audio (SynthesizedAudio )
SynthesisConfig
Configuration for setting up a Synthesizer
Fields
-
model_id (string ) Unique identifier of the model to use, as obtained from a
ModelInfo message. -
speaker_id (string ) Unique identifier of the speaker to use, as obtained from a
SpeakerInfo message. -
audio_format (AudioFormat ) Format of the audio to be produced by synthesis. If no value is specified, the native audio format of the specified model will be used as the default. The native audio format can be obtained from the ModelAttributes message. -
speech_rate (float ) The speech rate for synthesized audio. If unset, then the default speech rate of a given model is used. Otherwise a value > 0 should be used, with higher values resulting in faster speech. This field only has an effect on the synthesized audio if the model supports it, which can be ascertained from the
ModelAttributes.supported_features. -
variation_scale (float ) A scale with values > 0, to determine how much to randomly vary the synthesized audio by altering stresses and emphasis on different parts of the audio. Higher values correspond to greater variation. This field only has an affect on the synthesized audio if the model supports it, which can be ascertained from the
ModelAttributes.supported_features.
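The constraints above (required identifiers, optional rate and variation values that must be > 0 when set) can be checked client-side before sending a request. A sketch using a plain dict as a stand-in for the proto message; this hypothetical helper is not part of the SDK:

```python
def validate_synthesis_config(model_id, speaker_id,
                              speech_rate=None, variation_scale=None):
    """Build a SynthesisConfig-shaped dict, enforcing the documented
    constraints: required IDs, and optional float fields that must be > 0."""
    if not model_id or not speaker_id:
        raise ValueError("model_id and speaker_id are required")
    if speech_rate is not None and speech_rate <= 0:
        raise ValueError("speech_rate must be > 0 when set")
    if variation_scale is not None and variation_scale <= 0:
        raise ValueError("variation_scale must be > 0 when set")
    cfg = {"model_id": model_id, "speaker_id": speaker_id}
    if speech_rate is not None:
        cfg["speech_rate"] = speech_rate
    if variation_scale is not None:
        cfg["variation_scale"] = variation_scale
    return cfg
```

Leaving speech_rate and variation_scale out of the message entirely is how a client requests the model defaults.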
SynthesisText
Text input to be sent to the synthesizer
Fields
- text (string)
SynthesizedAudio
Synthesized audio from the synthesizer
Fields
- data (bytes)
VersionRequest
The top-level message sent by the client for the Version method.
VersionResponse
The top-level message sent by the server for the Version method.
Fields
- version (string) Version of the server handling these requests.
Enums
AudioCodec
The container format of the audio data produced by synthesis.
| Name | Number | Description |
|---|---|---|
| AUDIO_CODEC_UNSPECIFIED | 0 | AUDIO_CODEC_UNSPECIFIED is the default value of this type. |
| AUDIO_CODEC_WAV | 1 | WAV with RIFF headers |
| AUDIO_CODEC_RAW | 2 | Raw data without any headers |
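With AUDIO_CODEC_RAW the bytes carry no header, so the client must already know the sample rate, channel count, and sample width (e.g. from the model's native audio format). A minimal sketch of wrapping raw 16-bit PCM in a RIFF/WAV container with Python's standard wave module; the default parameter values are illustrative assumptions, not SDK defaults:

```python
import io
import wave

def raw_pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
                   channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap headerless PCM (AUDIO_CODEC_RAW) in a RIFF/WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()
```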
AudioEncoding
The sample encoding of the audio data produced by synthesis.
| Name | Number | Description |
|---|---|---|
| AUDIO_ENCODING_UNSPECIFIED | 0 | AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error. |
| AUDIO_ENCODING_SIGNED | 1 | PCM signed-integer |
| AUDIO_ENCODING_UNSIGNED | 2 | PCM unsigned-integer |
| AUDIO_ENCODING_IEEE_FLOAT | 3 | PCM IEEE-Float |
| AUDIO_ENCODING_ULAW | 4 | G.711 mu-law |
| AUDIO_ENCODING_ALAW | 5 | G.711 a-law |
ByteOrder
Byte order of multi-byte data
| Name | Number | Description |
|---|---|---|
| BYTE_ORDER_UNSPECIFIED | 0 | BYTE_ORDER_UNSPECIFIED is the default value of this type. |
| BYTE_ORDER_LITTLE_ENDIAN | 1 | Little Endian byte order |
| BYTE_ORDER_BIG_ENDIAN | 2 | Big Endian byte order |
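Byte order only matters for multi-byte sample encodings such as 16-bit AUDIO_ENCODING_SIGNED. A quick illustration with Python's struct module of the same signed 16-bit sample in each byte order:

```python
import struct

sample = 258  # 0x0102 as a signed 16-bit value

little = struct.pack("<h", sample)  # BYTE_ORDER_LITTLE_ENDIAN: b"\x02\x01"
big = struct.pack(">h", sample)     # BYTE_ORDER_BIG_ENDIAN:    b"\x01\x02"
```

Interpreting the stream with the wrong byte order silently corrupts the audio, so a client should check this field before decoding raw samples.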
PhoneSet
PhoneSet is a set of phonemes used for word pronunciations.
| Name | Number | Description |
|---|---|---|
| PHONE_SET_UNSPECIFIED | 0 | PHONE_SET_UNSPECIFIED is the default value of this type. |
| PHONE_SET_IPA | 1 | IPA phoneme set |
| PHONE_SET_XSAMPA | 2 | X-SAMPA phoneme set |
| PHONE_SET_ARPABET | 3 | ARPAbet phoneme set |
Scalar Value Types