VoiceGen

Low latency, on-prem / on-cloud solutions for highly natural streaming text to speech synthesis.

1: Getting Started

2: Generating SDKs

3: Connecting to the Server

4: Streaming Synthesis

5: API Reference

1 - Getting Started

How to get a VoiceGen Server running on your system

Using Cobalt VoiceGen

A typical VoiceGen release, provided as a compressed archive, contains a Linux binary (voicegen-server) for the required native CPU architecture, an appropriate Dockerfile, and models.
Cobalt VoiceGen runs either locally on Linux or using Docker.
Cobalt VoiceGen serves the gRPC API on port 2727. A web demo is enabled on port 8080.
To quickly try out VoiceGen, first start the server as shown below and open the web demo at http://localhost:8080 in your browser to input text and play / download synthesized audio. You can also use the SDK in your preferred language to use VoiceGen from the command line or within your application.
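Once the server is running, a quick reachability check of the two default ports can save some debugging time. The following is a minimal sketch using only the Python standard library; the helper names `port_is_open` and `check_voicegen_ports` are hypothetical and not part of any VoiceGen SDK.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_voicegen_ports(host: str = "localhost") -> dict:
    """Probe the default gRPC (2727) and web demo (8080) ports."""
    return {
        "grpc": port_is_open(host, 2727),
        "web_demo": port_is_open(host, 8080),
    }
```

Calling `check_voicegen_ports()` returns a dictionary with one boolean per port. A successful TCP connection only shows that something is listening; it does not confirm that the listener is VoiceGen.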

Info

The cobalt.license.key file is provided separately and must be copied into the directory that results from decompressing the archive. Please do this before running the steps below.

Running VoiceGen Server Locally on Linux

./voicegen-server

By default, the binary assumes the presence of a configuration file, located in the same directory, named: voicegen-server.cfg.toml. A different config file may be specified using the --config argument.
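If you launch the server from a script, the command line can be assembled programmatically. This is a rough sketch, assuming the `--config` flag takes the file path as the following argument; the function name `server_command` and the example paths are hypothetical.

```python
from pathlib import Path
from typing import List, Optional


def server_command(release_dir: str, config: Optional[str] = None) -> List[str]:
    """Build the argument list for starting voicegen-server.

    When config is None, no --config flag is added and the binary falls
    back to its default voicegen-server.cfg.toml.
    """
    cmd = [str(Path(release_dir) / "voicegen-server")]
    if config is not None:
        # Assumption: the flag is passed as "--config <path>".
        cmd += ["--config", config]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` or `subprocess.Popen(cmd)` to start the server.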

Running VoiceGen Server as a Docker Container

To build and run the Docker image for VoiceGen, run:

docker build -t cobalt-voicegen .
docker run -p 2727:2727 -p 8080:8080 cobalt-voicegen

How to Get a Copy of the VoiceGen Server and Models

The release you will receive is a compressed archive (tar.bz2) and is generally structured as follows:

release.tar.bz2
├── COPYING
├── README.md
├── voicegen-server
├── voicegen-server.cfg.toml
├── Dockerfile
├── models
│   └── en_US-multispeaker-22050hz
│
└── cobalt.license.key [ provided separately, needs to be copied over ]

The README.md file contains information about this release and instructions for how to start the server on your system.
The voicegen-server is the server program which is configured using the voicegen-server.cfg.toml file.
The Dockerfile can be used to create a container that will let you run VoiceGen server on non-Linux systems such as macOS and Windows.
The models directory contains the speech synthesis models. The content of this directory will depend on the models you are provided.
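Before starting the server, it can help to confirm that the unpacked release contains the entries listed above, including the separately provided license key. The sketch below assumes the layout shown in the tree; the helper name `missing_release_entries` is hypothetical.

```python
from pathlib import Path
from typing import List

# Entries taken from the release layout shown above.
EXPECTED_ENTRIES = [
    "voicegen-server",
    "voicegen-server.cfg.toml",
    "Dockerfile",
    "models",
    "cobalt.license.key",
]


def missing_release_entries(release_dir: str) -> List[str]:
    """Return the expected files or directories that are absent."""
    root = Path(release_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

An empty list means every expected entry was found. A result of `["cobalt.license.key"]` is a reminder that the license file still needs to be copied over.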

System Requirements

Cobalt VoiceGen runs on Linux. You can run it directly as a Linux application.

You can evaluate the product on Windows or Linux using Docker Desktop but we would not recommend this setup for use in a production environment.

A Cobalt VoiceGen release typically includes a single model together with binaries and config files. VoiceGen models may take up to 250MB of disk space, and need a minimum of 2GB RAM when evaluating locally. For production workloads, we recommend configuring containerized applications with each instance allocated with 4 CPUs and 4GB RAM.

Cobalt VoiceGen runs on x86_64 CPUs. We also support Arm64 CPUs, including processors such as the Graviton (AWS c7g EC2 instances). VoiceGen is significantly more cost effective to run on C7g instances compared to similarly sized Intel or AMD processors, and we can provide you an Arm64 release on request.

To integrate Cobalt VoiceGen into your application, please follow the next steps to install or generate the SDK in a language of your choice.

2 - Generating SDKs

Gives instructions about how to generate an SDK for your project from the proto API definition.

APIs for all of Cobalt’s services are defined as a protocol buffer specification, or simply a proto file, and can be found in the cobaltspeech/proto GitHub repository.
The proto file allows a developer to auto-generate client SDKs for a number of different programming languages. Step by step instructions for generating your own SDK can be found below.
We provide pre-generated SDKs for a couple of languages. You can choose to use these instead of generating your own. These are listed here along with instructions on how to install / import them into your projects.

Pre-generated SDKs

Golang

Pre-generated SDK files for Golang can be found in the cobaltspeech/go-genproto repo.
To use it in your Go project, simply import it:

import voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"

An example client using the above repo can be found here.

Python

Pre-generated SDK files for Python can be found in the cobaltspeech/py-genproto repo.
The Python SDK depends on Python >= 3.5. You may use pip to perform a system-wide install, or use virtualenv for a local install. To use it in your Python project, install it:

pip install --upgrade pip
pip install "git+https://github.com/cobaltspeech/py-genproto"

Generating SDKs

Step 1. Installing `buf`

To work with proto files, we recommend using buf, a user-friendly command line tool that can be configured to generate documentation, schemas and SDK code for different languages.

# Latest version as of March 14th, 2023.
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/bin"

VERSION="1.15.1"
URL="https://github.com/bufbuild/buf/releases/download/v${VERSION}/buf-$(uname -s)-$(uname -m)"
curl -L "${URL}" -o "${COBALT}/bin/buf"

# Give executable permissions and add to $PATH.
chmod +x "${COBALT}/bin/buf"
export PATH="${PATH}:${COBALT}/bin"

# Alternatively, install buf using Homebrew.
brew install bufbuild/buf/buf
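The download URL in the script above is assembled from the buf version and the output of `uname -s` and `uname -m`. As an illustration, the same string can be built in Python, where `platform.system()` and `platform.machine()` report the equivalent values on Linux and macOS. The function name `buf_download_url` is hypothetical.

```python
import platform


def buf_download_url(version: str, system: str = "", machine: str = "") -> str:
    """Build the buf release URL used in the shell snippet above."""
    system = system or platform.system()      # e.g. "Linux" or "Darwin"
    machine = machine or platform.machine()   # e.g. "x86_64"
    return (
        "https://github.com/bufbuild/buf/releases/download/"
        f"v{version}/buf-{system}-{machine}"
    )
```

Passing `system` and `machine` explicitly is useful when preparing a download for a machine other than the one running the script.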

Step 2. Getting `proto` files

Clone the cobaltspeech/proto repository:

COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"

# Change this to where you want to clone the repo to.
PROTO_REPO="${COBALT}/git/proto"

git clone https://github.com/cobaltspeech/proto "${PROTO_REPO}"

Step 3. Generating code

The cobaltspeech/proto repo provides a buf.gen.yaml config file to get you started with a couple of languages.
Other plugins can be added to the buf.gen.yaml file to generate SDK code for more languages.
To generate the SDKs, simply run the following (assuming the buf binary is in your $PATH):

cd "${PROTO_REPO}"

# Removing any previously generated files.
rm -rf ./gen

# Generating code for all proto files inside the `proto` directory.
buf generate proto

You should now have a folder called gen inside ${PROTO_REPO} that contains the generated code. The latest version of the VoiceGen API is v1. You can import / include / copy the generated files into your projects as per the conventions of different languages.

Python
Golang

gen
├── ... other languages ...
└── py
  └── cobaltspeech
    ├── ... other services ...
    └── voicegen
      └── v1
        ├── voicegen_pb2_grpc.py
        ├── voicegen_pb2.py
        └── voicegen_pb2.pyi

gen
├── ... other languages ...
└── go
    ├── cobaltspeech
    │   ├── ...
    │   └── voicegen
    │       └── v1
    │           ├── voicegen_grpc.pb.go
    │           └── voicegen.pb.go
    └── gw
        └── cobaltspeech
            ├── ...
            └── voicegen
                └── v1
                    └── voicegen.pb.gw.go
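For Python, one way to use the generated files without copying them is to put the `gen/py` directory on the import path. The sketch below assumes the directory layout shown above and relies on Python treating directories without `__init__.py` as namespace packages; the helper name `add_generated_python_sdk` is hypothetical.

```python
import sys
from pathlib import Path


def add_generated_python_sdk(proto_repo: str) -> Path:
    """Make <proto_repo>/gen/py importable and return that directory."""
    gen_py = Path(proto_repo).expanduser() / "gen" / "py"
    if not (gen_py / "cobaltspeech" / "voicegen" / "v1").is_dir():
        raise FileNotFoundError(f"no generated VoiceGen code under {gen_py}")
    if str(gen_py) not in sys.path:
        sys.path.insert(0, str(gen_py))
    return gen_py
```

After calling it with the path of your cloned proto repository, `import cobaltspeech.voicegen.v1.voicegen_pb2` should resolve to the generated module, provided the gRPC and protobuf dependencies from the next step are installed.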

Step 4. Installing gRPC and protobuf

A couple of gRPC and protobuf dependencies are required along with the code generated above. The method of installing them depends on the programming language being used.
These dependencies and the most common way of installing / including them are listed below for some chosen languages.

# It is encouraged to do this inside a Python virtual environment
# to avoid creating version conflicts for other scripts that may
# be using these libraries.

pip install --upgrade protobuf
pip install --upgrade grpcio
pip install --upgrade google-api-python-client

go get google.golang.org/protobuf
go get google.golang.org/grpc
go get google.golang.org/genproto

# More details on grpc installation can be found at:
# https://grpc.io/docs/languages/cpp/quickstart/

COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"

# Latest version as of 14th March, 2023.
VERSION="v1.52.0"
GRPC_REPO="${COBALT}/git/grpc-${VERSION}"

git clone \
 --recurse-submodules --depth 1 --shallow-submodules \
 -b "${VERSION}" \
 https://github.com/grpc/grpc ${GRPC_REPO}

cd "${GRPC_REPO}"
mkdir -p cmake/build

# Change this to where you want to install libprotobuf and libgrpc.
# It is encouraged to install gRPC locally as there is no easy way to
# uninstall gRPC after you’ve installed it globally.

INSTALL_DIR="${COBALT}"

cd cmake/build
cmake \
 -DgRPC_INSTALL=ON \
 -DgRPC_BUILD_TESTS=OFF \
 -DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} \
 ../..

make -j
make install

3 - Connecting to the Server

Describes how to connect to a running Cobalt VoiceGen server instance.

Once you have your VoiceGen server up and running, and have installed or generated the SDK for your project, you can connect to a running instance of VoiceGen server, by “dialing” a gRPC connection.
First, you need to know the address where the server is running: e.g. host:grpc_port. By default, this is localhost:2727 and should be logged to the terminal when you first start VoiceGen server as grpcAddr:

2023/08/14 10:49:38 info  {"license":"Copyright © 2015--present. Cobalt Speech and Language, Inc.  For additional details, including information about open source components used in this software, please see the COPYING file bundled with this program."}
2023/08/14 10:49:38 info  {"msg":"reading config file","path":"configs/voicegen-server.config.toml"}
2023/08/14 10:49:38 info  {"msg":"server initializing"}
2023/08/14 10:49:41 info  {"msg":"server started","grpcAddr":"[::]:2727","httpApiAddr":"[::]:8080","httpOpsAddr":"[::]:8081"}
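If you start the server from a script, the gRPC port can be read from the "server started" log line rather than hard-coded. This is a minimal sketch that assumes the log format shown above (a timestamp and level followed by a JSON object); the function name `grpc_port_from_log` is hypothetical.

```python
import json
from typing import Optional


def grpc_port_from_log(line: str) -> Optional[int]:
    """Extract the port from the grpcAddr field of a server log line."""
    start = line.find("{")
    if start < 0:
        return None
    try:
        fields = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    addr = fields.get("grpcAddr")
    if not addr:
        return None
    # The address looks like "[::]:2727"; the port follows the last colon.
    return int(addr.rsplit(":", 1)[1])
```

Lines without a `grpcAddr` field, such as the "server initializing" message, yield `None`.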

Info

If you are hosting your server with Transport Layer Security (TLS) enabled, then please follow the instructions under Connect with TLS. Otherwise, you can follow the instructions for the Default Connection method.

Default Connection

The following code snippet connects to the server and queries its version. It connects to the server using an “insecure” gRPC channel. This would be the case if you have just started up a local instance of VoiceGen server without TLS enabled.

Python
Go

import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceGenServiceStub(channel)

# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)

package main

import (
	"context"
	"fmt"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)

func main() {
	const (
		serverAddress  = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicegenpb.NewVoiceGenServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)
}

Connect with TLS

In our recommended setup for deployment, TLS is enabled in the gRPC connection, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers.
The following snippets show how to connect to a VoiceGen server that has TLS enabled. They use Cobalt’s self-hosted demo server at demo.cobaltspeech.com:2727, but you would typically use your own server instance.

Note

Commercial use of the demo server at demo.cobaltspeech.com:2727 is not permitted. This server is for testing and demonstration purposes only and is not guaranteed to support high availability or high volume. Data uploaded to the server may be stored for internal purposes.

Python
Go

import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen

serverAddress = "demo.cobaltspeech.com:2727"

# Setup a gRPC connection with TLS. You can optionally provide your own
# root certificates and private key to grpc.ssl_channel_credentials()
# for mutually authenticated TLS.
creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(serverAddress, creds)
client = stub.VoiceGenServiceStub(channel)

# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"

	voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)

func main() {
	const (
		serverAddress  = "demo.cobaltspeech.com:2727"
		connectTimeout = 10 * time.Second
	)

	// Setup a gRPC connection with TLS. You can optionally provide your own
	// root certificates and private key through tls.Config for mutually
	// authenticated TLS.
	tlsCfg := tls.Config{}
	creds := credentials.NewTLS(&tlsCfg)

	ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(creds),
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicegenpb.NewVoiceGenServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)
}

Client Authentication

In some setups, it may be desired that the server should also validate clients connecting to it and only respond to the ones it can verify. If your VoiceGen server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.
Please note that in the client-authentication mode, the client will still also verify the server’s certificate, and therefore this setup uses mutually authenticated TLS.
The following snippets show how to present client certificates when setting up the credentials. These could then be used in the same way as the examples above to connect to a TLS enabled server.

Python
Go

creds = grpc.ssl_channel_credentials(
  root_certificates=root_certificates,  # PEM certificate as byte string
  private_key=private_key,              # PEM client key as byte string 
  certificate_chain=certificate_chain,  # PEM client certificate as byte string
)

package main

import (
	// ...

	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"

	// ..
)

func main() {
	// ...

	// Root PEM certificate for validating self-signed server certificate
	var rootCert []byte

	// Client PEM certificate and private key.
	var certPem, keyPem []byte

	caCertPool := x509.NewCertPool()
	if ok := caCertPool.AppendCertsFromPEM(rootCert); !ok {
		fmt.Printf("unable to use given caCert\n")
		os.Exit(1)
	}

	clientCert, err := tls.X509KeyPair(certPem, keyPem)
	if err != nil {
		fmt.Printf("unable to use given client certificate and key: %v\n", err)
		os.Exit(1)
	}

	tlsCfg := tls.Config{
		RootCAs:      caCertPool,
		Certificates: []tls.Certificate{clientCert},
	}

	creds := credentials.NewTLS(&tlsCfg)

	// ...
}
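The Python snippet above expects the root certificate, client certificate and client key as PEM byte strings. A minimal sketch for reading them from disk is shown below; the function name `load_client_tls_material` and any file names you pass to it are hypothetical, and the PEM check is only a basic sanity test.

```python
from pathlib import Path
from typing import Dict


def load_client_tls_material(ca_path: str, cert_path: str, key_path: str) -> Dict[str, bytes]:
    """Read PEM files as bytes, keyed by grpc.ssl_channel_credentials argument names."""
    material = {
        "root_certificates": Path(ca_path).read_bytes(),
        "certificate_chain": Path(cert_path).read_bytes(),
        "private_key": Path(key_path).read_bytes(),
    }
    for name, data in material.items():
        # PEM blocks start with a "-----BEGIN ...-----" marker.
        if b"-----BEGIN" not in data:
            raise ValueError(f"{name} does not look like PEM data")
    return material
```

Because the dictionary keys match the keyword arguments, the result can be passed as `grpc.ssl_channel_credentials(**material)`.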

4 - Streaming Synthesis

Describes how to submit text to VoiceGen server for streaming synthesis.

The following example shows how to synthesize streaming audio from text using VoiceGen’s StreamingSynthesize request. The audio can be played back as it is being streamed as well as being saved to a file or buffer.

Synthesizing streaming audio and writing to a file

We support streaming several headered file formats including WAV, MP3, FLAC etc. as well as streaming raw audio samples. For more details, please see the protocol buffer specification here.
The examples below show how to submit a chunk of text and receive streaming audio which is written to a file. We will query the server for available models and use the first model for synthesis.
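When sizing buffers or estimating file sizes for uncompressed output, the data rate follows directly from the sample rate, channel count and bit depth. The sketch below works through that arithmetic for PCM audio; it does not apply to compressed codecs such as MP3 or FLAC, and the function names are hypothetical.

```python
def pcm_bytes_per_second(sample_rate: int, channels: int, bit_depth: int) -> int:
    """Bytes of uncompressed PCM audio produced per second of speech."""
    if bit_depth % 8 != 0:
        raise ValueError("bit_depth must be a multiple of 8")
    return sample_rate * channels * (bit_depth // 8)


def pcm_duration_seconds(num_bytes: int, sample_rate: int, channels: int, bit_depth: int) -> float:
    """Duration represented by num_bytes of PCM samples (headers excluded)."""
    return num_bytes / pcm_bytes_per_second(sample_rate, channels, bit_depth)
```

For example, 16-bit mono audio at 22050 Hz works out to 22050 × 1 × 2 = 44100 bytes per second.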

Python
Go

import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceGenServiceStub(channel)

# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicegen.ListModelsRequest())

# A model may be a single-speaker model or a multi-speaker model.
# The speakers available for a model will be printed in the model
# attributes below.
print("Models:")
for model in modelResp.models:
    print(model)

# Going with the first model in this example. Also using the first
# speaker available in the model (in case of single-speaker models,
# it is the *only* speaker).
model = modelResp.models[0]
spk = model.attributes.speakers[0]

# Set the synthesis config.
# 
# - We could set speaker_id to None to let the server use the default
#   speaker configured on the server side.
#
# - We are specifying the output audio format to be WAV with 16 bit signed
#   samples, at the model's native sampling rate.
cfg = voicegen.SynthesisConfig(
    model_id=model.id,
    speaker_id=spk.id,
    audio_format=voicegen.AudioFormat(
        codec=voicegen.AUDIO_CODEC_WAV,
        sample_rate=model.attributes.native_audio_format.sample_rate,
        encoding=voicegen.AUDIO_ENCODING_SIGNED,
        bit_depth=16,
        channels=1,
        byte_order=voicegen.BYTE_ORDER_LITTLE_ENDIAN,
    ),
)

# Specifying text to synthesize, which could be a single line or multiple paragraphs.
# VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
# any line breaks specified in the input text. We intentionally put line breaks here
# to make it look a bit nicer in the code, which are replaced with spaces.
text = voicegen.SynthesisText(text='''
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.

The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.

Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
'''.replace("\n", " "))

# Submitting request to the server and writing streamed audio chunks to file.
print("Synthesizing ...")
request = voicegen.StreamingSynthesizeRequest(config=cfg, text=text)
with open("output.wav", "wb") as f:
    for resp in client.StreamingSynthesize(request):
        f.write(resp.audio.data)

package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicegenpb.NewVoiceGenServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicegenpb.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	// A model may be a single-speaker model or a multi-speaker model.
	// The speakers available for a model will be printed in the model
	// attributes below.
	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Going with the first model in this example. Also using the first
	// speaker available in the model (in case of single-speaker models,
	// it is the *only* speaker).
	model := modelResp.Models[0]
	spk := model.Attributes.Speakers[0]

	// Set the synthesis config.
	//
	//   - We could leave SpeakerId unset to let the server use the default
	//     speaker configured on the server side.
	//
	//   - We are specifying the output audio format to be WAV with 16 bit signed
	//     samples, at the model's native sampling rate.
	cfg := &voicegenpb.SynthesisConfig{
		ModelId:   model.Id,
		SpeakerId: spk.Id,
		AudioFormat: &voicegenpb.AudioFormat{
			Codec:      voicegenpb.AudioCodec_AUDIO_CODEC_WAV,
			SampleRate: model.Attributes.NativeAudioFormat.SampleRate,
			Encoding:   voicegenpb.AudioEncoding_AUDIO_ENCODING_SIGNED,
			BitDepth:   16,
			Channels:   1,
			ByteOrder:  voicegenpb.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
		},
	}

	// Specifying text to synthesize, which could be a single line or multiple paragraphs.
	// VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
	// any line breaks specified in the input text. We intentionally put line breaks here
	// to make it look a bit nicer in the code, which are replaced with spaces.
	text := &voicegenpb.SynthesisText{Text: strings.ReplaceAll(`
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.

The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.

Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
`, "\n", " ")}

	// Submitting request to the server and writing streamed audio chunks to file.
	fmt.Println("Synthesizing ...")
	stream, err := client.StreamingSynthesize(context.Background(), &voicegenpb.StreamingSynthesizeRequest{Config: cfg, Text: text})
	if err != nil {
		fmt.Printf("failed to start synthesis stream: %v\n", err)
		os.Exit(1)
	}

	// Opening output audio file.
	outF, err := os.Create("output.wav")
	if err != nil {
		fmt.Printf("failed to open output audio file: %v\n", err)
		os.Exit(1)
	}

	defer outF.Close()

	// Receiving audio and writing to file.
	for {
		resp, err := stream.Recv()
		if errors.Is(err, io.EOF) {
			return
		}

		if err != nil {
			fmt.Printf("error encountered while synthesizing: %v\n", err)
			os.Exit(1)
		}

		audio := resp.GetAudio()
		if audio == nil {
			fmt.Println("error encountered while synthesizing: server returned nil audio")
			os.Exit(1)
		}

		// Checking the write error so a full disk or closed file is not silently ignored.
		if _, err := outF.Write(audio.Data); err != nil {
			fmt.Printf("failed to write audio to file: %v\n", err)
			os.Exit(1)
		}
	}
}
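After either example finishes, the header of output.wav can be inspected with Python's standard `wave` module to confirm it matches the requested format. This is a sketch under the assumption that the streamed file carries a conventional WAV header with valid chunk sizes; if the server writes placeholder sizes for streaming, `wave.open` may reject the file. The function name `describe_wav` is hypothetical.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read basic format fields from a WAV file header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            # getsampwidth() reports bytes per sample.
            "bit_depth": wav.getsampwidth() * 8,
        }
```

With the configuration used above, the expected values are one channel and a bit depth of 16, with the sample rate equal to the model's native rate.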

Synthesizing streaming audio with live playback

The synthesized audio stream can be played back live instead of saving it to a file by writing the data to an appropriate interface that can do the playback; typically this requires interaction with system libraries. Another option is to pipe the audio out to an external command line tool like sox.
The examples below use the latter approach by using the play command provided with sox to play the synthesized audio stream live.
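Because this approach depends on an external tool, it is worth checking that the play command is actually on the PATH before starting synthesis. The following minimal sketch uses `shutil.which` from the standard library; the function name `find_play_command` is hypothetical.

```python
import shutil
from typing import Optional


def find_play_command(name: str = "play") -> Optional[str]:
    """Return the full path of the playback command, or None if not installed."""
    return shutil.which(name)
```

A `None` result means sox (or at least its play command) is not installed or not on the PATH, so the playback subprocess would fail to start.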

Python
Go

import subprocess
import grpc
import cobaltspeech.voicegen.v1.voicegen_pb2_grpc as stub
import cobaltspeech.voicegen.v1.voicegen_pb2 as voicegen

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceGenServiceStub(channel)

# Get server version.
versionResp = client.Version(voicegen.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicegen.ListModelsRequest())

# A model may be a single-speaker model or a multi-speaker model.
# The speakers available for a model will be printed in the model
# attributes below.
print("Models:")
for model in modelResp.models:
    print(model)

# Going with the first model in this example. Also using the first
# speaker available in the model (in case of single-speaker models,
# it is the *only* speaker).
model = modelResp.models[0]
spk = model.attributes.speakers[0]

# Set the synthesis config.
# 
# - We could set speaker_id to None to let the server use the default
#   speaker configured on the server side.
#
# - We are specifying the output audio format to be WAV with 16 bit signed
#   samples, at the model's native sampling rate.
cfg = voicegen.SynthesisConfig(
    model_id=model.id,
    speaker_id=spk.id,
    audio_format=voicegen.AudioFormat(
        codec=voicegen.AUDIO_CODEC_WAV,
        sample_rate=model.attributes.native_audio_format.sample_rate,
        encoding=voicegen.AUDIO_ENCODING_SIGNED,
        bit_depth=16,
        channels=1,
        byte_order=voicegen.BYTE_ORDER_LITTLE_ENDIAN,
    ),
)

# Specifying text to synthesize, which could be a single line or multiple paragraphs.
# VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
# any line breaks specified in the input text. We intentionally put line breaks here
# to make it look a bit nicer in the code, which are replaced with spaces.
text = voicegen.SynthesisText(text='''
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.

The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.

Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
'''.replace("\n", " "))

# Open playback stream using sox's play command as a subprocess.
cmd = ["play", "-t", "wav", "-"]
play = subprocess.Popen(cmd, stdin=subprocess.PIPE)
out = play.stdin

# Submitting request to the server and writing streamed audio chunks to playback stream.
print("Synthesizing ...")
for resp in client.StreamingSynthesize(voicegen.StreamingSynthesizeRequest(config=cfg, text=text)):
    out.write(resp.audio.data)

# Closing stdin signals end of input; then wait for playback to finish.
out.close()
play.wait()

package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"

	"golang.org/x/sync/errgroup"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicegenpb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicegen/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicegenpb.NewVoiceGenServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicegenpb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicegenpb.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	// A model may be a single-speaker model or a multi-speaker model.
	// The speakers available for a model will be printed in the model
	// attributes below.
	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Going with the first model in this example. Also using the first
	// speaker available in the model (in case of single-speaker models,
	// it is the *only* speaker).
	model := modelResp.Models[0]
	spk := model.Attributes.Speakers[0]

	// Set the synthesis config.
	//
	//   - We could leave SpeakerId unset to let the server use the default
	//     speaker configured on the server side.
	//
	//   - We are specifying the output audio format to be WAV with 16 bit signed
	//     samples, at the model's native sampling rate.
	cfg := &voicegenpb.SynthesisConfig{
		ModelId:   model.Id,
		SpeakerId: spk.Id,
		AudioFormat: &voicegenpb.AudioFormat{
			Codec:      voicegenpb.AudioCodec_AUDIO_CODEC_WAV,
			SampleRate: model.Attributes.NativeAudioFormat.SampleRate,
			Encoding:   voicegenpb.AudioEncoding_AUDIO_ENCODING_SIGNED,
			BitDepth:   16,
			Channels:   1,
			ByteOrder:  voicegenpb.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
		},
	}

	// Specifying text to synthesize, which could be a single line or multiple paragraphs.
	// VoiceGen breaks up the text based on its sentence segmentation algorithm as well as
	// any line breaks specified in the input text. We intentionally put line breaks here
	// to make it look a bit nicer in the code, which are replaced with spaces.
	text := &voicegenpb.SynthesisText{Text: strings.ReplaceAll(`
The world's first 3D printed rocket launched successfully on Wednesday, marking
a step forward for the California company behind the innovative spacecraft,
though it failed to reach orbit.

The successful launch came on the third attempt. It had originally been
scheduled to launch on March 8 but was postponed at the last minute because of
propellant temperature issues. A second attempt on March 11 was scrubbed because of
fuel pressure problems.

Had Terran 1 reached low Earth orbit, it would have been the first privately
funded vehicle using methane fuel to do so on its first try, according to
Relativity.
`, "\n", " ")}

	// Starting routines to receive audio from server and write to playback stream;
	// using an errgroup.Group that returns if either one encounters an error.
	eg, ctx := errgroup.WithContext(context.Background())

	// Submitting request to the server; streamed audio chunks are written to the playback stream below.
	fmt.Println("Synthesizing ...")
	stream, err := client.StreamingSynthesize(ctx, &voicegenpb.StreamingSynthesizeRequest{Config: cfg, Text: text})
	if err != nil {
		fmt.Printf("failed to start synthesis stream: %v\n", err)
		os.Exit(1)
	}

	// Open playback stream using sox's play command as a subprocess.
	cmd := exec.CommandContext(ctx, "play", "-t", "wav", "-")
	cmd.Stderr = os.Stderr

	outW, err := cmd.StdinPipe()
	if err != nil {
		fmt.Printf("failed to open playback stream: %v\n", err)
		os.Exit(1)
	}

	eg.Go(func() error {
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("error encountered in audio playback: %w", err)
		}

		return nil
	})

	eg.Go(func() error {
		defer outW.Close()

		// Receiving audio and writing to playback stream.
		for {
			resp, err := stream.Recv()
			if errors.Is(io.EOF, err) {
				return nil
			}

			if err != nil {
				return fmt.Errorf("error encountered while synthesizing: %w", err)
			}

			audio := resp.GetAudio()
			if audio == nil {
				return fmt.Errorf("error encountered while synthesizing: server returned nil audio")
			}

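			if _, err := outW.Write(audio.Data); err != nil {
				return fmt.Errorf("failed to write audio to playback stream: %w", err)
			}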
		}
	})

	if err := eg.Wait(); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}

5 - API Reference

Detailed reference for API requests and types.

The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.

This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.

Table of Contents
VoiceGenService
Messages
Enums
Scalar Value Types

VoiceGenService

Service that implements the Cobalt VoiceGen API.

Version

Version(VersionRequest) VersionResponse

Returns version information from the server.

ListModels

ListModels(ListModelsRequest) ListModelsResponse

ListModels returns information about the models the server can access.

StreamingSynthesize

StreamingSynthesize(StreamingSynthesizeRequest) StreamingSynthesizeResponse

Performs text to speech synthesis and streams the synthesized audio. This method is only available via gRPC and not via HTTP+JSON. However, a web browser may use websockets to use this service.

Messages

If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of the fields populated.
If a field is labeled repeated, then the generated code will accept an array (or slice or list, depending on the language).

AudioFormat

Details of the audio format.

Fields

sample_rate (uint32 ) Sampling rate in Hz.
channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc.
bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.).
codec (AudioCodec ) Codec of the samples.
encoding (AudioEncoding ) Encoding of the samples.
byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.
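
The sample_rate, channels and bit_depth fields together determine how many bytes one second of uncompressed PCM audio occupies. The sketch below illustrates that arithmetic. The audioFormat struct is a hypothetical local mirror of these three fields (not the generated SDK type), and the 16-bit mono values in main are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// audioFormat is a hypothetical local mirror of three AudioFormat fields;
// the generated SDK type may name or expose them differently.
type audioFormat struct {
	SampleRate uint32 // sampling rate in Hz
	Channels   uint32 // number of channels
	BitDepth   uint32 // bits per sample
}

// bytesPerSecond returns the size in bytes of one second of uncompressed PCM.
func bytesPerSecond(f audioFormat) uint64 {
	return uint64(f.SampleRate) * uint64(f.Channels) * uint64(f.BitDepth/8)
}

// duration estimates how long n bytes of headerless PCM audio will play.
func duration(f audioFormat, n uint64) time.Duration {
	bps := bytesPerSecond(f)
	if bps == 0 {
		return 0
	}
	return time.Duration(float64(n) / float64(bps) * float64(time.Second))
}

func main() {
	// Assumed values: 22050 Hz, mono, 16-bit samples.
	f := audioFormat{SampleRate: 22050, Channels: 1, BitDepth: 16}
	fmt.Println(bytesPerSecond(f))  // 44100
	fmt.Println(duration(f, 88200)) // 2s
}
```

This kind of estimate only applies to headerless PCM data; if the audio carries a WAV header, the header bytes would need to be excluded first.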

ListModelsRequest

The top-level message sent by the client for the ListModels method.

ListModelsResponse

The message returned to the client by the ListModels method.

Fields

models (ModelInfo repeated) List of models available for use on the VoiceGen server.

ModelAttributes

Attributes of a VoiceGen Model

Fields

language (string ) Language of the model.
phone_set (PhoneSet ) The set of phonemes this model uses to represent how words should be pronounced.
native_audio_format (AudioFormat ) Native audio format of the model. This will be used as the default value if the audio format in SynthesisConfig is not specified.
supported_features (ModelFeatures ) Supported model features.
speakers (SpeakerInfo repeated) List of speakers available for use in this model.

ModelFeatures

Fields

speech_rate (bool ) This is set to true if the model can be configured to synthesize audio at different talking speeds.
variation_scale (bool ) This is set to true if the model can be configured to synthesize audio for a given text input differently than usual by varying the stress and emphasis on different parts of the audio. This feature is useful for making the audio sound slightly different each time, so that repeated output does not feel monotonous.

ModelInfo

Description of a Cobalt VoiceGen Model

Fields

id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for synthesis, and is specified in the SynthesisConfig message.
name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their synthesis task.
attributes (ModelAttributes ) Model attributes.

SpeakerAttributes

Attributes of a speaker

Fields

language (string ) Language of the speaker. This can be different from the model language, e.g. an English model with different accents: en-US, en-GB, en-IN, etc.

SpeakerInfo

Description of a speaker

Fields

id (string ) Unique identifier of the speaker. This identifier is used to choose the speaker that should be used for synthesis, and is specified in the SynthesisConfig message.
name (string ) Speaker name. This is a concise name describing the speaker, and may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
description (string ) Speaker description. This may be presented to the end-user, for example, to help choose which speaker to use for their synthesis task.
attributes (SpeakerAttributes ) Speaker attributes.

StreamingSynthesizeRequest

The top-level message sent by the client for the StreamingSynthesize method.

Fields

config (SynthesisConfig )
text (SynthesisText )

StreamingSynthesizeResponse

The top-level message sent by the server for the StreamingSynthesize method. In this streaming call, the server sends multiple StreamingSynthesizeResponse messages, each containing SynthesizedAudio.

Fields

audio (SynthesizedAudio )

SynthesisConfig

Configuration for setting up a Synthesizer

Fields

model_id (string ) Unique identifier of the model to use, as obtained from a ModelInfo message.
speaker_id (string ) Unique identifier of the speaker to use, as obtained from a SpeakerInfo message.
audio_format (AudioFormat ) Format of the synthesized audio. If no value is specified, the native audio format of the specified model is used. The native audio format can be obtained from the ModelAttributes message.
speech_rate (float ) The speech rate for synthesized audio. If unset, then the default speech rate of a given model is used. Otherwise a value > 0 should be used, with higher values resulting in faster speech. This field only has an effect on the synthesized audio if the model supports it, which can be ascertained from the ModelAttributes.supported_features.
variation_scale (float ) A scale with values > 0, to determine how much to randomly vary the synthesized audio by altering stresses and emphasis on different parts of the audio. Higher values correspond to greater variation. This field only has an effect on the synthesized audio if the model supports it, which can be ascertained from the ModelAttributes.supported_features.
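
Because speech_rate and variation_scale only take effect when the model supports them, a client may want to check a configuration against ModelAttributes.supported_features before sending it. The following is a minimal sketch of such a check. The modelFeatures and synthesisConfig structs are hypothetical local stand-ins for the generated types, and treating a zero value as "unset" is an assumption based on protobuf scalar defaults.

```go
package main

import (
	"errors"
	"fmt"
)

// modelFeatures and synthesisConfig are hypothetical local stand-ins for the
// ModelFeatures and SynthesisConfig messages; the generated SDK types differ.
type modelFeatures struct {
	SpeechRate     bool
	VariationScale bool
}

type synthesisConfig struct {
	SpeechRate     float32 // 0 is assumed to mean "unset"
	VariationScale float32 // 0 is assumed to mean "unset"
}

// checkConfig rejects negative values and reports which fields are set but
// would have no effect because the model does not support them.
func checkConfig(cfg synthesisConfig, feat modelFeatures) (ignored []string, err error) {
	if cfg.SpeechRate < 0 || cfg.VariationScale < 0 {
		return nil, errors.New("speech_rate and variation_scale must be > 0 when set")
	}
	if cfg.SpeechRate > 0 && !feat.SpeechRate {
		ignored = append(ignored, "speech_rate")
	}
	if cfg.VariationScale > 0 && !feat.VariationScale {
		ignored = append(ignored, "variation_scale")
	}
	return ignored, nil
}

func main() {
	cfg := synthesisConfig{SpeechRate: 1.5, VariationScale: 0.8}
	feat := modelFeatures{SpeechRate: true} // model without variation_scale support
	ignored, err := checkConfig(cfg, feat)
	fmt.Println(ignored, err) // [variation_scale] <nil>
}
```

In a real client the feature flags would come from the ModelInfo entries returned by ListModels rather than being constructed by hand.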

SynthesisText

Text input to be sent to the synthesizer

Fields

text (string )

SynthesizedAudio

Synthesized audio from the synthesizer

Fields

data (bytes )

VersionRequest

The top-level message sent by the client for the Version method.

VersionResponse

The top-level message sent by the server for the Version method.

Fields

version (string ) Version of the server handling these requests.

Enums

AudioCodec

The codec (container format) of the synthesized audio data.

Name	Number	Description
AUDIO_CODEC_UNSPECIFIED	0	AUDIO_CODEC_UNSPECIFIED is the default value of this type.
AUDIO_CODEC_RAW	2	Raw data without any headers
AUDIO_CODEC_WAV	1	WAV with RIFF headers
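
To illustrate the difference between the two codecs, the sketch below builds a canonical 44-byte RIFF/WAVE header that could be prepended to headerless integer PCM data, for example when audio was requested as raw and should later be saved as a WAV file. The wavHeader helper is hypothetical and not part of the SDK. It assumes little-endian signed or unsigned integer PCM, and it does not claim to reproduce the exact header the server emits.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// wavHeader builds a canonical 44-byte RIFF/WAVE header for integer PCM data
// of dataLen bytes. It is a hypothetical helper, not part of the SDK.
func wavHeader(sampleRate, channels, bitDepth, dataLen uint32) []byte {
	blockAlign := channels * bitDepth / 8 // bytes per sample frame
	le := binary.LittleEndian

	var buf bytes.Buffer
	buf.WriteString("RIFF")
	binary.Write(&buf, le, uint32(36+dataLen)) // size of everything after this field
	buf.WriteString("WAVEfmt ")
	binary.Write(&buf, le, uint32(16))            // fmt chunk size for PCM
	binary.Write(&buf, le, uint16(1))             // audio format 1 = integer PCM
	binary.Write(&buf, le, uint16(channels))      // number of channels
	binary.Write(&buf, le, sampleRate)            // sample rate in Hz
	binary.Write(&buf, le, sampleRate*blockAlign) // byte rate
	binary.Write(&buf, le, uint16(blockAlign))
	binary.Write(&buf, le, uint16(bitDepth))
	buf.WriteString("data")
	binary.Write(&buf, le, dataLen) // size of the PCM payload
	return buf.Bytes()
}

func main() {
	pcm := make([]byte, 4) // placeholder for raw PCM received from the server
	header := wavHeader(22050, 1, 16, uint32(len(pcm)))
	fmt.Println(len(header), string(header[:4])) // 44 RIFF
}
```

Since the header records the payload length, it can only be written once all audio has been received, which is one reason a client may prefer raw audio while streaming.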

AudioEncoding

The encoding of the synthesized audio samples.

Name	Number	Description
AUDIO_ENCODING_UNSPECIFIED	0	AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error.
AUDIO_ENCODING_SIGNED	1	PCM signed-integer
AUDIO_ENCODING_UNSIGNED	2	PCM unsigned-integer
AUDIO_ENCODING_IEEE_FLOAT	3	PCM IEEE-Float
AUDIO_ENCODING_ULAW	4	G.711 mu-law
AUDIO_ENCODING_ALAW	5	G.711 a-law

ByteOrder

Byte order of multi-byte data

Name	Number	Description
BYTE_ORDER_UNSPECIFIED	0	BYTE_ORDER_UNSPECIFIED is the default value of this type.
BYTE_ORDER_LITTLE_ENDIAN	1	Little Endian byte order
BYTE_ORDER_BIG_ENDIAN	2	Big Endian byte order
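
Byte order matters whenever a sample spans more than one byte. As a rough illustration, the sketch below decodes 16-bit signed PCM samples from a byte slice in either order using Go's encoding/binary package. The decodeInt16 helper and its bigEndian flag are hypothetical names introduced for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeInt16 interprets data as consecutive 16-bit signed PCM samples in the
// given byte order. A trailing odd byte, if any, is ignored.
func decodeInt16(data []byte, bigEndian bool) []int16 {
	var order binary.ByteOrder = binary.LittleEndian
	if bigEndian {
		order = binary.BigEndian
	}

	samples := make([]int16, len(data)/2)
	for i := range samples {
		samples[i] = int16(order.Uint16(data[2*i:]))
	}
	return samples
}

func main() {
	data := []byte{0x01, 0x00, 0xFF, 0x7F}
	fmt.Println(decodeInt16(data, false)) // [1 32767]
	fmt.Println(decodeInt16(data, true))  // [256 -129]
}
```

The same four bytes yield very different sample values depending on the order, which is why the byte_order field must be set when the bit depth is greater than 8.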

PhoneSet

PhoneSet is the set of phonemes used to represent word pronunciations.

Name	Number	Description
PHONE_SET_UNSPECIFIED	0	PHONE_SET_UNSPECIFIED is the default value of this type.
PHONE_SET_IPA	1	IPA phoneme set
PHONE_SET_XSAMPA	2	X-SAMPA phoneme set
PHONE_SET_ARPABET	3	ARPAbet phoneme set

Scalar Value Types

.proto Type	C++ Type	C# Type	Go Type	Java Type	PHP Type	Python Type	Ruby Type
double	double	double	float64	double	float	float	Float
float	float	float	float32	float	float	float	Float
int32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
int64	int64	long	int64	long	integer/string	int/long	Bignum
uint32	uint32	uint	uint32	int	integer	int/long	Bignum or Fixnum (as required)
uint64	uint64	ulong	uint64	long	integer/string	int/long	Bignum or Fixnum (as required)
sint32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sint64	int64	long	int64	long	integer/string	int/long	Bignum
fixed32	uint32	uint	uint32	int	integer	int	Bignum or Fixnum (as required)
fixed64	uint64	ulong	uint64	long	integer/string	int/long	Bignum
sfixed32	int32	int	int32	int	integer	int	Bignum or Fixnum (as required)
sfixed64	int64	long	int64	long	integer/string	int/long	Bignum
bool	bool	bool	bool	boolean	boolean	boolean	TrueClass/FalseClass
string	string	string	string	String	string	str/unicode	String (UTF-8)
bytes	string	ByteString	[]byte	ByteString	string	str	String (ASCII-8BIT)
