1 - Privacy Screen

Streaming audio and text redaction engine.

Demo

Cobalt’s Privacy Screen engine can redact various categories of sensitive information automatically from text and audio. Every business that collects or deals with personal data should redact sensitive information in order to protect customer privacy, comply with laws and regulations, and discover new business opportunities.

Privacy Screen makes audio and text redaction possible in real time by combining our low-latency, accurate speech recognition engine, Transcribe, with a robust redaction backend engine that identifies several types of sensitive or confidential information. The main categories are:

  • Personally Identifiable Information (PII), such as names, addresses, phone numbers, etc.
  • Protected Health Information (PHI), such as medical conditions, injuries, names of medication, etc.
  • Payment Card Industry (PCI) information, such as credit card and bank details.

A detailed list of all the categories that are identified by Privacy Screen can be found in the Redaction Categories section (1.6).

How redaction works

Sensitive information redaction typically works as a two-step process. First, a machine learning model detects and classifies the desired entities in the text. Then, this classification is used to determine whether each entity needs to be redacted; if it does, the entity is replaced with an entity label in the redacted transcript. Currently, Cobalt uses a state-of-the-art deep neural network (DNN) model for PII, PHI, and PCI redaction.
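The two-step flow can be sketched in a few lines of Python. This is a toy illustration only: the regex "detectors" below stand in for Cobalt's DNN model, and all function names are invented for this example.

```python
import re

# Toy stand-in for the two-step flow: the regex "detectors" below play
# the role of Cobalt's DNN classifier; only the interface is realistic.
DETECTORS = {
    "SSN": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}

def detect_entities(text):
    """Step 1: detect and classify candidate entities."""
    found = []
    for label, pattern in DETECTORS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label))
    return sorted(found)

def redact(text, classes):
    """Step 2: replace entities of the requested classes with a label token."""
    out, last = [], 0
    for start, end, label in detect_entities(text):
        if label in classes:
            out.append(text[last:start])
            out.append(f"[{label}]")
            last = end
    out.append(text[last:])
    return "".join(out)
```

For example, `redact("My SSN is 078-05-1120.", {"SSN"})` yields `My SSN is [SSN].`, while entities whose class was not requested are left untouched.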

There are three ways to use Cobalt’s redaction solution:

  • Redact PII from a text transcript
  • Redact PII from an audio file
  • Redact PII from an audio file with a text transcript

Each of these services can be used in two operating modes:

  • Streaming mode: Redaction will run utterance by utterance, and output will be streamed out as soon as the result is ready.
  • Batch mode: All input audio/transcript will be processed and redacted in one batch, and the output will be available at the end of the process.
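The difference between the two modes can be sketched as follows; `redact_utterance` is a placeholder for a real redaction call, not part of the API.

```python
# redact_utterance is a placeholder; a real client would call the server.
def redact_utterance(utt):
    return utt.replace("Robert", "[NAME]")

def redact_streaming(utterances):
    """Streaming mode: emit each redacted utterance as soon as it is ready."""
    for utt in utterances:
        yield redact_utterance(utt)

def redact_batch(utterances):
    """Batch mode: process all input, return everything at the end."""
    return [redact_utterance(utt) for utt in utterances]
```

Streaming returns a generator a caller can consume utterance by utterance; batch returns the whole result at once.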

Redact PII from a text transcript

In this use case, you can identify and redact sensitive PII from an input text transcript. Detected PII entities are replaced with an appropriate PII token in the redacted text transcript. Both the input and redacted transcripts are specified as JSON with a list of utterances. Each utterance contains a list of words, and each word has:

  • Text
  • Redaction class
  • Redaction confidence score

You can specify the desired redaction classes applicable for your use case in the config file.
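As a sketch of how a caller might consume such a transcript, the snippet below collects the words whose redaction class and confidence meet the caller's criteria. The snake_case field names are assumed from this document's JSON examples; the confidence threshold is illustrative, not part of the API.

```python
# Field names assume the snake_case form used in this document's JSON
# examples; min_confidence is an illustrative client-side threshold.
def flagged_words(transcript, wanted_classes, min_confidence=0.5):
    """Collect words the caller would treat as redacted."""
    hits = []
    for utt in transcript["utterances"]:
        for word in utt["words"]:
            if (word.get("redaction_class") in wanted_classes
                    and word.get("redaction_confidence", 0.0) >= min_confidence):
                hits.append(word["text"])
    return hits
```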

Redact PII from an audio file

In this use case, the input audio file is first transcribed using Cobalt’s Transcribe API, and text redaction is then applied to the ASR-generated transcript. Detected PII entities are replaced with an appropriate PII token in the redacted text transcript. In the output, you can get:

  • Redacted text transcript
  • Unredacted text transcript
  • Redacted audio file where the PII has been masked with a beep sound

The redacted text transcript contains a redaction confidence score, ASR confidence score, and associated starting and ending timestamps for each utterance and/or word.

Redact PII from an audio file with a text transcript

In this use case, an audio file and its associated transcript are given as input in order to get the redacted transcript and redacted audio file as output. The input transcript should be specified as JSON with a list of utterances:

  • Each utterance has:
    • Audio Channel in the audio file. Indexed from 0
    • A list of words. Each word has:
      • Text
      • Timestamp in the audio file where this word starts (in milliseconds)
      • Duration of this word in the audio file (in milliseconds)

The output transcript has the same format as the input, except that each word has extra fields such as “redaction_class”, “redaction_confidence”, and “is_redacted”.
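Given such an output transcript, the audio spans to mask can be derived from the word timestamps. The sketch below is an assumption about how a client might compute beep intervals; merging touching spans is a design choice, not part of the API.

```python
# Derive (start_ms, end_ms) spans to beep out from a redacted transcript,
# using the per-word timestamps; merging touching spans is a design choice.
def beep_intervals(transcript):
    spans = []
    for utt in transcript["utterances"]:
        for w in utt["words"]:
            if w.get("is_redacted"):
                start = w["start_time_ms"]
                end = start + w["duration_ms"]
                if spans and start <= spans[-1][1]:
                    spans[-1] = (spans[-1][0], max(spans[-1][1], end))
                else:
                    spans.append((start, end))
    return spans
```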

Text Redaction

Here is an example of text redaction:

| Raw text | Redacted text |
| --- | --- |
| Good morning, everybody. My name is Robert, and today I am going to share some personal information with you. I live at 123 Park Ave Apt 123 New York City, NY 10002. My Social Security number is 999999999, credit card number is 6666666666666666, and CVV code is 777. I love cats. | Good morning, everybody. My name is [NAME], and today I am going to share some personal information with you. I live at [LOCATION_ADDRESS] [LOCATION_CITY], [LOCATION_ZIP]. My Social Security number is [SSN], credit card number is [CREDIT_CARD], and CVV code is [CVV]. I love cats. |

System requirements

Minimum requirements

| | Minimum | Recommended (Text only) | Recommended (All Features) | Recommended Concurrency |
| --- | --- | --- | --- | --- |
| CPU | Any x86 (Intel or AMD) processor with 6GB RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 16GB RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 64GB RAM and 100GB disk volume | 2 per instance (see Concurrency) |
| GPU | Any x86 (Intel or AMD) processor with 28GB RAM, Nvidia GPU with compute capability 7.0 or higher (Volta or newer) and at least 16GB VRAM, 100GB disk volume | Any x86 (Intel or AMD) processor with 32GB RAM and Nvidia Tesla T4 GPU, 100GB disk volume | Any x86 (Intel or AMD) processor with 64GB RAM and Nvidia Tesla T4 GPU, 100GB disk volume | 32 per instance (see Concurrency) |
CPU deployments:

| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
| --- | --- | --- |
| Azure | Standard_E2_v5 (2 vCPUs, 16GB RAM) | Standard_E8_v5 (8 vCPUs, 64GB RAM) |
| AWS | m7i.large (2 vCPUs, 8GB RAM) | m7i.4xlarge (16 vCPUs, 64GB RAM) |
| GCP | n2-standard-2 (2 vCPUs, 8GB RAM) | n2-standard-16 (16 vCPUs, 64GB RAM) |

GPU deployments:

| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
| --- | --- | --- |
| Azure | Standard_NC8as_T4_v3 | Standard_NC8as_T4_v3 |
| AWS | g4dn.2xlarge | g4dn.4xlarge |
| GCP | n1-standard-8 + Tesla T4 | n1-standard-16 + Tesla T4 |

1.1 - Server Setup

Describes how to install Privacy Screen on your system.

Installing Cobalt Privacy Screen

Cobalt distributes a docker-compose file that orchestrates three docker images, one for each of the following services:

  • Privacy Screen Server (frontend for accepting text / audio streams)
  • Transcribe Server (for recognizing text in audio files)
  • Redaction Backend Engine (for redacting text data)

    flowchart LR;
        A[SDK] <--> |Audio and/or Text| B[Privacy Screen Server]
        B[Privacy Screen Server] <--> |Audio| C[Transcribe Server]
        B[Privacy Screen Server] <--> |Text| D[Redaction Backend Engine]

Having these components as separate images facilitates large deployments where each image can be auto-scaled independently based on request traffic.

Installing Server

  1. Contact Cobalt to get a link to the image files in AWS S3 and the docker-compose configuration file. This link will expire in two weeks, so be sure to download the file to your own server.

  2. Download with the AWS CLI if you have it, or with curl:

    URL="the url sent by Cobalt"
    FILE_NAME="name you want to give the file (should end with the same extension as the url, usually tar.bz2)"
    curl -L -o "$FILE_NAME" "$URL"
    
  3. Untar the file, and load the docker images. The tar file will also contain the docker-compose.yaml file.

    tar -xvjf "$FILE_NAME" -C ./
    # load each image archive (docker load accepts bzip2-compressed tars)
    for image in *.bz2; do docker load -i "$image"; done
    
  4. Copy the cobalt license file into the server folder

  5. Copy the deid license file into the server folder

  6. Start the services using docker-compose:

    docker-compose up --build
    

The server will be running in the container and listening on port 2728 for gRPC requests from clients.

1.2 - Connecting to the Server

Describes how to connect to a running Cobalt Privacy Screen server instance.

Once you have the Cobalt Privacy Screen server up and running, you are ready to create a client connection.

First, you need to know the address (host:port) where the server is running. This document will assume the value 127.0.0.1:9002, but this can be replaced with your actual server address in the code.

Default Connection

The following code snippet connects to the server and queries its version. It uses our recommended default setup, expecting the server to be listening on a TLS encrypted connection.

Go:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/cobaltspeech/sdk-trifid/grpc/go-trifid"
)

const serverAddr = "127.0.0.1:9002"

func main() {
	client, err := trifid.NewClient(serverAddr)
	if err != nil {
		log.Fatal(err)
	}

	// Be sure to close the client when we are done with it.
	defer client.Close()
}

Python:

import trifid

client = trifid.Client(server_address="localhost:9002")

Insecure Connection

It is sometimes required to connect to Privacy Screen server without TLS enabled (during debugging, for example). Note that if the server has TLS enabled, attempting to connect with an insecure client will fail.

To create an insecure connection, do the following when creating the client:

Go:

client, err := trifid.NewClient(serverAddr, trifid.WithInsecure())

Python:

client = trifid.Client(server_address="localhost:9002", insecure=True)

Client Authentication

In our recommended default setup, TLS is enabled in the gRPC setup, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers.

In some setups, it may be desired that the server should also validate clients connecting to it and only respond to the ones it can verify. If your Privacy Screen server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.

Please note that in the client-authentication mode, the client will still also verify the server’s certificate, and therefore this setup uses mutually authenticated TLS. This can be done with:

Go:

// certPem and keyPem are the bytes of the client certificate and key
// provided to you.
client, err := trifid.NewClient(serverAddr, trifid.WithClientCert(certPem, keyPem))

Python:

# cert_pem and key_pem are the contents of the client certificate and key
# provided to you.
client = trifid.Client(server_address="localhost:9002", client_certificate=cert_pem, client_key=key_pem)

Server Information

The client provides two methods to get information about the server: Version and ListModels.

Version

The Version method provides information about the version of the Privacy Screen server the client is connected to, as well as information about other relevant services and packages the server uses.

Go:

// Request the server version info
ver, err := client.Version(context.Background())
fmt.Printf("Server Version: %v\n", ver)

Python:

# Request the server version info
ver = client.version()
print(f"Server Version: {ver}")

List Models

The ListModels method fetches a list of models available on the Privacy Screen server. On the server side, the models are specified as part of the server’s config file.

Go:

// Request the list of models
modelList, err := client.ListModels(context.Background())
fmt.Printf("Available Models:\n")
for _, mdl := range modelList.Models {
	fmt.Printf("  ID: %v\n", mdl.Id)
	fmt.Printf("    Name: %v\n", mdl.Name)
	fmt.Printf("    Redaction Classes: %v\n", mdl.RedactionClasses)
}

Python:

# Request the list of models
model_list = client.list_models()
print("Available Models:")
for mdl in model_list:
    print(f"  ID: {mdl.id}")
    print(f"    Name: {mdl.name}")
    print(f"    Redaction Classes: {mdl.redaction_classes}")

1.3 - Text Redaction

Describes how to submit text to Privacy Screen for redaction.

TODO

1.4 - Concurrency

Describes the recommended level of concurrency.

The recommended level of concurrency, i.e., the optimal number of simultaneous requests to make to the container, is covered below for the CPU and GPU containers. The recommended concurrency level is driven primarily by the compute requirement of the Neural Network models, such as those for PII detection.

CPU

For Neural Network inference workloads, CPUs don’t require inputs to be batched together to achieve good hardware utilization. In practice, due to network overhead and pre-/post-processing code, it is best to use a low level of concurrency, such as 2 per container instance. If latency isn’t a concern, a value of 32 is recommended.

GPU

Unlike CPUs, GPUs require inputs to be batched together and processed as a single large input to achieve optimal hardware utilization. This means that there is a tradeoff between latency and throughput. A concurrency level of 32 per container instance is a good tradeoff between latency and throughput; however, concurrency levels as low as 8 do not significantly impact throughput. If latency isn’t a concern, a value of 128 will ensure maximum hardware utilization.
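On the client side, the recommended concurrency can be enforced with a simple worker pool. The sketch below caps in-flight requests with Python's ThreadPoolExecutor; `redact_one` is a placeholder for a real client call, not part of the API.

```python
from concurrent.futures import ThreadPoolExecutor

RECOMMENDED_CONCURRENCY = 2  # per CPU container; use 32 per GPU container

# redact_one is a placeholder for a real redaction request to the server.
def redact_one(text):
    return text.upper()

def redact_all(texts, concurrency=RECOMMENDED_CONCURRENCY):
    """Send requests with at most `concurrency` in flight at a time."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(redact_one, texts))
```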

1.5 - Privacy Screen Client

Describes how to use the client binary.

This release includes a Privacy Screen client that can be used to quickly send audio/transcripts to the server. It reads audio files in WAV (PCM16SLE) format and transcripts in JSON format. Example text and JSON inputs are shown below:

Text example:

Jack and Jill went up to 224 North Hill drive to fetch a pail of water. Jack fell down broke his crown and Jill called 4125555555.

Transcript example (JSON):

{
    "utterances": [
        {
            "start_time_ms": 30,
            "duration_ms": 4230,
            "audio_channel": 0,
            "words": [
                {
                    "start_time_ms": 30,
                    "duration_ms": 390,
                    "text": "Jack"
                },
                {
                    "start_time_ms": 420,
                    "duration_ms": 120,
                    "text": "and"
                },
                {
                    "start_time_ms": 540,
                    "duration_ms": 240,
                    "text": "Jill"
                },
                {
                    "start_time_ms": 780,
                    "duration_ms": 240,
                    "text": "went"
                },
                {
                    "start_time_ms": 1020,
                    "duration_ms": 150,
                    "text": "up"
                },
                {
                    "start_time_ms": 1170,
                    "duration_ms": 60,
                    "text": "to"
                },
                {
                    "start_time_ms": 1230,
                    "duration_ms": 1080,
                    "text": "224"
                },
                {
                    "start_time_ms": 2310,
                    "duration_ms": 300,
                    "text": "North"
                },
                {
                    "start_time_ms": 2610,
                    "duration_ms": 150,
                    "text": "Hill"
                },
                {
                    "start_time_ms": 2760,
                    "duration_ms": 300,
                    "text": "drive"
                },
                {
                    "start_time_ms": 3060,
                    "duration_ms": 90,
                    "text": "to"
                },
                {
                    "start_time_ms": 3150,
                    "duration_ms": 270,
                    "text": "fetch"
                },
                {
                    "start_time_ms": 3420,
                    "duration_ms": 60,
                    "text": "a"
                },
                {
                    "start_time_ms": 3480,
                    "duration_ms": 270,
                    "text": "pail"
                },
                {
                    "start_time_ms": 3750,
                    "duration_ms": 120,
                    "text": "of"
                },
                {
                    "start_time_ms": 3870,
                    "duration_ms": 390,
                    "text": "water."
                }
            ]
        },
        {
            "start_time_ms": 9300,
            "duration_ms": 5324,
            "audio_channel": 1,
            "words": [
                {
                    "start_time_ms": 9300,
                    "duration_ms": 420,
                    "text": "Jack"
                },
                {
                    "start_time_ms": 9720,
                    "duration_ms": 210,
                    "text": "fell"
                },
                {
                    "start_time_ms": 9930,
                    "duration_ms": 420,
                    "text": "down"
                },
                {
                    "start_time_ms": 10410,
                    "duration_ms": 270,
                    "text": "broke"
                },
                {
                    "start_time_ms": 10680,
                    "duration_ms": 150,
                    "text": "his"
                },
                {
                    "start_time_ms": 10830,
                    "duration_ms": 450,
                    "text": "crown"
                },
                {
                    "start_time_ms": 11310,
                    "duration_ms": 180,
                    "text": "and"
                },
                {
                    "start_time_ms": 11490,
                    "duration_ms": 210,
                    "text": "Jill"
                },
                {
                    "start_time_ms": 11700,
                    "duration_ms": 330,
                    "text": "called"
                },
                {
                    "start_time_ms": 12030,
                    "duration_ms": 2594,
                    "text": "4125555555."
                }
            ]
        }
    ]
}
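A quick structural check of a transcript file can catch malformed input before sending it to the server. The sketch below assumes a well-formed transcript keeps each word's time span inside its utterance's span; that rule is an inference from the example above, not a documented requirement.

```python
# Check that every word's time span lies inside its utterance's span;
# this containment rule is inferred from the example, not documented.
def validate_transcript(transcript):
    for utt in transcript["utterances"]:
        utt_end = utt["start_time_ms"] + utt["duration_ms"]
        for w in utt["words"]:
            word_end = w["start_time_ms"] + w["duration_ms"]
            if w["start_time_ms"] < utt["start_time_ms"] or word_end > utt_end:
                raise ValueError(f"word {w['text']!r} outside utterance span")
    return True
```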

Examples of client calls

There are several ways the client interacts with the server. These examples are always run from the same directory as the client binary. When in doubt, run ./privacy-screen-grpc-client -h to get more information about the parameters needed to run the client.

Redact Text

./privacy-screen-grpc-client redact-text \
        --insecure \
        --model-id general \
        --input-text input.txt \
        --output-result redacted_token.json

Redact Transcript

./privacy-screen-grpc-client redact-transcript \
        --insecure \
        --model-id general \
        --input-transcript testdata/input.json \
        --output-transcript redacted_output.json

Redact Transcribed Audio

./privacy-screen-grpc-client redact-transcribed-audio \
        --insecure \
        --model-id general \
        --input-audio testdata/input.wav \
        --input-transcript testdata/input.json \
        --output-audio redacted_output.wav \
        --output-transcript redacted_output.json \
        --timeout 5m

Transcribe and Redact

./privacy-screen-grpc-client transcribe-and-redact \
        --insecure \
        --model-id en_US \
        --input-audio testdata/input.wav \
        --output-audio redacted_output.wav \
        --output-transcript redacted_output.json \
        --output-unredacted-transcript unredacted_output.json \
        --timeout 5m

1.6 - Redaction Categories

Detailed list and examples of categories that are supported.

Personally Identifiable Information (PII)

Label Description Regulatory Compliance
ACCOUNT_NUMBER Customer account or membership identification number
Policy No. 10042992; Member ID: HZ-5235-001

Note: Full support for English; Multilingual support in progress
HIPAA_SAFE_HARBOR, CCI
AGE Numbers associated with an individual’s age
27 years old; 18 months old
More details
When given in years, only the number is flagged, but both number and time unit are flagged when given in other units like months or weeks
Also includes age ranges:
29-35 years old; 18+; A man in his forties
GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
DATE Specific calendar dates, which can include days of the week, dates, months, or years
Friday, Dec. 18, 2002; Dated: 02/03/97

See also: DATE_INTERVAL, DOB
More details
If no calendar date is specified, days of the week are not flagged:
Your appointment is on Monday
Indexical terms are not flagged:
yesterday; tomorrow
HIPAA_SAFE_HARBOR, Quebec Privacy Act, CCI
DATE_INTERVAL Broader time periods, including date ranges, months, seasons, years, and decades
2020-2021; 5-9 May; January 1984

See also: DATE, DOB

HIPAA_SAFE_HARBOR, CCI
DOB Dates of birth
Born: March 7, 1961

See also: DATE, DATE_INTERVAL

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
DRIVER_LICENSE Driver's permit numbers
DL# 134711-320

See also: VEHICLE_ID
More details
Includes International Driving Permits (IDP) and pilot’s licenses
CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
DURATION Periods of time, specified as a number and a unit of time
8 months; 2 years


Note: Full support for English; Multilingual support in progress
EMAIL_ADDRESS Email addresses
info@cobaltspeech.com

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
EVENT Names of events or holidays
Olympics; Yom Kippur

FILENAME Names of computer files, including the extension or filepath
Taxes/2012/brad-tax-returns.pdf

CCI
GENDER Terms indicating gender identity, including slang terms. Note that performance is stronger for terms that are more likely to occur in formal documents, such as "male", "transgender", "non-binary", "female", "M", "F", etc. Other terms, such as "woman", "gentleman", etc., may not be captured in every context.
female; trans

CPRA, GDPR, GDPR Sensitive, APPI Sensitive
HEALTHCARE_NUMBER Healthcare numbers and health plan beneficiary numbers
Policy No.: 5584-486-674-YM
More details
Includes medical record numbers, health insurance policy/account numbers, and member IDs, for example, German Sozialversicherungsnummer (also used as SSN), Philippine PhilHealth ID number, Ukrainian VHI number
CPRA, GDPR, HIPAA, Quebec Privacy Act, APPI
IP_ADDRESS Internet IP address, including IPv4 and IPv6 formats
192.168.0.1
2001:db8:0:0:0:8a2e::7334

CPRA, GDPR, HIPAA, Quebec Privacy Act, APPI
LANGUAGE Names of natural languages
Korean; French

GDPR, GDPR Sensitive, APPI Sensitive
LOCATION Metaclass for any named location reference; See subclasses below
Eritrea; Lake Victoria
More details
May co-occur with ORGANIZATION when the context refers explicitly to the organization’s location
The patient was transferred to Northwest General Hospital
GDPR, HIPAA_SAFE_HARBOR, APPI, CCI
LOCATION_ADDRESS Full or partial physical mailing addresses, which can include: building name or number, street, city, county, state, country, zip code
25/300 Adelaide T., Perth WA 6000, Aus.
145 Windsor St.
Mail to: Kollwitzstr 13, 10405, Berlin

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
LOCATION_ADDRESS_STREET A subclass of LOCATION_ADDRESS, covering: a building number and street name, plus information like a unit numbers, office numbers, floor numbers and building names, where applicable
25/300 Adelaide T., Perth WA 6000, Aus.
145 Windsor St.
Mail to: Kollwitzstr 13, 10405, Berlin

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
LOCATION_CITY Municipality names, including villages, towns, and cities
Toronto; Berlin; Denpasar

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
LOCATION_COORDINATE Geographic positions referred to using latitude, longitude, and/or elevation coordinates
We’re at 40.748440 and -73.984559

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
LOCATION_COUNTRY Country names
Canada; Namibia

GDPR, APPI, CCI
LOCATION_STATE State, province, territory, or prefecture names
Ontario; Arkansas; Ich lebe in NRW

GDPR, APPI, CCI
LOCATION_ZIP Zip codes (including Zip+4), postcodes, or postal codes
90210; B2N 3E3
More details
Optimized for various English-speaking locales (Australia, Canada, United Kingdom, United States), as well as international equivalents
CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
MARITAL_STATUS Terms indicating marital status
single; common-law; ex-wife; married

APPI Sensitive
MONEY Names and/or amounts of currency
15 pesos; $94.50

CCI
NAME Names of individuals, not including personal titles such as ‘Mrs.’ or ‘Mr.’
Dwayne Johnson; Mr. Khanna

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
NAME_FAMILY Names indicating a person’s family or community; often a last name in Western cultures and first name in Eastern cultures
François Truffaut; Ozu Yasujirō

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
NAME_GIVEN Names given to an individual, usually at birth; often first / middle names in Western cultures and middle / last names in Eastern cultures
François Truffaut; Ozu Yasujirō

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
NAME_MEDICAL_PROFESSIONAL Full names, including professional titles and certifications, of medical professionals, such as doctors and nurses
Attending physician: Dr. Kay Martinez, MD

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
NUMERICAL_PII Numerical PII (including alphanumeric strings) that doesn't fall under other categories. See also a section below on international variants as some of them are mapped to this category, for example, Belgian BTW nummer or European VAT number.
More details
Includes the following: numbers in the medical field, such as device serial numbers, POS codes, NPI numbers, etc.; computer numbers like MAC addresses, cookie IDs, VPNs, error codes, access codes, message IDs, etc.; business-related numbers like DUNS numbers, company registration numbers, provider IDs, etc.; numbers related to purchasing, like order IDs, transaction numbers, confirmation numbers, tracking numbers, etc.; also numbers assigned to various forms of IDs, files, documents, proceedings, invoices, claim IDs, record IDs, etc.
CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
OCCUPATION Job titles or professions
professor; actors; engineer; CPA

Quebec Privacy Act, APPI, CCI
ORGANIZATION Names of organizations or departments within an organization
BHP; McDonald's; LAPD
More details
May co-occur with LOCATION when the context refers explicitly to the organization’s location
Donations can be brought to Royal Canadian Legion Branch 43
Quebec Privacy Act, APPI, CCI
ORGANIZATION_MEDICAL_FACILITY Names of medical facilities, such as hospitals, clinics, pharmacies, etc.
Northwest General Hospital; Union Family Health Clinic
Quebec Privacy Act, APPI
ORIGIN Terms indicating nationality, ethnicity, or provenance
Canadian; Sri Lankan

CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive
PASSPORT_NUMBER Passport numbers, issued by any country
PA4568332; NU3C6L86S12

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
PASSWORD Account passwords, PINs, access keys, or verification answers
27%alfalfa; temp1234
My mother's maiden name is Smith

CPRA, APPI, CCI
PHONE_NUMBER Telephone or fax numbers
+4917643476050

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
PHYSICAL_ATTRIBUTE Distinctive bodily attributes, including terms indicating race
I'm 190cm tall; He belongs to the Black students’ association

CPRA, GDPR, GDPR Sensitive, APPI Sensitive
POLITICAL_AFFILIATION Terms referring to a political party, movement, or ideology
liberal; Republican

CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive
RELIGION Terms indicating religious affiliation
Hindu; Presbyterian

CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive
SEXUALITY Terms indicating sexual orientation, including slang terms
bisexual; gay; straight

CPRA, GDPR, GDPR Sensitive, APPI Sensitive
SSN Social Security Numbers or international equivalent government identification numbers
078-05-1120; ***-***-3256
More details
Includes, for example, Australian TFN, Belgian NISS, British NIN, Canadian SIN, Dutch BSN, German Sozialversicherungsnummer (also used as a healthcare number, see: HEALTHCARE_NUMBER), French INSEE, Indian Aadhaar, Italian TIN, Philippine SSS, Spanish NUSS, Ukrainian TIN, and Mexican NSS formats. Flags mentions of complete numbers as well as the last four digits only.
CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
TIME Expressions indicating clock times
19:37:28; 10pm EST

CCI
URL Internet addresses
www.cobaltspeech.com

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, CCI
USERNAME Usernames, login names, or handles
cobaltspeechandlanguage; @_CobaltSpeechAndLanguage

CPRA, GDPR, APPI
VEHICLE_ID Vehicle identification numbers (VINs), vehicle serial numbers, and license plate numbers
5FNRL38918B111818; BIF7547

See also: DRIVER_LICENSE

CPRA, GDPR, HIPAA_SAFE_HARBOR, APPI, CCI
ZODIAC_SIGN Names of Zodiac signs
Aries; Taurus

Protected Health Information (PHI)

Label Description Regulatory Compliance
BLOOD_TYPE Blood types
She's type AB positive

CPRA, GDPR, Quebec Privacy Act
CONDITION Names of medical conditions, diseases, syndromes, deficits, disorders
chronic fatigue syndrome; arrhythmia; depression

CPRA, GDPR, Quebec Privacy Act, APPI Sensitive
DOSE Medically prescribed quantity of a medication
limit intake to 700 mg/day

DRUG Medications, vitamins, and supplements
advil; Acetaminophen; Panadol

CPRA, GDPR, Quebec Privacy Act, APPI Sensitive, CCI
INJURY Bodily injuries, including mutilations, miscarriages, and dislocations
I broke my arm; I have a sprained wrist

CPRA, GDPR, Quebec Privacy Act, APPI Sensitive
MEDICAL_PROCESS Medical processes, including treatments, procedures, and tests
heart surgery; CT scan

CPRA, GDPR, Quebec Privacy Act, APPI Sensitive, CCI
STATISTICS Medical statistics
18% of patients

Quebec Privacy Act

Payment Card Industry Information (PCI)

Label Description Policy & Regulatory Compliance
BANK_ACCOUNT Bank account numbers and international equivalents, such as IBAN
Acct. No.: 012345-67

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
CREDIT_CARD Credit card numbers
0123 0123 0123 0123
**** **** ****4252
More details
Includes debit, ATM, Direct Debit, PrePay, Charge Cards, and support for cards that do not have 16 digits such as American Express or China UnionPay cards. Flags mentions of complete numbers as well as the last four digits only.
CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
CREDIT_CARD_EXPIRATION Expiration date of a credit card
Expires: July 2023; Exp: 02/28

CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI
CVV 3- or 4-digit card verification codes and equivalents
CVV: 080
More details
Includes institution-specific variants:

American Express: CID (card ID), CVD (card verification data), CSC / 3CSC (card security code)
China UnionPay: CVN (card validation number)
CIBC Mastercard: SPC (signature panel code)
Discover: CID (card ID), CVD (card verification data)
ELO (Brazil): CVE (Elo verification code)
JCB (Japan Credit Bureau): CAV (card authentication value)
Mastercard: CVC (card validation code)
VISA: CVV (card verification value)
CPRA, GDPR, HIPAA_SAFE_HARBOR, Quebec Privacy Act, APPI, CCI
ROUTING_NUMBER Routing number associated with a bank or financial institution
012345678
More details
Includes international equivalents: Canadian & British sort codes, Australian BSB numbers, Indian Financial System Codes, Branch/transit numbers, Institution numbers, and Swift codes
CCI

Beta Entity Types

Note that Beta support for the following entity types is currently only available with our English models.

Label Description Regulatory Compliance
CORPORATE_ACTION Any action a company takes that could affect its stock value or its shareholders
Bridge Investment Group LLC (later renamed Bridge Investment Group Holdings LLC); We’ve merged two neighboring retail locations
CCI
FINANCIAL_METRIC Financial metrics or financial ratios are quantitative indicators of a company’s financial health
adjusted earnings per share declined year-over-year; Online sales slow as UK shoppers rein in Christmas spending
CCI
MEDICAL_CODE Codes belonging to medical classification systems such as SNOMED, ICD-10, NDC, etc.
1981-03-11T04:11:32-03:00 Forearm sprain SNOMED-CT 70704007;
<medcode type="string"> R74.8 <desc type="string">Abnormal levels of other serum enzymes
CPRA, GDPR, GDPR Sensitive, Quebec Privacy Act, APPI Sensitive
PRODUCT Names or model numbers of items made by an organization; includes intangible products like software and games, as well as other services
iPhone; Toyota Camry
CCI
TREND A description of the “quality” or the direction in which a financial measurement is going
reflecting the accelerating shift of off-line to online; amid rising costs and shrinking profits
CCI
Each of the beta entities must be enabled explicitly in the deid request. This can be done by setting the redaction classes flag in the client:

./privacy-screen-grpc-client redact-text --insecure --model-id general --input-text input.txt --output-result redacted_token.json --redaction-classes="AGE,BANK_ACCOUNT"

International Entity Mapping

In the tables below, you can find localized variants of our entity types. For each entity type, there is a description, an example, and the label under which the entity falls. This section does not include entity types that may vary regionally, but still directly correspond to one of the entities listed above (e.g., PHONE_NUMBER, PASSPORT_NUMBER, DRIVER_LICENSE, LOCATION_ADDRESS, CREDIT_CARD_EXPIRATION). The following numbers are commonly used across many countries and are therefore not included in each country’s table: GST (Goods and Services Tax), HST (Harmonized Sales Tax). These numbers are redacted as NUMERICAL_PII.

Asia Pacific

Australia

Identifier PAI Label Description Example
Australian business number (ABN) NUMERICAL_PII A unique 11-digit identifier that every registered business in Australia is required to have 12345678901
Australian Company Number (ACN) NUMERICAL_PII A 9-digit number that must be displayed on all company documents 123 456 789
Bank-State-Branch (BSB) ROUTING_NUMBER A 6-digit number that identifies banks and branches across Australia 123-456
Tax File Number (TFN) SSN A 9-digit personal reference number used for tax and superannuation 456 789 123

China

Identifier PAI Label Description Example
医保卡号 HEALTHCARE_NUMBER Healthcare number Format varies by provider
纳税人识别号码 SSN Taxpayer identification number consisting of 18 digits for individuals 463728374657483746

India

Identifier PAI Label Description Example
Aadhaar SSN A 12-digit individual identification number used as a proof of identity and address 1234 5678 9123
Financial System Code ROUTING_NUMBER A unique 11-digit alphanumeric code that is used for online fund transfer transactions IDIB000T131
Goods and Services Tax Identification Number (GSTIN) NUMERICAL_PII A unique 15-digit identification number assigned to every taxpayer in India 56HNJCA5424K1DM
Permanent Account Number (PAN) SSN A unique 10-digit tax identification number issued by the Income Tax Department ABCJF54312D

Japan

Identifier PAI Label Description Example
健康保険番号 HEALTHCARE_NUMBER Health insurance number Format varies by provider
マイナンバー (個人番号) SSN My Number (also known as "personal number"), a unique 12-digit number assigned to every resident of Japan, whether Japanese or foreign 123456789888

Korea

Identifier PAI Label Description Example
건강보험증번호 HEALTHCARE_NUMBER Health insurance card number Format varies by provider
주민등록번호 SSN Resident Registration Number used for tax purposes, consists of 13 digits 1236547898745

New Zealand

Identifier PAI Label Description Example
Inland Revenue Department number (IRD) SSN A nine-digit individual identification number issued to each person by the New Zealand Inland Revenue Department, also known as a ‘tax file number’ 099-999-999

Philippines

Identifier PAI Label Description Example
PhilHealth ID number HEALTHCARE_NUMBER 12-digit healthcare identification number 11-455678912-3
Social Security System number (SSS) SSN 10-digit number used for tax purposes 12-3456789-1
Tax Identification Number (TIN) SSN 12-digit number identifying a taxpayer 123 456 789 002

Europe

Identifier PAI Label Description Example
Value-Added Tax (VAT) NUMERICAL_PII A tax applied to all goods and services that are bought and sold for use or consumption in the European Union, formatted as 2 letters (country code) followed by 8-10 digits. Localized names include French "numéro TVA" DK99999999
International Bank Account Number (IBAN) BANK_ACCOUNT An international system of identifying bank accounts across national borders, consists of up to 34 alphanumeric characters including country codes IE12BOFI90000112345678

Belgium

Identifier PAI Label Description Example
Identificatienummer van de Sociale Zekerheid (INSZ) / Numéro d'identification à la sécurité sociale (NISS) SSN National identification number for social security, an 11-digit national registration number, the first 6 digits indicating date of birth 99013187654
Ondernemingsnummer NUMERICAL_PII A unique 10-digit identification number for a business 1987654323
Belasting Toegevoegde Waarde Nummer (BTW Nummer) NUMERICAL_PII An identification number for businesses used for VAT (Value Added Tax) purposes, formatted as 2 letters followed by 10 digits BE0784732737

Germany

Identifier PAI Label Description Example
Bankkontonummer BANK_ACCOUNT Bank account number, 10 digits 0532013000
Krankenversicherungsnummer (KVNR) HEALTHCARE_NUMBER An alphanumeric code used for personal identification in Germany's national health insurance system (Krankenversicherung) A123456789
Sozialversicherungsnummer SSN, HEALTHCARE_NUMBER A 12-digit number used to track a person's social security contributions, doubles as a healthcare number 12 123456 A 123
Steuer-Identnummer (St-Nr) SSN A unique 11-digit number assigned to every taxpayer in Germany by the Federal Central Tax Office 12345678909

France

Identifier PAI Label Description Example
Numéro d'Inscription au Répertoire (NIR) SSN A 15-digit ID number commonly known as a numéro de sécurité sociale, also referred to as an Insee number, used for employment and French health benefits 1790223354367-97
Simplification des procédures d’Imposition (SPI) SSN The French numéro fiscal, or numéro SPI, a unique 13-digit tax number issued by the French tax authorities to all residents and non-residents with an obligation to pay tax 12 34 567 891 234
Système d'identification du répertoire des entreprises (SIREN) NUMERICAL_PII A 9-digit identifier assigned to every registered business in France by the National Institute of Statistics and Economic Studies (INSEE) 732 829 320

Italy

Identifier PAI Label Description Example
Numero di identificazione fiscale or codice fiscale SSN Tax Identification Number (TIN), a 9-12 digit numeric code 000-123-456-789

Netherlands

Identifier PAI Label Description Example
Burgerservicenummer (BSN) SSN A 9-digit citizen service number 123456789

Portugal

Identifier PAI Label Description Example
Número de Identificação da Segurança Social (NISS) SSN An 11-digit number used to identify individuals in the Portuguese social security system 12354687985

Russia

Identifier PAI Label Description Example
Идентификационный номер налогоплательщика (ИНН) SSN Taxpayer Personal Identification Number, 10-12 digits, used as a social security number 12 34567891 23

Spain

Identifier PAI Label Description Example
Número de la Seguridad Social (NUSS) SSN 11-12 digit social security number 12 34567891 23
Número de Identificación Fiscal (NIF) SSN A 10-character number that is used to interact with the Spanish tax agency X12345678A

Ukraine

Identifier PAI Label Description Example
Ідентифікаційний номер платника податків (ІНПП) SSN A 10-digit Taxpayer Identification Number (TIN) 1234567891

United Kingdom

Identifier PAI Label Description Example
National Insurance Number (NIN) SSN Used in the UK's social security system and tax system, formatted as 2 prefix letters, 6 digits, and 1 suffix letter QQ123456C
Sort code ROUTING_NUMBER Identifies both the bank (in the first digit or first two digits) and the branch where the account is held, usually formatted as 3 pairs of numbers 12-34-56
U.K. Unique Taxpayer Reference Number (UTR) SSN A 10-digit number, also called "tax reference," used in the U.K. when submitting a tax return 12345 67890
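As an illustration of the National Insurance Number layout described above (2 prefix letters, 6 digits, 1 suffix letter), a simple format check can be sketched as follows. This is an illustration only, not the engine's detection logic; real NINs also restrict which prefix letters are allowed, which this sketch does not model.

```python
import re

# Sketch of the NIN layout described above: 2 prefix letters, 6 digits,
# and a suffix letter (A-D). Real NINs also restrict the allowed prefix
# letters; this simplified check does not.
NIN_RE = re.compile(r"^[A-Z]{2}\d{6}[A-D]$")

def looks_like_nin(token):
    """Return True when a token matches the basic NIN layout."""
    return NIN_RE.match(token) is not None

print(looks_like_nin("QQ123456C"))  # True
print(looks_like_nin("12-34-56"))   # False
```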

North and South America

Brazil

Identifier PAI Label Description Example
Cadastro de Pessoas Físicas (CPF number) SSN Natural Persons Register, an 11-digit number in the format: 000.000.000-00 657.454.244-54

Canada

Identifier PAI Label Description Example
Healthcare number HEALTHCARE_NUMBER Canadian Health Service Numbers, such as Care card number, OHIP, etc., required for access to healthcare benefits Format varies by province
Numéro d'assurance sociale (NAS) SSN A 9-digit number that citizens and permanent residents need to work and be paid in Québec; French equivalent of SIN (see below) 365 789 654
Régie de l'assurance maladie du Québec (RAMQ) HEALTHCARE_NUMBER The Québec Health Insurance Number BOUF 9401 1419
Social Insurance Number (SIN) SSN A 9-digit number that citizens and permanent residents need to work and be paid in Canada 321 654 987
Sort code ROUTING_NUMBER A unique 9-digit code that identifies the financial institution (4 digits) and branch of account (5-digit Transit Code) 123456789

Mexico

Identifier PAI Label Description Example
Número de Seguridad Social (NSS) SSN Social Security Number, an 11-digit code 12345678912

United States

Identifier PAI Label Description Example
Healthcare number HEALTHCARE_NUMBER A unique number assigned by a health insurance provider (includes private and government) Format varies by provider
Social Security Number (SSN) SSN A 9-digit number issued to U.S. citizens, permanent residents, and (working) temporary residents 453-65-4543
U.S. Individual Taxpayer Identification Number (ITIN) SSN A 9-digit tax processing number that begins with "9" issued for some categories of population instead of SSN 923-45-6789

1.7 - Redaction Languages

Detailed list of languages that are supported.

Cobalt features core support for 14 languages and extended support for 39 additional languages, with core languages featuring the highest level of performance. The complete list of supported languages below details which languages have core support, which have extended or beta support, and which are upcoming additions. New languages are continually being added; please contact us if you require a language not in the list below.

In addition to supporting 50+ languages, Cobalt offers support for regional language varieties in recognition of the large differences in vocabulary and grammar that can exist in the same language when spoken in different regions. So far, this includes support for varieties of English (US, UK, Canada and Australia), Spanish (Spain and Mexico), French (France and Canada), and Portuguese (Portugal and Brazil). Cobalt also supports code-switching, or mixing of different languages. This means that, in a phrase such as J’ai payé 76,88RM por ein Haarschnitt da 范玉菲 habang ko ay nasa Україна, multilingual PII is accurately de-identified. The selection of supported regional language varieties is continually being expanded; please let us know if there is a specific request.

Cobalt’s supported entity types function across each supported language, with multilingual equivalents of different PII (Personally Identifiable Information) entities, PHI (Protected Health Information) entities, and PCI (Payment Card Industry) entities being detected in each language. Our Supported Entity Types page provides a more detailed look at our coverage of language and region-specific entity equivalents. The solution is also sensitive to cross-linguistic differences in how names are structured, how place names are referred to, and how monetary units are described in different languages, among other differences.

Core Support

Language ISO Code Supported Regional Varieties Support Level Text Support Audio Support File Support Labels
Dutch nl The Netherlands Core
English en Australia, Canada, United Kingdom, United States Core
French fr Canada (Quebec), France, Switzerland Core
German de Germany, Belgium, Austria, Switzerland Core
Hindi hi India Core
Italian it Italy, Switzerland Core
Japanese ja Japan Core
Korean ko Korea Core
Mandarin (simplified) zh-Hans China, Singapore Core
Portuguese pt Brazil, Portugal Core
Russian ru Russia Core
Spanish es Mexico, Spain Core
Tagalog tl Philippines Core
Ukrainian uk Ukraine Core

Extended Support

Language ISO Code Support Level Text Support Audio Support File Support Labels
Afrikaans af Extended
Arabic ar Extended
Bambara bm Extended
Bengali bn Extended
Belarusian be Extended
Bulgarian bg Extended
Burmese my Extended
Cantonese (traditional) zh-Hant Extended
Catalan ca Extended
Croatian hr Extended
Czech cs Extended
Danish da Extended
Estonian et Extended
Finnish fi Extended
Georgian ka Extended
Greek el Extended
Hebrew he Extended
Hungarian hu Extended
Icelandic is Extended
Indonesian id Extended
Khmer km Extended
Latvian lv Extended
Lithuanian lt Extended
Luxembourgish lb Extended
Malay ms Extended
Moldovan ro Extended
Norwegian (Bokmål) nb Extended
Persian (Farsi) fa Extended
Polish pl Extended
Punjabi pa Extended
Romanian ro Extended
Slovak sk Extended
Slovenian sl Extended
Swahili sw Extended
Swedish sv Extended
Tamil ta Extended
Thai th Extended
Turkish tr Extended
Vietnamese vi Extended

1.8 - Prerequisites and System Requirements

Detailed information for the system requirements.

Prerequisites

The following prerequisites are required to run the container:

  • Container engine, such as Docker (can be installed using the official instructions)
  • (GPU only) Nvidia Container Toolkit with Nvidia driver version 515 or higher (can be installed using the following installation guide)

All other dependencies, such as CUDA, are included with the container and don’t need to be installed separately.

System Requirements

The image comes in two different build flavours:

  • A compact, CPU-only container that runs on any Intel or AMD CPU. The CPU container is highly optimised for the majority of use cases: it uses hand-coded AMX/AVX2/AVX512/AVX512 VNNI instructions in conjunction with neural network compression techniques to deliver a ~25X speedup over a reference transformer-based system.
  • A GPU-accelerated container designed for large-scale deployments making billions of API calls or processing terabytes of data per month.

Minimum Requirements

The minimum system requirements for the container image are as follows:

| | Minimum | Recommended (Text only) | Recommended (All Features) | Recommended Concurrency |
|---|---|---|---|---|
| CPU | Any x86 (Intel or AMD) processor with 7.5GB free RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 16GB RAM and 50GB disk volume | Intel Sapphire Rapids or newer CPUs supporting AMX with 64GB RAM and 100GB disk volume | 1 |
| GPU | Any x86 (Intel or AMD) processor with 28GB free RAM; Nvidia GPU with compute capability 7.0 or higher (Volta or newer) and at least 16GB VRAM; 100GB disk volume | Any x86 (Intel or AMD) processor with 32GB RAM and Nvidia Tesla T4 GPU; 100GB disk volume | Any x86 (Intel or AMD) processor with 64GB RAM and Nvidia Tesla T4 GPU; 100GB disk volume | 32 |

While the CPU-based container will run on any x86-compatible instance, the cloud instance types below give optimal throughput and latency per dollar:

| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
|---|---|---|
| Azure | Standard_E2_v5 (2 vCPUs, 16GB RAM) | Standard_E8_v5 (8 vCPUs, 64GB RAM) |
| AWS | m7i.large (2 vCPUs, 8GB RAM) | m7i.4xlarge (16 vCPUs, 64GB RAM) |
| GCP | n2-standard-2 (2 vCPUs, 8GB RAM) | n2-standard-16 (16 vCPUs, 64GB RAM) |
Notes:
  • If lower latency is required, the instance type should be scaled up; e.g. use an m7i.xlarge in place of an m7i.large. While the Cobalt Docker solution can make use of all available CPU cores, it delivers the best throughput per dollar on a single-core machine; scaling CPU cores does not result in a linear increase in performance.

Similarly, for the GPU-based image, the following Nvidia T4 GPU-equipped instance types are recommended:

| Platform | Recommended Instance Type (Text only) | Recommended Instance Type (All Features) |
|---|---|---|
| Azure | Standard_NC8as_T4_v3 | Standard_NC8as_T4_v3 |
| AWS | g4dn.2xlarge | g4dn.4xlarge |
| GCP | n1-standard-8 + Tesla T4 | n1-standard-16 + Tesla T4 |

1.9 - Proto API Reference

Detailed reference for API requests and types.

The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.

This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.

TrifidService

Service that implements the Cobalt Trifid Redaction Engine API.

Version

Version(VersionRequest) VersionResponse

Returns version information from the server.

ListModels

ListModels(ListModelsRequest) ListModelsResponse

ListModels returns information about the models the server can access.

RedactText

RedactText(RedactTextRequest) RedactTextResponse

Redact text using a redaction engine that is configured with the provided redaction configuration.

RedactTranscript

RedactTranscript(RedactTranscriptRequest) RedactTranscriptResponse

Redacts a transcript using a redaction engine that is configured with the provided redaction configuration.

StreamingRedactTranscribedAudio

StreamingRedactTranscribedAudio(StreamingRedactTranscribedAudioRequest) StreamingRedactTranscribedAudioResponse

Performs bidirectional streaming redaction on transcribed audio. Receive redacted audio while sending audio. The transcription of audio data must be ready before sending the audio.

StreamingTranscribeAndRedact

StreamingTranscribeAndRedact(StreamingTranscribeAndRedactRequest) StreamingTranscribeAndRedactResponse

Performs bidirectional streaming speech recognition and redaction. Receive redacted audio and transcriptions while sending audio.

Messages

  • If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of the fields populated.
  • If a field is labeled repeated, then the generated code will accept an array (or struct, or list depending on the language).
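The oneof rule above can be illustrated with a small sketch. This uses plain dicts in place of generated message classes; `exactly_one_set` and the field names are hypothetical helpers for illustration only.

```python
# Illustration of the oneof rule: exactly one of the oneof fields must
# be populated per call. Plain dicts stand in for generated messages.
def exactly_one_set(message, oneof_fields):
    """Return True when exactly one of the oneof fields is populated."""
    return sum(message.get(f) is not None for f in oneof_fields) == 1

print(exactly_one_set({"config": {"model_id": "general"}, "audio": None},
                      ("config", "audio")))  # True
print(exactly_one_set({"config": None, "audio": None},
                      ("config", "audio")))  # False
```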

ListModelsRequest

The top-level message sent by the client for the ListModels method.

ListModelsResponse

The message returned to the client by the ListModels method.

Fields

  • models (ModelInfo repeated) List of models available for use on Trifid server.

ModelInfo

Description of a Trifid Model

Fields

  • id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for recognition, and is specified in the RedactionConfig message.

  • name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their recognition task.

  • redaction_classes (string repeated) List of supported redaction classes.

RedactTextRequest

The top-level message sent by the client for the RedactText method.

Fields

  • redaction_config (RedactionConfig)
  • text (string ) Text to be redacted.

RedactionConfig

Configuration for setting up a redaction engine.

Fields

  • model_id (string) Unique identifier of the model to use, as obtained from a ModelInfo message.
  • redaction_classes (string repeated) List of whitelisted redaction classes. If the list is empty, server default redaction class list will be considered.
  • disable_streaming (bool ) This is an optional field. If set to true, Cobalt Privacy Screen will redact the entire transcript at once; doing so increases redaction accuracy at the cost of higher latency. If set to false, Cobalt Privacy Screen will redact one utterance at a time and return the result as soon as possible. The default is false.
  • custom_classes (CustomClasses repeated) This is an optional field. If set, then provided list will be used to extend the list of redaction classes.
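For orientation, the fields above can be pictured as a plain dictionary. The model id and class names below are illustrative values (the class names are taken from the CLI example earlier on this page); a real client would build the generated RedactionConfig protobuf message instead.

```python
# Hypothetical RedactionConfig, shown as a plain dict mirroring the
# fields above. Values are illustrative; a real client would use the
# generated protobuf classes.
redaction_config = {
    "model_id": "general",
    "redaction_classes": ["AGE", "BANK_ACCOUNT"],  # empty list = server default
    "disable_streaming": False,  # redact utterance by utterance (default)
    "custom_classes": [
        {"redaction_class": "COMPANY_NAME", "pattern": "COBALT|GOOGLE|MICROSOFT"},
    ],
}

print(sorted(redaction_config))
# ['custom_classes', 'disable_streaming', 'model_id', 'redaction_classes']
```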

CustomClasses

CustomClasses allows the client to define a new redaction class. Patterns defined here will override the default redaction class for matching tokens.

Fields

  • redaction_class (string ) The name of the new redaction class. For example, this could be “COMPANY_NAME”.
  • pattern (string ) A Python regular expression used to identify tokens in text that get redacted to this new redaction class. For example, “COBALT|GOOGLE|MICROSOFT”, or more complex patterns such as “^COMPANY-[\d]{4}$”.
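The two example patterns above can be exercised with Python's `re` module. This sketch only illustrates how a custom class pattern selects tokens, not the engine's actual matching code; "COMPANY_ID" is a hypothetical class name.

```python
import re

# Patterns taken from the CustomClasses examples above. "COMPANY_ID" is
# a hypothetical class name; this sketch is not the engine's code.
custom_classes = {
    "COMPANY_NAME": re.compile(r"COBALT|GOOGLE|MICROSOFT"),
    "COMPANY_ID": re.compile(r"^COMPANY-[\d]{4}$"),
}

def classify(token):
    """Return the first custom redaction class whose pattern matches."""
    for label, pattern in custom_classes.items():
        if pattern.search(token):
            return label
    return None

print(classify("COBALT"))        # COMPANY_NAME
print(classify("COMPANY-2023"))  # COMPANY_ID
print(classify("hello"))         # None
```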

RedactTranscriptRequest

The top-level message sent by the client for the RedactTranscript method. Contains the redaction config and a transcript to redact.

Fields

RedactTranscriptResponse

The top-level message sent by the server for the RedactTranscript method. Contains the redacted transcript.

Fields

RedactionConfig

Configuration for setting up a redaction engine.

Fields

  • model_id (string ) Unique identifier of the model to use, as obtained from a ModelInfo message.

  • redaction_classes (string repeated) List of whitelisted redaction classes. If the list is empty, server default redaction class list will be considered.

  • disable_streaming (bool ) This is an optional field. If set to true, Trifid will redact the entire transcript at once; doing so increases redaction accuracy at the cost of higher latency. If set to false, Trifid will redact one utterance at a time and return the result as soon as possible. The default is false.

Transcript

A transcript contains multiple utterances of the audio.

Fields

Utterance

An utterance of the audio.

Fields

  • text (string ) Text representing the utterance of the audio.

  • audio_channel (uint32 ) Channel of the audio file associated with this utterance. Channels are 0-indexed, so for mono audio data this value will always be 0.

  • start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio corresponding to the start of this utterance.

  • duration_ms (uint64 ) Duration in milliseconds of the current utterance in the audio.

  • asr_confidence (double ) ASR confidence estimate between 0 and 1. A higher number represents a higher likelihood of the output being correct.

  • words_info (WordInfo repeated) Word-level information corresponding to the utterance. This field contains word-level timestamps, which are essential as input for audio redaction. This field is only available in an output utterance if enable_word_info was set to true in the RedactionConfig.

StreamingRedactTranscribedAudioRequest

The top-level messages sent by the client for the StreamingRedactTranscribedAudio method. In this streaming call, multiple StreamingRedactTranscribedAudioRequest messages should be sent. The first message must contain a RedactTranscribedAudioConfig message only and all subsequent messages must contain audio data only.

Fields

RedactTranscribedAudioConfig

Configuration for setting up a StreamingRedactTranscribedAudio method.

Fields

  • redaction_config (RedactionConfig) Text redaction config.
  • transcript (Transcript) Transcription of the entire audio. This must be ready before sending the audio.
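The message ordering described above (first message carries only the config, every later message carries only audio) can be sketched as a request generator. Plain dicts stand in for the generated protobuf classes, and the audio bytes are placeholder values.

```python
# Sketch of the streaming request ordering: config-only first message,
# audio-only messages afterwards. Dicts stand in for protobuf messages.
def request_stream(config, audio_chunks):
    yield {"config": config}       # first message: config only
    for chunk in audio_chunks:
        yield {"audio": chunk}     # subsequent messages: audio only

msgs = list(request_stream({"redaction_config": {"model_id": "general"}},
                           [b"\x00\x01", b"\x02\x03"]))
print(len(msgs))  # 3
```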

StreamingRedactTranscribedAudioResponse

The top-level message sent by the server for the StreamingRedactTranscribedAudio method. In this streaming call, multiple StreamingRedactTranscribedAudioResponse messages will be returned, each containing either an Utterance or redacted audio data.

Fields

StreamingTranscribeAndRedactRequest

The top-level messages sent by the client for the StreamingTranscribeAndRedact method. In this streaming call, multiple StreamingTranscribeAndRedactRequest messages should be sent. The first message must contain a TranscribeAndRedactConfig message only and all subsequent messages must contain audio data only.

Fields

TranscribeAndRedactConfig

Configuration for setting up a StreamingTranscribeAndRedact method.

Fields

  • redaction_config (RedactionConfig) Text redaction config.
  • enable_unredacted_transcript (bool) This is an optional field. If set to true, each utterance result will include the unredacted utterance. If set to false, no unredacted utterance will be returned. The default is false.

StreamingTranscribeAndRedactResponse

The top-level message sent by the server for the StreamingTranscribeAndRedact method. In this streaming call, multiple StreamingTranscribeAndRedactResponse messages will be returned, each containing either a TranscribeAndRedactUtterance or redacted audio data.

Fields

VersionRequest

The top-level message sent by the client for the Version method.

VersionResponse

The top-level message sent by the server for the Version method.

Fields

  • version (string ) Version of the server handling these requests.

WordInfo

Word-level details for words in an utterance.

Fields

  • text (string ) The actual word corresponding to the utterance.

  • asr_confidence (double ) ASR confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly recognized.

  • start_time_ms (uint64 ) Time offset in milliseconds relative to the beginning of audio received by the recognizer and corresponding to the start of this spoken word.

  • duration_ms (uint64 ) Duration in milliseconds of the current word in the spoken audio.

  • is_redacted (bool ) If set to true, the current word is a redacted word or an original word that was redacted.

  • redaction_class (string ) Recognized redaction class. This is available only if the current word is a redacted word.

  • redaction_confidence (double ) Redaction confidence estimate between 0 and 1. A higher number represents a higher likelihood that the word was correctly redacted. This is available only if the current word is a redacted word.
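The word-level timestamps above are what make audio redaction possible: the spans of redacted words tell the client which regions of audio to silence or bleep. The sketch below illustrates this with dicts standing in for WordInfo messages; the word list is a hypothetical example of redacted output, not real engine output.

```python
# Sketch of how WordInfo timestamps could locate the audio regions to
# silence or bleep. The word list is a hypothetical example; dicts
# stand in for WordInfo protobuf messages.
words = [
    {"text": "my",    "start_time_ms": 0,    "duration_ms": 200, "is_redacted": False},
    {"text": "<SSN>", "start_time_ms": 250,  "duration_ms": 900, "is_redacted": True,
     "redaction_class": "SSN"},
    {"text": "is",    "start_time_ms": 1200, "duration_ms": 150, "is_redacted": False},
]

def redacted_spans(words):
    """Return (start_ms, end_ms) spans covering each redacted word."""
    return [(w["start_time_ms"], w["start_time_ms"] + w["duration_ms"])
            for w in words if w["is_redacted"]]

print(redacted_spans(words))  # [(250, 1150)]
```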

Enums

Scalar Value Types

| .proto Type | C++ Type | C# Type | Go Type | Java Type | PHP Type | Python Type | Ruby Type |
|---|---|---|---|---|---|---|---|
| double | double | double | float64 | double | float | float | Float |
| float | float | float | float32 | float | float | float | Float |
| int32 | int32 | int | int32 | int | integer | int | Bignum or Fixnum (as required) |
| int64 | int64 | long | int64 | long | integer/string | int/long | Bignum |
| uint32 | uint32 | uint | uint32 | int | integer | int/long | Bignum or Fixnum (as required) |
| uint64 | uint64 | ulong | uint64 | long | integer/string | int/long | Bignum or Fixnum (as required) |
| sint32 | int32 | int | int32 | int | integer | int | Bignum or Fixnum (as required) |
| sint64 | int64 | long | int64 | long | integer/string | int/long | Bignum |
| fixed32 | uint32 | uint | uint32 | int | integer | int | Bignum or Fixnum (as required) |
| fixed64 | uint64 | ulong | uint64 | long | integer/string | int/long | Bignum |
| sfixed32 | int32 | int | int32 | int | integer | int | Bignum or Fixnum (as required) |
| sfixed64 | int64 | long | int64 | long | integer/string | int/long | Bignum |
| bool | bool | bool | bool | boolean | boolean | boolean | TrueClass/FalseClass |
| string | string | string | string | String | string | str/unicode | String (UTF-8) |
| bytes | string | ByteString | []byte | ByteString | string | str | String (ASCII-8BIT) |

2 - VoiceBio

Low latency, high accuracy on-prem / on-cloud solutions for speaker verification and identification.

2.1 - Getting Started

How to get a VoiceBio Server running on your system

Using Cobalt VoiceBio

  • A typical VoiceBio release, provided as a compressed archive, will contain a Linux binary (voicebio-server) for the required native CPU architecture, an appropriate Dockerfile, and models.

  • Cobalt VoiceBio runs either locally on Linux or using Docker.

  • Cobalt VoiceBio will serve the VoiceBio GRPC API on port 2727.

  • To quickly try out VoiceBio, first start the server as shown below and use the SDK in your preferred language to use VoiceBio from the command line or within your application.

Running VoiceBio Server Locally on Linux

./voicebio-server
  • By default, the binary assumes the presence of a configuration file, located in the same directory, named: voicebio-server.cfg.toml. A different config file may be specified using the --config argument.

Running VoiceBio Server as a Docker Container

To build and run the Docker image for VoiceBio, run:

docker build -t cobalt-voicebio .
docker run -p 2727:2727 -p 8080:8080 cobalt-voicebio

How to Get a Copy of the VoiceBio Server and Models

Contact us for getting a release best suited to your requirements.

The release you will receive is a compressed archive (tar.bz2) and is generally structured as follows:

release.tar.bz2
├── COPYING
├── README.md
├── voicebio-server
├── voicebio-server.cfg.toml
├── Dockerfile
├── models
│   └── en_US-16khz
│
└── cobalt.license.key [ provided separately, needs to be copied over ]
  • The README.md file contains information about this release and instructions for how to start the server on your system.

  • The voicebio-server is the server program which is configured using the voicebio-server.cfg.toml file.

  • The Dockerfile can be used to create a container that will let you run VoiceBio server on non-linux systems such as MacOS and Windows.

  • The models directory contains the speaker ID models. The contents of this directory will depend on the models you are provided.

System Requirements

Cobalt VoiceBio runs on Linux. You can run it directly as a linux application.

You can evaluate the product on Windows or Linux using Docker Desktop but we would not recommend this setup for use in a production environment.

A Cobalt VoiceBio release typically includes a single VoiceBio model together with binaries and config files. The general purpose VoiceBio models take up to 100MB of disk space, and need a minimum of 2GB RAM when evaluating locally. For production workloads, we recommend configuring containerized applications with each instance allocated with 4 CPUs and 4GB RAM.

Cobalt VoiceBio runs on x86_64 CPUs. We also support Arm64 CPUs, including processors such as the Graviton (AWS c7g EC2 instances). VoiceBio is significantly more cost effective to run on C7g instances compared to similarly sized Intel or AMD processors, and we can provide you an Arm64 release on request.

To integrate Cobalt VoiceBio into your application, please follow the next steps to install or generate the SDK in a language of your choice.

2.2 - Generating SDKs

Gives instructions about how to generate an SDK for your project from the proto API definition.
  • APIs for all Cobalt’s services are defined as a protocol buffer specification, or simply a proto file, and can be found in the cobaltspeech/proto github repository.

  • The proto file allows a developer to auto-generate client SDKs for a number of different programming languages. Step by step instructions for generating your own SDK can be found below.

  • We provide pre-generated SDKs for a couple of languages. You can choose to use these instead of generating your own. These are listed here along with instructions on how to install / import them into your projects.

Pre-generated SDKs

Golang

import voicebiopb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
  • An example client using the above repo can be found here.

Python

  • Pre-generated SDK files for Python can be found in the cobaltspeech/py-genproto repo

  • The Python SDK depends on Python >= 3.5. You may use pip to perform a system-wide install, or use virtualenv for a local install. To use it in your Python project, install it:

pip install --upgrade pip
pip install "git+https://github.com/cobaltspeech/py-genproto"

Generating SDKs

Step 1. Installing buf

  • To work with proto files, we recommend using buf, a user-friendly command line tool that can be configured to generate documentation, schemas and SDK code for different languages.
# Latest version as of March 14th, 2023.
COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/bin"

VERSION="1.15.1"
URL="https://github.com/bufbuild/buf/releases/download/v${VERSION}/buf-$(uname -s)-$(uname -m)"
curl -L "${URL}" -o "${COBALT}/bin/buf"

# Give executable permissions and add to $PATH.
chmod +x "${COBALT}/bin/buf"
export PATH="${PATH}:${COBALT}/bin"

# Alternatively, on macOS, install via Homebrew:
brew install bufbuild/buf/buf

Step 2. Getting proto files

COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"

# Change this to where you want to clone the repo to.
PROTO_REPO="${COBALT}/git/proto"

git clone https://github.com/cobaltspeech/proto "${PROTO_REPO}"

Step 3. Generating code

  • The cobaltspeech/proto repo provides a buf.gen.yaml config file to get you started with a couple of languages.

  • Other plugins can be added to the buf.gen.yaml file to generate SDK code for more languages.

  • To generate the SDKs, simply run the following (assuming the buf binary is in your $PATH):

cd "${PROTO_REPO}"

# Removing any previously generated files.
rm -rf ./gen

# Generating code for all proto files inside the `proto` directory.
buf generate proto
  • You should now have a folder called gen inside ${PROTO_REPO} that contains the generated code. The latest version of the VoiceBio API is v1. You can import / include / copy the generated files into your projects as per the conventions of different languages.
gen
├── ... other languages ...
└── py
  └── cobaltspeech
    ├── ... other services ...
    └── voicebio
      └── v1
        ├── voicebio_pb2_grpc.py
        ├── voicebio_pb2.py
        └── voicebio_pb2.pyi
gen
├── ... other languages ...
└── go
   ├── cobaltspeech
   │ ├── ...
   │   └── voicebio
   │      └── v1
   │        ├── voicebio_grpc.pb.go
   │        └── voicebio.pb.go
   └── gw
     └── cobaltspeech
       ├── ...
       └── voicebio
         └── v1
            └── voicebio.pb.gw.go
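To add languages beyond those configured in the repo, you extend the plugins list in buf.gen.yaml. As an illustrative sketch only (the file shipped in the cobaltspeech/proto repo is the authoritative starting point, and the plugin entries below are examples), a minimal buf.gen.yaml using buf's remote plugins might look like:

```yaml
version: v1
plugins:
  # Python message classes.
  - plugin: buf.build/protocolbuffers/python
    out: gen/py
  # Python gRPC service stubs.
  - plugin: buf.build/grpc/python
    out: gen/py
```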

Step 4. Installing gRPC and protobuf

  • A few gRPC and protobuf dependencies are required along with the code generated above. The method of installing them depends on the programming language being used.
  • These dependencies, and the most common ways of installing / including them, are listed below for some chosen languages.
# It is encouraged to do this inside a Python virtual environment
# to avoid creating version conflicts for other scripts that may
# be using these libraries.
pip install --upgrade protobuf
pip install --upgrade grpcio
pip install --upgrade google-api-python-client
go get google.golang.org/protobuf
go get google.golang.org/grpc
go get google.golang.org/genproto
# More details on grpc installation can be found at:
# https://grpc.io/docs/languages/cpp/quickstart/

COBALT="${HOME}/cobalt"
mkdir -p "${COBALT}/git"

# Latest version as of March 14th, 2023.
VERSION="v1.52.0"
GRPC_REPO="${COBALT}/git/grpc-${VERSION}"

git clone \
 --recurse-submodules --depth 1 --shallow-submodules \
 -b "${VERSION}" \
 https://github.com/grpc/grpc ${GRPC_REPO}

cd "${GRPC_REPO}"
mkdir -p cmake/build

# Change this to where you want to install libprotobuf and libgrpc.
# It is encouraged to install gRPC locally, as there is no easy way to
# uninstall gRPC after you’ve installed it globally.
INSTALL_DIR="${COBALT}"

cd cmake/build
cmake \
 -DgRPC_INSTALL=ON \
 -DgRPC_BUILD_TESTS=OFF \
 -DCMAKE_INSTALL_PREFIX=${INSTALL_DIR} \
 ../..

make -j
make install

2.3 - Connecting to the Server

Describes how to connect to a running Cobalt VoiceBio server instance.
  • Once you have your VoiceBio server up and running, and have installed or generated the SDK for your project, you can connect to a running instance of VoiceBio server by “dialing” a gRPC connection.

  • First, you need to know the address where the server is running: e.g. host:grpc_port. By default, this is localhost:2727 and should be logged to the terminal when you first start VoiceBio server as grpcAddr:

2023/08/14 10:49:38 info  {"license":"Copyright © 2023--present. Cobalt Speech and Language, Inc.  For additional details, including information about open source components used in this software, please see the COPYING file bundled with this program."}
2023/08/14 10:49:38 info  {"msg":"reading config file","path":"configs/voicebio-server.config.toml"}
2023/08/14 10:49:38 info  {"msg":"server initializing"}
2023/08/14 10:49:38 info  {"msg":"license verified"}
2023/08/14 10:49:41 info  {"msg":"runtime initialized","model_count":"2","init_time_taken":"2.512935646s"}
2023/08/14 10:49:41 info  {"msg":"server started","grpcAddr":"[::]:2727","httpApiAddr":"[::]:8080","httpOpsAddr":"[::]:8081"}

Default Connection

The following code snippet connects to the server and queries its version. It connects to the server using an “insecure” gRPC channel. This would be the case if you have just started up a local instance of VoiceBio server without TLS enabled.

import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
package main

import (
	"context"
	"fmt"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebiopb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress  = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebiopb.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebiopb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)
}

Connect with TLS

  • In our recommended setup for deployment, TLS is enabled in the gRPC connection, and when connecting to the server, clients validate the server’s SSL certificate to make sure they are talking to the right party. This is similar to how “https” connections work in web browsers.

  • The following snippets show how to connect to a VoiceBio Server that has TLS enabled. They use Cobalt’s self-hosted demo server at demo.cobaltspeech.com:2727, but you would of course use your own server instance.

import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "demo.cobaltspeech.com:2727"

# Setup a gRPC connection with TLS. You can optionally provide your own
# root certificates and private key to grpc.ssl_channel_credentials()
# for mutually authenticated TLS.
creds = grpc.ssl_channel_credentials()
channel = grpc.secure_channel(serverAddress, creds)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"os"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"

	voicebiopb "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress  = "demo.cobaltspeech.com:2727"
		connectTimeout = 10 * time.Second
	)

	// Setup a gRPC connection with TLS. You can optionally provide your own
	// root certificates and private key through tls.Config for mutually
	// authenticated TLS.
	tlsCfg := tls.Config{}
	creds := credentials.NewTLS(&tlsCfg)

	ctx, cancel := context.WithTimeout(context.Background(), connectTimeout)
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(creds),
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebiopb.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebiopb.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)
}

Client Authentication

  • In some setups, it may be desired that the server should also validate clients connecting to it and only respond to the ones it can verify. If your VoiceBio server is configured to do client authentication, you will need to present the appropriate certificate and key when connecting to it.

  • Please note that in the client-authentication mode, the client will still also verify the server’s certificate, and therefore this setup uses mutually authenticated TLS.

  • The following snippets show how to present client certificates when setting up the credentials. These could then be used in the same way as the examples above to connect to a TLS enabled server.

creds = grpc.ssl_channel_credentials(
  root_certificates=root_certificates,  # PEM certificate as byte string
  private_key=private_key,              # PEM client key as byte string 
  certificate_chain=certificate_chain,  # PEM client certificate as byte string
)
package main

import (
	// ...

	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"

	// ...
)

func main() {
	// ...

	// Root PEM certificate for validating self-signed server certificate
	var rootCert []byte

	// Client PEM certificate and private key.
	var certPem, keyPem []byte

	caCertPool := x509.NewCertPool()
	if ok := caCertPool.AppendCertsFromPEM(rootCert); !ok {
		fmt.Printf("unable to use given caCert\n")
		os.Exit(1)
	}

	clientCert, err := tls.X509KeyPair(certPem, keyPem)
	if err != nil {
		fmt.Printf("unable to use given client certificate and key: %v\n", err)
		os.Exit(1)
	}

	tlsCfg := tls.Config{
		RootCAs:      caCertPool,
		Certificates: []tls.Certificate{clientCert},
	}

	creds := credentials.NewTLS(&tlsCfg)

	// ...
}

2.4 - Streaming Enrollment

Describes how to stream audio to VoiceBio server for enrollment.
  • The following example shows how to stream audio using VoiceBio’s StreamingEnroll request and generate a voiceprint. The stream can come from a file on disk or be directly from a microphone in real time.

Streaming from an audio file

  • We support several headered file formats including WAV, MP3, FLAC etc. For more details, please see the protocol buffer specification here. For best accuracy, it is recommended to use an uncompressed or losslessly compressed audio format like WAV or FLAC.

  • The examples below use a WAV file as input. We will query the server for available models and use the first model to generate the voiceprint.

  • Generated voiceprints can be updated and made more robust by re-enrolling them with additional audio. Please see the re-enrollment section.
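Before streaming a file, it can be useful to sanity-check its format against the model's expected sample rate. The helper below, using Python's standard wave module, is illustrative only and not part of the VoiceBio SDK:

```python
import wave


def describe_wav(path):
    """Return (sample_rate, channels, duration_in_seconds) for a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / float(rate)
    return rate, channels, duration
```

For example, comparing the returned sample rate against the selected model's attributes before enrolling can catch mismatched audio early.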

import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id

# Set the enrollment config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = voicebio.EnrollmentConfig(
    model_id=modelID,
    previous_voiceprint=None,
)

# The first request to the server should only contain the
# configuration. Subsequent requests should contain audio
# bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield voicebio.StreamingEnrollRequest(config=cfg)
    
    data = audio.read(bufferSize)
    while len(data) > 0:
        yield voicebio.StreamingEnrollRequest(audio=voicebio.Audio(data=data))
        data = audio.read(bufferSize)

# Streaming audio to the server.
with open("test.wav", "rb") as audio:
  result = client.StreamingEnroll(stream(cfg, audio))

# A certain minimum duration of speech is required for completing enrollment.
# The enrollment status contains information on whether that has been met or
# whether additional audio is required.
print(f"Enrollment Status:\n{result.enrollment_status}\n")

# Saving the voiceprint data to a file. This can be provided again
# in another StreamingEnroll request (for continuing enrollment) or
# submitted for verification / identification requests.
with open("voiceprint.bin", 'w') as f:
  f.write(result.voiceprint.data)
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Selecting the first model.
	cfg := &voicebio.EnrollmentConfig{
		ModelId:            modelResp.Models[0].Id,
		PreviousVoiceprint: nil,
	}

	// Opening audio file.
	audio, err := os.Open("test.wav")
	if err != nil {
		fmt.Printf("failed to open audio file: %v\n", err)
		os.Exit(1)
	}

	defer audio.Close()

	// Starting enrollment.
	result, err := StreamingEnroll(ctx, client, cfg, audio)
	if err != nil {
		fmt.Printf("failed to run streaming enrollment: %v\n", err)
		os.Exit(1)
	}

	// A certain minimum duration of speech is required for completing enrollment.
	// The enrollment status contains information on whether that has been met or
	// whether additional audio is required.
	fmt.Printf("Enrollment Status: %v\n", result.EnrollmentStatus)

	// Saving the voiceprint data to a file. This can be provided again
	// in another StreamingEnroll request (for continuing enrollment) or
	// submitted for verification / identification requests.
	if err := os.WriteFile("voiceprint.bin", []byte(result.Voiceprint.Data), os.ModePerm); err != nil {
		fmt.Printf("failed to write voiceprint data: %v\n", err)
		os.Exit(1)
	}
}

// StreamingEnroll wraps the streaming API for performing speaker enrollment
// (i.e. voiceprint generation) using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingEnroll(
	ctx context.Context,
	client voicebio.VoiceBioServiceClient,
	cfg *voicebio.EnrollmentConfig,
	audio io.Reader,
) (*voicebio.StreamingEnrollResponse, error) {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingEnroll(ctx)
	if err != nil {
		return nil, err
	}

	// Sending audio.
	if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
		// if sendAudio encountered io.EOF, it's only a
		// notification that the stream has closed.  The actual
		// status will be obtained in the CloseAndRecv call. We
		// therefore return on non-EOF errors here.
		return nil, err
	}

	// Returning result.
	return stream.CloseAndRecv()
}

// sendAudio sends the config and audio to a stream.
func sendAudio(
	stream voicebio.VoiceBioService_StreamingEnrollClient,
	cfg *voicebio.EnrollmentConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the config.
	if err := stream.Send(&voicebio.StreamingEnrollRequest{
		Request: &voicebio.StreamingEnrollRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&voicebio.StreamingEnrollRequest{
				Request: &voicebio.StreamingEnrollRequest_Audio{
					Audio: &voicebio.Audio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio. In either case, we need to CloseSend and
			// return the appropriate error from the function.
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}

			if err != io.EOF {
				return err
			}

			return nil
		}
	}
}

Streaming from microphone

  • Streaming audio from microphone input requires a reader interface that can provide audio samples recorded from a microphone; typically this requires interaction with system libraries. Another option is to use an external command line tool like sox to record and pipe audio into the client.

  • The examples below take the latter approach, using the rec command provided with sox to record and stream the audio.

#!/usr/bin/env python3

# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.

import subprocess
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id

# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = voicebio.EnrollmentConfig(
    model_id=modelID,
    previous_voiceprint=None,
    audio_format=voicebio.AudioFormat(
      audio_format_raw=voicebio.AudioFormatRAW(
        encoding="AUDIO_ENCODING_SIGNED",
        bit_depth=16,
        byte_order="BYTE_ORDER_LITTLE_ENDIAN",
        sample_rate=m.attributes.sample_rate,
        channels=1,
      )
    ),
)

# Open microphone stream using sox's rec command and record
# audio using the config specified above for *10 seconds*.
maxDuration = 10
cmd = f"rec -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 - trim 0 {maxDuration}"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout

try:
    _ = audio.read(1024)  # Trying to read some bytes as a sanity check.
except Exception as err:
    print(f"[ERROR] failed to read audio from mic stream: {err}")

print(f"\n[INFO] recording {maxDuration} seconds of audio from the microphone ...\n")

# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield voicebio.StreamingEnrollRequest(config=cfg)

    data = audio.read(bufferSize)
    while len(data) > 0:
        yield voicebio.StreamingEnrollRequest(audio=voicebio.Audio(data=data))
        data = audio.read(bufferSize)

# Streaming audio to the server.
result = client.StreamingEnroll(stream(cfg, audio))

# A certain minimum duration of speech is required for completing enrollment.
# The enrollment status contains information on whether that has been met or
# whether additional audio is required.
print(f"Enrollment Status:\n{result.enrollment_status}\n")

# Saving the voiceprint data to a file. This can be provided again
# in another StreamingEnroll request (for continuing enrollment) or
# submitted for verification / identification requests.
with open("voiceprint.bin", 'w') as f:
  f.write(result.voiceprint.data)

audio.close()
mic.kill()
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"

	"golang.org/x/sync/errgroup"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Selecting first model.
	m := modelResp.Models[0]

	// Setting audio format to be raw 16-bit signed little endian audio samples
	// recorded at the sample rate expected by the model.
	cfg := &voicebio.EnrollmentConfig{
		ModelId:            m.Id,
		PreviousVoiceprint: nil,
		AudioFormat: &voicebio.AudioFormat{AudioFormat: &voicebio.AudioFormat_AudioFormatRaw{
			AudioFormatRaw: &voicebio.AudioFormatRAW{
				Encoding:   voicebio.AudioEncoding_AUDIO_ENCODING_SIGNED,
				SampleRate: m.Attributes.SampleRate,
				BitDepth:   16,
				ByteOrder:  voicebio.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
				Channels:   1,
			},
		},
		},
	}

	// Open microphone stream using sox's rec command and record
	// audio using the config specified above for *10 seconds*.
	maxDuration := 10
	args := fmt.Sprintf("-t raw -r %d -e signed -b 16 -L -c 1 - trim 0 %d", m.Attributes.SampleRate, maxDuration)
	cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)
	cmd.Stderr = os.Stderr

	audio, err := cmd.StdoutPipe()
	if err != nil {
		fmt.Printf("failed to open microphone stream: %v\n", err)
		os.Exit(1)
	}

	// Starting routines to record from microphone and stream to server
	// using an errgroup.Group that returns if either one encounters an error.
	eg, ctx := errgroup.WithContext(ctx)

	eg.Go(func() error {
		fmt.Printf("\n[INFO] recording %d seconds from microphone \n", maxDuration)

		if err := cmd.Run(); err != nil {
			return fmt.Errorf("record from microphone: %w", err)
		}

		return nil
	})

	// Starting enrollment.
	result, err := StreamingEnroll(ctx, client, cfg, audio)
	if err != nil {
		fmt.Printf("failed to run streaming enrollment: %v\n", err)
		os.Exit(1)
	}

	if err := eg.Wait(); err != nil {
		fmt.Printf("%v\n", err)
		os.Exit(1)
	}

	// A certain minimum duration of speech is required for completing enrollment.
	// The enrollment status contains information on whether that has been met or
	// whether additional audio is required.
	fmt.Printf("Enrollment Status: %v\n", result.EnrollmentStatus)

	// Saving the voiceprint data to a file. This can be provided again
	// in another StreamingEnroll request (for continuing enrollment) or
	// submitted for verification / identification requests.
	if err := os.WriteFile("voiceprint.bin", []byte(result.Voiceprint.Data), os.ModePerm); err != nil {
		fmt.Printf("failed to write voiceprint data: %v\n", err)
		os.Exit(1)
	}
}

// StreamingEnroll wraps the streaming API for performing speaker enrollment
// (i.e. voiceprint generation) using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to VoiceBio
// server. The default buffer size may be overridden using Options when creating
// the Client.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingEnroll(
	ctx context.Context,
	client voicebio.VoiceBioServiceClient,
	cfg *voicebio.EnrollmentConfig,
	audio io.Reader,
) (*voicebio.StreamingEnrollResponse, error) {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingEnroll(ctx)
	if err != nil {
		return nil, err
	}

	// Sending audio.
	if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
		// if sendAudio encountered io.EOF, it's only a
		// notification that the stream has closed.  The actual
		// status will be obtained in the CloseAndRecv call. We
		// therefore return on non-EOF errors here.
		return nil, err
	}

	// Returning result.
	return stream.CloseAndRecv()
}

// sendAudio sends audio to a stream.
func sendAudio(
	stream voicebio.VoiceBioService_StreamingEnrollClient,
	cfg *voicebio.EnrollmentConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the config.
	if err := stream.Send(&voicebio.StreamingEnrollRequest{
		Request: &voicebio.StreamingEnrollRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&voicebio.StreamingEnrollRequest{
				Request: &voicebio.StreamingEnrollRequest_Audio{
					Audio: &voicebio.Audio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio. In either case, we need to CloseSend and
			// return the appropriate error from the function.
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}

			if err != io.EOF {
				return err
			}

			return nil
		}
	}
}

Re-enrollment

  • Voiceprints can be updated and made more robust by re-enrolling them with additional audio. This can be easily done by providing previous voiceprint data in the EnrollmentConfig along with additional audio in a new StreamingEnroll request.
# Connect to server ...

with open("voiceprint.bin", 'r') as f:
  voiceprint = f.read().strip()

cfg = voicebio.EnrollmentConfig(
  model_id=modelID,
  previous_voiceprint=voicebio.Voiceprint(data=voiceprint),
)

# Send audio to server ...
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {

	// Connect to server ...

	// Reading old voiceprint data.
	data, err := os.ReadFile("voiceprint.bin")
	if err != nil {
		fmt.Printf("\nfailed to read voiceprint data: %v\n", err)
		os.Exit(1)
	}

	cfg := &voicebio.EnrollmentConfig{
		ModelId:            modelResp.Models[0].Id,
		PreviousVoiceprint: &voicebio.Voiceprint{Data: string(data)},
	}

	// Send audio to server ...
}

2.5 - Streaming Verification

Describes how to stream audio to VoiceBio server for verification against a voiceprint.
  • The following example shows how to stream audio using VoiceBio’s StreamingVerify request and verify whether the audio matches the provided voiceprint. The stream can come from a file on disk or be directly from a microphone in real time.

Streaming from an audio file

  • We support several headered file formats including WAV, MP3, FLAC etc. For more details, please see the protocol buffer specification here. For best accuracy, it is recommended to use an uncompressed or losslessly compressed audio format like WAV or FLAC.

  • The examples below use a WAV file as input. We will query the server for available models and use the first model to score and verify given audio against a given voiceprint.

import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id

# Loading reference voiceprint.
with open("voiceprint.bin", 'r') as f:
    voiceprint = voicebio.Voiceprint(data=f.read().strip())

# Set the verification config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = voicebio.VerificationConfig(
    model_id=modelID,
    voiceprint=voiceprint,
)

# The first request to the server should only contain the
# configuration. Subsequent requests should contain audio
# bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield voicebio.StreamingVerifyRequest(config=cfg)
    
    data = audio.read(bufferSize)
    while len(data) > 0:
        yield voicebio.StreamingVerifyRequest(audio=voicebio.Audio(data=data))
        data = audio.read(bufferSize)

# Streaming audio to the server.
with open("test.wav", "rb") as audio:
  resp = client.StreamingVerify(stream(cfg, audio))

# Server returns a similarity score along with whether the score
# exceeded the server-configured threshold for being a match.
print(f"Verification Score: {resp.result.similarity_score:1.3f}, Match: {resp.result.is_match}")
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Reading voiceprint data.
	data, err := os.ReadFile("voiceprint.bin")
	if err != nil {
		fmt.Printf("\nfailed to read voiceprint data: %v\n", err)
		os.Exit(1)
	}

	// Selecting the first model.
	cfg := &voicebio.VerificationConfig{
		ModelId:    modelResp.Models[0].Id,
		Voiceprint: &voicebio.Voiceprint{Data: string(data)},
	}

	// Opening audio file.
	audio, err := os.Open("test.wav")
	if err != nil {
		fmt.Printf("failed to open audio file: %v\n", err)
		os.Exit(1)
	}

	defer audio.Close()

	// Starting verification.
	resp, err := StreamingVerify(ctx, client, cfg, audio)
	if err != nil {
		fmt.Printf("failed to run streaming verification: %v\n", err)
		os.Exit(1)
	}

	// Server returns a similarity score along with whether the score
	// exceeded the server-configured threshold for being a match.
	fmt.Printf("Verification Score: %1.3f, Match: %v\n", resp.Result.SimilarityScore, resp.Result.IsMatch)
}

// StreamingVerify wraps the streaming API for performing speaker verification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// VoiceBio server in chunks of streamingBufSize bytes.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingVerify(
	ctx context.Context,
	client voicebio.VoiceBioServiceClient,
	cfg *voicebio.VerificationConfig,
	audio io.Reader,
) (*voicebio.StreamingVerifyResponse, error) {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingVerify(ctx)
	if err != nil {
		return nil, err
	}

	// Sending audio.
	if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
		// if sendAudio encountered io.EOF, it's only a
		// notification that the stream has closed.  The actual
		// status will be obtained in the CloseAndRecv call. We
		// therefore return on non-EOF errors here.
		return nil, err
	}

	// Returning result.
	return stream.CloseAndRecv()
}

// sendAudio sends the config and audio to a stream.
func sendAudio(
	stream voicebio.VoiceBioService_StreamingVerifyClient,
	cfg *voicebio.VerificationConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the config.
	if err := stream.Send(&voicebio.StreamingVerifyRequest{
		Request: &voicebio.StreamingVerifyRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&voicebio.StreamingVerifyRequest{
				Request: &voicebio.StreamingVerifyRequest_Audio{
					Audio: &voicebio.Audio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio. In either case, we need to CloseSend and then
			// return the appropriate result from the function.
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}

			if err != io.EOF {
				return err
			}

			return nil
		}
	}
}

Streaming from microphone

  • Streaming audio from microphone input requires a reader interface that can provide audio samples recorded from a microphone; typically this requires interaction with system libraries. Another option is to use an external command line tool like sox to record and pipe audio into the client.

  • The examples below use the latter approach by using the rec command provided with sox to record and stream the audio.

#!/usr/bin/env python3

# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.

import subprocess
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id

# Loading reference voiceprint.
with open("voiceprint.bin", 'r') as f:
    voiceprint = voicebio.Voiceprint(data=f.read().strip())

# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = voicebio.VerificationConfig(
    model_id=modelID,
    voiceprint=voiceprint,
    audio_format=voicebio.AudioFormat(
      audio_format_raw=voicebio.AudioFormatRAW(
        encoding="AUDIO_ENCODING_SIGNED",
        bit_depth=16,
        byte_order="BYTE_ORDER_LITTLE_ENDIAN",
        sample_rate=m.attributes.sample_rate,
        channels=1,
      )
    ),
)

# Open microphone stream using sox's rec command and record
# audio using the config specified above for *10 seconds*.
maxDuration = 10
cmd = f"rec -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 - trim 0 {maxDuration}"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout

try:
    _ = audio.read(1024)  # Trying to read some bytes as a sanity check.
except Exception as err:
    print(f"[ERROR] failed to read audio from mic stream: {err}")

print(f"\n[INFO] recording {maxDuration} seconds of audio from the microphone ... \n")

# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield voicebio.StreamingVerifyRequest(config=cfg)

    data = audio.read(bufferSize)
    while len(data) > 0:
        yield voicebio.StreamingVerifyRequest(audio=voicebio.Audio(data=data))
        data = audio.read(bufferSize)

# Streaming audio to the server.
resp = client.StreamingVerify(stream(cfg, audio))

# Server returns a similarity score along with whether the score
# exceeded the server-configured threshold for being a match.
print(f"Verification Score: {resp.result.similarity_score:1.3f}, Match: {resp.result.is_match}")

audio.close()
mic.kill()
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"

	"golang.org/x/sync/errgroup"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Selecting the first model.
	m := modelResp.Models[0]

	// Reading voiceprint data.
	data, err := os.ReadFile("voiceprint.bin")
	if err != nil {
		fmt.Printf("\nfailed to read voiceprint data: %v\n", err)
		os.Exit(1)
	}

	// Setting audio format to be raw 16-bit signed little endian audio samples
	// recorded at the sample rate expected by the model.
	cfg := &voicebio.VerificationConfig{
		ModelId:    m.Id,
		Voiceprint: &voicebio.Voiceprint{Data: string(data)},
		AudioFormat: &voicebio.AudioFormat{AudioFormat: &voicebio.AudioFormat_AudioFormatRaw{
			AudioFormatRaw: &voicebio.AudioFormatRAW{
				Encoding:   voicebio.AudioEncoding_AUDIO_ENCODING_SIGNED,
				SampleRate: m.Attributes.SampleRate,
				BitDepth:   16,
				ByteOrder:  voicebio.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
				Channels:   1,
			},
		}},
	}

	// Open microphone stream using sox's rec command and record
	// audio using the config specified above for *10 seconds*.
	maxDuration := 10
	args := fmt.Sprintf("-t raw -r %d -e signed -b 16 -L -c 1 - trim 0 %d", m.Attributes.SampleRate, maxDuration)
	cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)
	cmd.Stderr = os.Stderr

	audio, err := cmd.StdoutPipe()
	if err != nil {
		fmt.Printf("failed to open microphone stream: %v\n", err)
		os.Exit(1)
	}

	// Starting routines to record from microphone and stream to server
	// using an errgroup.Group that returns if either one encounters an error.
	eg, ctx := errgroup.WithContext(ctx)

	eg.Go(func() error {
		fmt.Printf("\n[INFO] recording %d seconds from microphone \n", maxDuration)

		if err := cmd.Run(); err != nil {
			return fmt.Errorf("record from microphone: %w", err)
		}

		return nil
	})

	// Starting verification.
	resp, err := StreamingVerify(ctx, client, cfg, audio)
	if err != nil {
		fmt.Printf("failed to run streaming verification: %v\n", err)
		os.Exit(1)
	}

	// Wait for the recording command to finish.
	if err := eg.Wait(); err != nil {
		fmt.Printf("failed to record audio: %v\n", err)
	}

	// Server returns a similarity score along with whether the score
	// exceeded the server-configured threshold for being a match.
	fmt.Printf("Verification Score: %1.3f, Match: %v\n", resp.Result.SimilarityScore, resp.Result.IsMatch)
}

// StreamingVerify wraps the streaming API for performing speaker verification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// VoiceBio server in chunks of streamingBufSize bytes.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingVerify(
	ctx context.Context,
	client voicebio.VoiceBioServiceClient,
	cfg *voicebio.VerificationConfig,
	audio io.Reader,
) (*voicebio.StreamingVerifyResponse, error) {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingVerify(ctx)
	if err != nil {
		return nil, err
	}

	// Sending audio.
	if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
		// if sendAudio encountered io.EOF, it's only a
		// notification that the stream has closed.  The actual
		// status will be obtained in the CloseAndRecv call. We
		// therefore return on non-EOF errors here.
		return nil, err
	}

	// Returning result.
	return stream.CloseAndRecv()
}

// sendAudio sends the config and audio to a stream.
func sendAudio(
	stream voicebio.VoiceBioService_StreamingVerifyClient,
	cfg *voicebio.VerificationConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the config.
	if err := stream.Send(&voicebio.StreamingVerifyRequest{
		Request: &voicebio.StreamingVerifyRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&voicebio.StreamingVerifyRequest{
				Request: &voicebio.StreamingVerifyRequest_Audio{
					Audio: &voicebio.Audio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio. In either case, we need to CloseSend and then
			// return the appropriate result from the function.
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}

			if err != io.EOF {
				return err
			}

			return nil
		}
	}
}

2.6 - Streaming Identification

Describes how to stream audio to VoiceBio server for identification using given voiceprints.
  • The following example shows how to stream audio using VoiceBio’s StreamingIdentify request and identify the speaker in the audio using the provided voiceprints. The stream can come from a file on disk or directly from a microphone in real time.

Streaming from an audio file

  • We support several headered file formats including WAV, MP3, FLAC etc. For more details, please see the protocol buffer specification here. For best accuracy, it is recommended to use an uncompressed or losslessly compressed audio format such as WAV or FLAC.

  • The examples below use a WAV file as input. We will query the server for available models and use the first model to score and identify the given audio against a given set of voiceprints.

import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
modelID = modelResp.models[0].id

# Loading reference voiceprints.
voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
    with open(p, 'r') as f:
        voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))

# Set the identification config. We don't set the audio format and let the
# server auto-detect the format from the file header.
cfg = voicebio.IdentificationConfig(
    model_id=modelID,
    voiceprints=voiceprints,
)

# The first request to the server should only contain the
# configuration. Subsequent requests should contain audio
# bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield voicebio.StreamingIdentifyRequest(config=cfg)
    
    data = audio.read(bufferSize)
    while len(data) > 0:
        yield voicebio.StreamingIdentifyRequest(audio=voicebio.Audio(data=data))
        data = audio.read(bufferSize)

# Streaming audio to the server.
with open("test.wav", "rb") as audio:
    result = client.StreamingIdentify(stream(cfg, audio))

# Server returns the index of the voiceprint that matches the best, a similarity
# score for each voiceprint along with whether the score exceeded the server-configured
# threshold for being a match.
#
# If none of the voiceprints were a good match, the best match index will be negative.
matched = "❌ No Match found"
if result.best_match_index >= 0:
    best_score = result.voiceprint_comparison_results[result.best_match_index].similarity_score
    matched = f"✅ Match found: Index: {result.best_match_index}, Score: {best_score:1.3f}"

print(f"\nIdentification Result:\n")

print("Scores:")
for i, r in enumerate(result.voiceprint_comparison_results):
    print(f"Index: {i}, Score: {r.similarity_score:1.3f}, IsMatch: {r.is_match}")

print(f"\n{matched}")
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Reading voiceprint data.
	voiceprints := make([]*voicebio.Voiceprint, 0)

	for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
		data, err := os.ReadFile(p)
		if err != nil {
			fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
			os.Exit(1)
		}

		voiceprints = append(voiceprints, &voicebio.Voiceprint{Data: string(data)})
	}

	// Selecting the first model.
	cfg := &voicebio.IdentificationConfig{
		ModelId:     modelResp.Models[0].Id,
		Voiceprints: voiceprints,
	}

	// Opening audio file.
	audio, err := os.Open("test.wav")
	if err != nil {
		fmt.Printf("failed to open audio file: %v\n", err)
		os.Exit(1)
	}

	defer audio.Close()

	// Starting identification.
	result, err := StreamingIdentify(ctx, client, cfg, audio)
	if err != nil {
		fmt.Printf("failed to run streaming identification: %v\n", err)
		os.Exit(1)
	}

	// Server returns the index of the voiceprint that matches the best, a similarity
	// score for each voiceprint along with whether the score exceeded the server-configured
	// threshold for being a match.
	//
	// If none of the voiceprints were a good match, the best match index will be negative.
	matched := "❌ No Match found"
	if result.BestMatchIndex >= 0 {
		bestScore := result.VoiceprintComparisonResults[result.BestMatchIndex].SimilarityScore
		matched = fmt.Sprintf("✅ Match found: Index: %d, Score: %1.3f", result.BestMatchIndex, bestScore)
	}

	fmt.Printf("\nIdentification Result:\n")

	fmt.Printf("Scores:\n")
	for i, r := range result.VoiceprintComparisonResults {
		fmt.Printf("Index: %d, Score: %1.3f, IsMatch: %v\n", i, r.SimilarityScore, r.IsMatch)
	}

	fmt.Printf("\n%s\n", matched)
}

// StreamingIdentify wraps the streaming API for performing speaker identification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// VoiceBio server in chunks of streamingBufSize bytes.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingIdentify(
	ctx context.Context,
	client voicebio.VoiceBioServiceClient,
	cfg *voicebio.IdentificationConfig,
	audio io.Reader,
) (*voicebio.StreamingIdentifyResponse, error) {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingIdentify(ctx)
	if err != nil {
		return nil, err
	}

	// Sending audio.
	if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
		// if sendAudio encountered io.EOF, it's only a
		// notification that the stream has closed.  The actual
		// status will be obtained in the CloseAndRecv call. We
		// therefore return on non-EOF errors here.
		return nil, err
	}

	// Returning result.
	return stream.CloseAndRecv()
}

// sendAudio sends the config and audio to a stream.
func sendAudio(
	stream voicebio.VoiceBioService_StreamingIdentifyClient,
	cfg *voicebio.IdentificationConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the config.
	if err := stream.Send(&voicebio.StreamingIdentifyRequest{
		Request: &voicebio.StreamingIdentifyRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&voicebio.StreamingIdentifyRequest{
				Request: &voicebio.StreamingIdentifyRequest_Audio{
					Audio: &voicebio.Audio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio. In either case, we need to CloseSend and then
			// return the appropriate result from the function.
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}

			if err != io.EOF {
				return err
			}

			return nil
		}
	}
}

Streaming from microphone

  • Streaming audio from microphone input requires a reader interface that can provide audio samples recorded from a microphone; typically this requires interaction with system libraries. Another option is to use an external command line tool like sox to record and pipe audio into the client.

  • The examples below use the latter approach by using the rec command provided with sox to record and stream the audio.

#!/usr/bin/env python3

# This example assumes sox is installed on the system and is available
# in the system's PATH variable. Instead of opening a regular file from
# disk, we open a subprocess that executes sox's rec command to record
# audio from the system's default microphone.

import subprocess
import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example.
m = modelResp.models[0]
modelID = m.id

# Loading reference voiceprints.
voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
    with open(p, 'r') as f:
        voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))

# Setting audio format to be raw 16-bit signed little endian audio samples
# recorded at the sample rate expected by the model.
cfg = voicebio.IdentificationConfig(
    model_id=modelID,
    voiceprints=voiceprints,
    audio_format=voicebio.AudioFormat(
      audio_format_raw=voicebio.AudioFormatRAW(
        encoding="AUDIO_ENCODING_SIGNED",
        bit_depth=16,
        byte_order="BYTE_ORDER_LITTLE_ENDIAN",
        sample_rate=m.attributes.sample_rate,
        channels=1,
      )
    ),
)

# Open microphone stream using sox's rec command and record
# audio using the config specified above for *10 seconds*.
maxDuration = 10
cmd = f"rec -t raw -r {m.attributes.sample_rate} -e signed -b 16 -L -c 1 - trim 0 {maxDuration}"
mic = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
audio = mic.stdout

try:
    _ = audio.read(1024)  # Trying to read some bytes as a sanity check.
except Exception as err:
    print(f"[ERROR] failed to read audio from mic stream: {err}")

print(f"\n[INFO] recording {maxDuration} seconds of audio from the microphone ... \n")

# The first request to the server should only contain the
# recognition configuration. Subsequent requests should contain
# audio bytes. We can write a simple generator to do this.
def stream(cfg, audio, bufferSize=1024):
    yield voicebio.StreamingIdentifyRequest(config=cfg)

    data = audio.read(bufferSize)
    while len(data) > 0:
        yield voicebio.StreamingIdentifyRequest(audio=voicebio.Audio(data=data))
        data = audio.read(bufferSize)

# Streaming audio to the server.
result = client.StreamingIdentify(stream(cfg, audio))

# Server returns the index of the voiceprint that matches the best, a similarity
# score for each voiceprint along with whether the score exceeded the server-configured
# threshold for being a match.
#
# If none of the voiceprints were a good match, the best match index will be negative.
matched = "❌ No Match found"
if result.best_match_index >= 0:
    best_score = result.voiceprint_comparison_results[result.best_match_index].similarity_score
    matched = f"✅ Match found: Index: {result.best_match_index}, Score: {best_score:1.3f}"

print(f"\nIdentification Result:\n")

print("Scores:")
for i, r in enumerate(result.voiceprint_comparison_results):
    print(f"Index: {i}, Score: {r.similarity_score:1.3f}, IsMatch: {r.is_match}")

print(f"\n{matched}")

audio.close()
mic.kill()
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"

	"golang.org/x/sync/errgroup"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Selecting the first model.
	m := modelResp.Models[0]

	// Reading voiceprint data.
	voiceprints := make([]*voicebio.Voiceprint, 0)

	for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
		data, err := os.ReadFile(p)
		if err != nil {
			fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
			os.Exit(1)
		}

		voiceprints = append(voiceprints, &voicebio.Voiceprint{Data: string(data)})
	}

	// Setting audio format to be raw 16-bit signed little endian audio samples
	// recorded at the sample rate expected by the model.
	cfg := &voicebio.IdentificationConfig{
		ModelId:     m.Id,
		Voiceprints: voiceprints,
		AudioFormat: &voicebio.AudioFormat{AudioFormat: &voicebio.AudioFormat_AudioFormatRaw{
			AudioFormatRaw: &voicebio.AudioFormatRAW{
				Encoding:   voicebio.AudioEncoding_AUDIO_ENCODING_SIGNED,
				SampleRate: m.Attributes.SampleRate,
				BitDepth:   16,
				ByteOrder:  voicebio.ByteOrder_BYTE_ORDER_LITTLE_ENDIAN,
				Channels:   1,
			},
		}},
	}

	// Open microphone stream using sox's rec command and record
	// audio using the config specified above for *10 seconds*.
	maxDuration := 10
	args := fmt.Sprintf("-t raw -r %d -e signed -b 16 -L -c 1 - trim 0 %d", m.Attributes.SampleRate, maxDuration)
	cmd := exec.CommandContext(ctx, "rec", strings.Fields(args)...)
	cmd.Stderr = os.Stderr

	audio, err := cmd.StdoutPipe()
	if err != nil {
		fmt.Printf("failed to open microphone stream: %v\n", err)
		os.Exit(1)
	}

	// Starting routines to record from microphone and stream to server
	// using an errgroup.Group that returns if either one encounters an error.
	eg, ctx := errgroup.WithContext(ctx)

	eg.Go(func() error {
		fmt.Printf("\n[INFO] recording %d seconds from microphone \n", maxDuration)

		if err := cmd.Run(); err != nil {
			return fmt.Errorf("record from microphone: %w", err)
		}

		return nil
	})

	// Starting identification.
	result, err := StreamingIdentify(ctx, client, cfg, audio)
	if err != nil {
		fmt.Printf("failed to run streaming identification: %v\n", err)
		os.Exit(1)
	}

	// Wait for the recording command to finish.
	if err := eg.Wait(); err != nil {
		fmt.Printf("failed to record audio: %v\n", err)
	}

	// Server returns the index of the voiceprint that matches the best, a similarity
	// score for each voiceprint along with whether the score exceeded the server-configured
	// threshold for being a match.
	//
	// If none of the voiceprints were a good match, the best match index will be negative.
	matched := "❌ No Match found"
	if result.BestMatchIndex >= 0 {
		bestScore := result.VoiceprintComparisonResults[result.BestMatchIndex].SimilarityScore
		matched = fmt.Sprintf("✅ Match found: Index: %d, Score: %1.3f", result.BestMatchIndex, bestScore)
	}

	fmt.Printf("\nIdentification Result:\n")

	fmt.Printf("Scores:\n")
	for i, r := range result.VoiceprintComparisonResults {
		fmt.Printf("Index: %d, Score: %1.3f, IsMatch: %v\n", i, r.SimilarityScore, r.IsMatch)
	}

	fmt.Printf("\n%s\n", matched)
}

// StreamingIdentify wraps the streaming API for performing speaker identification
// using the given cfg.
//
// Data is read from the given audio reader into a buffer and streamed to the
// VoiceBio server in chunks of streamingBufSize bytes.
//
// If any error occurs while reading the audio or sending it to the server, this
// method will immediately exit, returning that error.
func StreamingIdentify(
	ctx context.Context,
	client voicebio.VoiceBioServiceClient,
	cfg *voicebio.IdentificationConfig,
	audio io.Reader,
) (*voicebio.StreamingIdentifyResponse, error) {
	const (
		streamingBufSize = 1024
	)

	// Creating stream.
	stream, err := client.StreamingIdentify(ctx)
	if err != nil {
		return nil, err
	}

	// Sending audio.
	if err := sendAudio(stream, cfg, audio, streamingBufSize); err != nil && !errors.Is(err, io.EOF) {
		// if sendAudio encountered io.EOF, it's only a
		// notification that the stream has closed.  The actual
		// status will be obtained in the CloseAndRecv call. We
		// therefore return on non-EOF errors here.
		return nil, err
	}

	// Returning result.
	return stream.CloseAndRecv()
}

// sendAudio sends the config and audio to a stream.
func sendAudio(
	stream voicebio.VoiceBioService_StreamingIdentifyClient,
	cfg *voicebio.IdentificationConfig,
	audio io.Reader,
	bufSize uint32,
) error {
	// The first message needs to be a config message, and all subsequent
	// messages must be audio messages.

	// Send the config.
	if err := stream.Send(&voicebio.StreamingIdentifyRequest{
		Request: &voicebio.StreamingIdentifyRequest_Config{Config: cfg},
	}); err != nil {
		// if this failed, we don't need to CloseSend
		return err
	}

	// Stream the audio.
	buf := make([]byte, bufSize)
	for {
		n, err := audio.Read(buf)
		if n > 0 {
			if err2 := stream.Send(&voicebio.StreamingIdentifyRequest{
				Request: &voicebio.StreamingIdentifyRequest_Audio{
					Audio: &voicebio.Audio{Data: buf[:n]},
				},
			}); err2 != nil {
				// if we couldn't Send, the stream has
				// encountered an error and we don't need to
				// CloseSend.
				return err2
			}
		}

		if err != nil {
			// err could be io.EOF, or some other error reading from
			// audio.  In either case, we need to CloseSend and then
			// return the appropriate error from the function.
			if err2 := stream.CloseSend(); err2 != nil {
				return err2
			}

			if err != io.EOF {
				return err
			}

			return nil
		}
	}
}

2.7 - Comparing Voiceprints

Describes how to compare pre-extracted voiceprints using VoiceBio’s CompareVoiceprints API.
  • The CompareVoiceprints endpoint allows the user to compare pre-extracted voiceprints and get similarity scores and match results without needing to send audio data.

  • This is useful in cases where the user wants to compare a given voiceprint against a large number of other voiceprints, and sending audio data for each comparison would be inefficient. The client can enroll the voiceprint once using the StreamingEnroll method, and then use this method to compare it against a large number of other voiceprints in batches.

  • The following example shows how to compare pre-extracted voiceprints using VoiceBio’s CompareVoiceprints API, without streaming audio. The voiceprints can be loaded from files on disk or obtained from previous enrollment sessions.

import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example. The model ID should be the same as the one used to
# generate the voiceprints being compared.
modelID = modelResp.models[0].id

# Loading reference voiceprints.
reference_voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
    with open(p, 'r') as f:
        reference_voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))

# Load the target voiceprint that we want to compare against the reference voiceprints.
with open("unknown.bin", 'r') as f:
    target_voiceprint = voicebio.Voiceprint(data=f.read().strip())

# Set the comparison config.
req = voicebio.CompareVoiceprintsRequest(
    model_id=modelID,
    target_voiceprint=target_voiceprint,
    reference_voiceprints=reference_voiceprints,
)

# Compare voiceprints.
result = client.CompareVoiceprints(req)

# Server returns the index of the voiceprint that matches the best, a similarity
# score for each voiceprint along with whether the score exceeded the server-configured
# threshold for being a match.
#
# If none of the voiceprints were a good match, the best match index will be negative.
matched = "❌ No Match found"
if result.best_match_index >= 0:
    best_score = result.voiceprint_comparison_results[result.best_match_index].similarity_score
    matched = f"✅ Match found: Index: {result.best_match_index}, Score: {best_score:1.3f}"

print(f"\nComparison Result:\n")

print("Scores:")
for i, r in enumerate(result.voiceprint_comparison_results):
    print(f"Index: {i}, Score: {r.similarity_score:1.3f}, IsMatch: {r.is_match}")

print(f"\n{matched}")
package main

import (
	"context"
	"fmt"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get the list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Reading voiceprint data.
	referenceVoiceprints := make([]*voicebio.Voiceprint, 0)

	for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
		data, err := os.ReadFile(p)
		if err != nil {
			fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
			os.Exit(1)
		}

		referenceVoiceprints = append(referenceVoiceprints, &voicebio.Voiceprint{Data: string(data)})
	}

	// Load the target voiceprint that we want to compare against the reference voiceprints.
	data, err := os.ReadFile("unknown.bin")
	if err != nil {
		fmt.Printf("failed to read target voiceprint data: %v\n", err)
		os.Exit(1)
	}

	targetVoiceprint := &voicebio.Voiceprint{Data: string(data)}

	// Selecting the first model. The model ID should be the same as the one used to generate the
	// voiceprints being compared.
	req := &voicebio.CompareVoiceprintsRequest{
		ModelId:              modelResp.Models[0].Id,
		TargetVoiceprint:     targetVoiceprint,
		ReferenceVoiceprints: referenceVoiceprints,
	}

	// Compare voiceprints.
	result, err := client.CompareVoiceprints(ctx, req)
	if err != nil {
		fmt.Printf("failed to compare voiceprints: %v\n", err)
		os.Exit(1)
	}

	// Server returns the index of the voiceprint that matches the best, a similarity
	// score for each voiceprint along with whether the score exceeded the server-configured
	// threshold for being a match.
	//
	// If none of the voiceprints were a good match, the best match index will be negative.
	matched := "❌ No Match found"
	if result.BestMatchIndex >= 0 {
		bestScore := result.VoiceprintComparisonResults[result.BestMatchIndex].SimilarityScore
		matched = fmt.Sprintf("✅ Match found: Index: %d, Score: %1.3f", result.BestMatchIndex, bestScore)
	}

	fmt.Printf("\nComparison Result:\n")

	fmt.Printf("Scores:\n")
	for i, r := range result.VoiceprintComparisonResults {
		fmt.Printf("Index: %d, Score: %1.3f, IsMatch: %v\n", i, r.SimilarityScore, r.IsMatch)
	}

	fmt.Printf("\n%s\n", matched)
}

2.8 - Vectorizing Voiceprints

Describes how to convert voiceprints into vector representations using VoiceBio’s VectorizeVoiceprints API.
  • Voiceprints can also be vectorized using the VectorizeVoiceprints API, which returns a vector representation of each voiceprint that can be used for downstream tasks such as clustering, custom scoring, other machine learning models or even semantic searching in vector databases.

  • See the API reference for more details.

  • The following example shows how to use the VectorizeVoiceprints API to vectorize voiceprints. The voiceprints can be loaded from files on disk or obtained from previous enrollment sessions.

import numpy as np

import grpc
import cobaltspeech.voicebio.v1.voicebio_pb2_grpc as stub
import cobaltspeech.voicebio.v1.voicebio_pb2 as voicebio

serverAddress = "localhost:2727"

# Using a channel without TLS enabled.
channel = grpc.insecure_channel(serverAddress)
client = stub.VoiceBioServiceStub(channel)

# Get server version.
versionResp = client.Version(voicebio.VersionRequest())
print(versionResp)

# Get list of models on the server.
modelResp = client.ListModels(voicebio.ListModelsRequest())

print("Models:")
for model in modelResp.models:
    print(model)

# Select a model ID from the list above. Going with the first model
# in this example. The model ID should be the same as the one used to
# generate the voiceprints being vectorized.
modelID = modelResp.models[0].id

# Loading voiceprints.
voiceprints = []
for p in ["user1.bin", "user2.bin", "user3.bin"]:
    with open(p, 'r') as f:
        voiceprints.append(voicebio.Voiceprint(data=f.read().strip()))

# Set the vectorization config.
req = voicebio.VectorizeVoiceprintsRequest(
    model_id=modelID,
    voiceprints=voiceprints,
)

# Vectorize voiceprints.
result = client.VectorizeVoiceprints(req)

# The server returns a list of vectorized voiceprints in the same order as the input voiceprints.
#
# In most cases, the vectorized voiceprints can be compared using simple distance metrics such as
# cosine similarity or euclidean distance. This is not guaranteed, however, and depends on the model
# used to generate the voiceprints and vectorize them.

# Example using cosine similarity.
n = len(result.voiceprints)
similarity = np.zeros((n, n), dtype=np.float32)

for i, vi in enumerate(result.voiceprints):
    for j, vj in enumerate(result.voiceprints):
        similarity[i, j] = np.dot(vi.data, vj.data) / (np.linalg.norm(vi.data) * np.linalg.norm(vj.data))

print("Cosine Similarity Matrix:")
print(similarity)
package main

import (
	"context"
	"fmt"
	"math"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	voicebio "github.com/cobaltspeech/go-genproto/cobaltspeech/voicebio/v1"
)

func main() {
	const (
		serverAddress = "localhost:2727"
	)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	opts := []grpc.DialOption{
		grpc.WithTransportCredentials(insecure.NewCredentials()), // Using a channel without TLS enabled.
		grpc.WithBlock(),
		grpc.WithReturnConnectionError(),
		grpc.FailOnNonTempDialError(true),
	}

	conn, err := grpc.DialContext(ctx, serverAddress, opts...)
	if err != nil {
		fmt.Printf("failed to dial gRPC connection: %v\n", err)
		os.Exit(1)
	}

	client := voicebio.NewVoiceBioServiceClient(conn)

	// Get server version.
	versionResp, err := client.Version(ctx, &voicebio.VersionRequest{})
	if err != nil {
		fmt.Printf("failed to get server version: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("%v\n", versionResp)

	// Get the list of models on the server.
	modelResp, err := client.ListModels(ctx, &voicebio.ListModelsRequest{})
	if err != nil {
		fmt.Printf("failed to get model list: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("Models:")
	for _, m := range modelResp.Models {
		fmt.Println(m)
	}
	fmt.Println()

	// Reading voiceprint data.
	voiceprints := make([]*voicebio.Voiceprint, 0)

	for i, p := range []string{"user1.bin", "user2.bin", "user3.bin"} {
		data, err := os.ReadFile(p)
		if err != nil {
			fmt.Printf("\nfailed to read voiceprint[%d] data: %v\n", i, err)
			os.Exit(1)
		}

		voiceprints = append(voiceprints, &voicebio.Voiceprint{Data: string(data)})
	}

	// Selecting the first model. The model ID should be the same as the one used to generate the
	// voiceprints being vectorized.
	req := &voicebio.VectorizeVoiceprintsRequest{
		ModelId:     modelResp.Models[0].Id,
		Voiceprints: voiceprints,
	}

	// Vectorize voiceprints.
	result, err := client.VectorizeVoiceprints(ctx, req)
	if err != nil {
		fmt.Printf("failed to vectorize voiceprints: %v\n", err)
		os.Exit(1)
	}

	// The server returns a list of vectorized voiceprints in the same order as the input voiceprints.
	//
	// In most cases, the vectorized voiceprints can be compared using simple distance metrics such as
	// cosine similarity or euclidean distance. This is not guaranteed, however, and depends on the model
	// used to generate the voiceprints and vectorize them.

	// Example using cosine similarity.
	n := len(result.Voiceprints)
	similarity := make([][]float32, n)
	for i := range similarity {
		similarity[i] = make([]float32, n)
	}

	for i, vi := range result.Voiceprints {
		for j, vj := range result.Voiceprints {
			dotProduct := float32(0.0)
			normVi := float32(0.0)
			normVj := float32(0.0)

			for k := range vi.Data {
				dotProduct += vi.Data[k] * vj.Data[k]
				normVi += vi.Data[k] * vi.Data[k]
				normVj += vj.Data[k] * vj.Data[k]
			}

			denom := float32(math.Sqrt(float64(normVi)) * math.Sqrt(float64(normVj)))
			similarity[i][j] = dotProduct / denom
		}
	}

	fmt.Printf("Cosine Similarity Matrix:\n")
	for i := range similarity {
		for j := range similarity[i] {
			fmt.Printf("%1.3f ", similarity[i][j])
		}

		fmt.Println()
	}
}

2.9 - API Reference

Detailed reference for API requests and types.

The API is defined as a protobuf spec, so native bindings can be generated in any language with gRPC support. We recommend using buf to generate the bindings.

This section of the documentation is auto-generated from the protobuf spec. The service contains the methods that can be called, and the “messages” are the data structures (objects, classes or structs in the generated code, depending on the language) passed to and from the methods.

VoiceBioService

Service that implements the Cobalt VoiceBio API.

Version

Version(VersionRequest) VersionResponse

Returns version information from the server.

ListModels

ListModels(ListModelsRequest) ListModelsResponse

Returns information about the models available on the server.

StreamingEnroll

StreamingEnroll(StreamingEnrollRequest) StreamingEnrollResponse

Uses new audio data to perform enrollment of new users, or to update enrollment of existing users. Returns a new or updated voiceprint.

Clients should store the returned voiceprint against the ID of the user that provided the audio. This voiceprint can be provided later, with the Verify or Identify requests to match new audio against known speakers.

If this call is used to update an existing user’s voiceprint, the old voiceprint can be discarded; only the new one needs to be stored for that user.
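
The storage pattern described above can be sketched as a simple keyed store in which re-enrollment overwrites the previous voiceprint. This is an illustrative Python sketch; `VoiceprintStore` and its methods are hypothetical helpers, not part of the SDK:

```python
# Minimal sketch of client-side voiceprint storage. Overwriting on
# re-enrollment follows the guidance above: only the newest voiceprint
# returned by StreamingEnroll needs to be kept per user.
class VoiceprintStore:
    def __init__(self):
        self._prints = {}  # user_id -> serialized voiceprint string

    def save(self, user_id, voiceprint_data):
        # Replaces any older voiceprint stored for this user.
        self._prints[user_id] = voiceprint_data

    def load(self, user_id):
        return self._prints.get(user_id)

store = VoiceprintStore()
store.save("alice", "vp-v1")
store.save("alice", "vp-v2")  # re-enrollment: the old print is discarded
```

In a real client, `voiceprint_data` would be the `data` string from the Voiceprint message returned by StreamingEnroll, persisted to a database or file rather than an in-memory dict.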

StreamingVerify

StreamingVerify(StreamingVerifyRequest) StreamingVerifyResponse

Compares audio data against the provided voiceprint and verifies whether or not the audio matches against the voiceprint.

StreamingIdentify

StreamingIdentify(StreamingIdentifyRequest) StreamingIdentifyResponse

Compares audio data against the provided list of voiceprints and identifies which (or none) of the voiceprints is a match for the given audio.

VectorizeVoiceprints

VectorizeVoiceprints(VectorizeVoiceprintsRequest) VectorizeVoiceprintsResponse

Converts the given voiceprints into numerical vector representations that can be used for various downstream tasks such as clustering, visualization, or as input features for other machine learning models. The specific format and dimensionality of these vectors may vary depending on the model used.

CompareVoiceprints

CompareVoiceprints(CompareVoiceprintsRequest) CompareVoiceprintsResponse

Compares pre-extracted voiceprints and returns similarity scores and match results without needing to send audio data. This is useful in cases where the user wants to compare a given voiceprint against a large number of other voiceprints, and sending audio data for each comparison would be inefficient. The client can enroll the voiceprint once using the StreamingEnroll method, and then use this method to compare it against a large number of other voiceprints in batches.
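
The batching pattern described above can be sketched as follows. This is illustrative Python: `compare_voiceprints` is a stand-in for the CompareVoiceprints RPC and scores with a toy rule, while the batching and best-match bookkeeping mirror how a client might split a large reference set:

```python
# Sketch: compare one target voiceprint against a large reference set in
# batches, tracking a single global best match across all batches.
def compare_voiceprints(target, references):
    # Stand-in for the RPC: returns one similarity score per reference.
    # (Toy rule: count of reference characters appearing in the target.)
    return [sum(c in target for c in ref) for ref in references]

def best_match_in_batches(target, references, batch_size=2):
    best_index, best_score = -1, float("-inf")
    for start in range(0, len(references), batch_size):
        batch = references[start:start + batch_size]
        scores = compare_voiceprints(target, batch)
        for offset, score in enumerate(scores):
            if score > best_score:
                # Convert the batch-local index back to a global index.
                best_index, best_score = start + offset, score
    return best_index, best_score

idx, score = best_match_in_batches("abc", ["xyz", "abq", "abc", "a"])
```

With the real API, each batch would be one CompareVoiceprintsRequest, and the per-batch `best_match_index` and scores would be merged the same way.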

Messages

  • If two or more fields in a message are labeled oneof, then each method call using that message must have exactly one of those fields populated.
  • If a field is labeled repeated, then the generated code will accept an array (or slice, or list, depending on the language).

Audio

Audio to be sent to VoiceBio.

Fields

AudioFormat

Format of the audio to be sent for recognition.

Depending on how they are configured, server instances of this service may not support all the formats provided in the API. One format that is guaranteed to be supported is the RAW format with little-endian 16-bit signed samples with the sample rate matching that of the model being requested.

Fields

  • oneof audio_format.audio_format_raw (AudioFormatRAW ) Audio is raw data without any headers

  • oneof audio_format.audio_format_headered (AudioFormatHeadered ) Audio has a self-describing header. Headers are expected to be sent at the beginning of the entire audio file/stream, and not in every Audio message.

    The default value of this type is AUDIO_FORMAT_HEADERED_UNSPECIFIED. If this value is used, the server may attempt to detect the format of the audio. However, it is recommended that the exact format be specified.

AudioFormatRAW

Details of audio in raw format

Fields

  • encoding (AudioEncoding ) Encoding of the samples. It must be specified explicitly and using the default value of AUDIO_ENCODING_UNSPECIFIED will result in an error.

  • bit_depth (uint32 ) Bit depth of each sample (e.g. 8, 16, 24, 32, etc.). This is a required field.

  • byte_order (ByteOrder ) Byte order of the samples. This field must be set to a value other than BYTE_ORDER_UNSPECIFIED when the bit_depth is greater than 8.

  • sample_rate (uint32 ) Sampling rate in Hz. This is a required field.

  • channels (uint32 ) Number of channels present in the audio. E.g.: 1 (mono), 2 (stereo), etc. This is a required field.
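
Taken together, the fields above determine the raw data rate of the stream. As a quick sanity check (plain Python; the 16 kHz example rate is illustrative, and in practice the sample rate should match the model's native rate reported by ListModels):

```python
# Byte rate of raw PCM audio from the AudioFormatRAW fields above.
# E.g. 16000 Hz, 16-bit, mono => 32000 bytes per second.
def raw_pcm_bytes_per_second(sample_rate, bit_depth, channels):
    if bit_depth % 8 != 0:
        raise ValueError("bit_depth must be a multiple of 8 for this sketch")
    return sample_rate * (bit_depth // 8) * channels

rate = raw_pcm_bytes_per_second(sample_rate=16000, bit_depth=16, channels=1)
```

This is useful when sizing streaming buffers: the 1024-byte buffer in the earlier examples holds 32 ms of such audio.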

CompareVoiceprintsRequest

The top level message sent by the client for the CompareVoiceprints method. This is similar to StreamingIdentifyRequest, but operates on pre-extracted voiceprints without sending any audio data.

Fields

  • model_id (string ) ID of the model to use for comparison. The model used for comparison must match with the model used for enrollment of the voiceprints. A list of supported IDs can be found using the ListModels call.

  • target_voiceprint (Voiceprint ) The voiceprint to compare against the reference voiceprints.

  • reference_voiceprints (Voiceprint repeated) Voiceprints that should be compared against the target voiceprint.

CompareVoiceprintsResponse

The message returned by the server for the CompareVoiceprints method. This contains the similarity scores and match results for comparing the target voiceprint against each of the reference voiceprints, as well as the index of the best matching voiceprint in the reference list, if any of them is a match. This is similar to StreamingIdentifyResponse, but operates on pre-extracted voiceprints without sending any audio data.

Fields

  • best_match_index (int32 ) Index (0-based) of the best matching voiceprint in the list of reference voiceprints provided in the CompareVoiceprintsRequest message. If none of the voiceprints was a match, a negative value is returned.

  • voiceprint_comparison_results (VoiceprintComparisonResult repeated) Result of comparing the given target voiceprint against each of the reference voiceprints. The order of this list is the same as the reference voiceprint list provided in the CompareVoiceprintsRequest message.

EnrollmentConfig

Configuration for Enrollment of speakers.

Fields

  • model_id (string ) ID of the model to use for enrollment. A list of supported IDs can be found using the ListModels call.

  • audio_format (AudioFormat ) Format of the audio to be sent for enrollment.

  • previous_voiceprint (Voiceprint ) Empty string for new users. For re-enrolling additional users with new audio data, set this to that user’s previous voiceprint. The previous voiceprint needs to have been generated using the same model as specified in this config.

EnrollmentStatus

The message returned as part of StreamingEnrollResponse, to provide information about whether voiceprint is sufficiently trained.

Fields

  • enrollment_complete (bool ) Whether sufficient data has been provided as part of this user’s enrollment. If this is false, more audio should be collected from the user and re-enrollment should be done. If this is true, it is still OK to enroll more data for the same user to update the voiceprint.

  • additional_audio_required_seconds (uint32 ) If enrollment is not yet complete, how many more seconds of user’s speech are required to complete the enrollment. If enrollment is completed successfully, this value will be set to 0.
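
These fields suggest a simple client-side enrollment loop: keep collecting audio until enrollment_complete is true. A sketch with a simulated sequence of statuses (the SDK calls themselves are omitted; each tuple stands in for the two fields above):

```python
# Sketch of an enrollment loop driven by EnrollmentStatus. Each simulated
# status is (enrollment_complete, additional_audio_required_seconds).
statuses = [(False, 10), (False, 4), (True, 0)]

def run_enrollment(status_iter):
    rounds = 0
    for complete, seconds_needed in status_iter:
        rounds += 1
        if complete:
            return rounds, seconds_needed
        # In a real client: record `seconds_needed` more seconds of speech
        # and call StreamingEnroll again, passing the previously returned
        # voiceprint as previous_voiceprint in the EnrollmentConfig.
    raise RuntimeError("ran out of audio before enrollment completed")

rounds, remaining = run_enrollment(statuses)
```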

IdentificationConfig

Configuration for Identification of a speaker.

Fields

  • model_id (string ) ID of the model to use for identification. A list of supported IDs can be found using the ListModels call. The model used for identification must match with the model used for enrollment.

  • audio_format (AudioFormat ) Format of the audio to be sent for identification.

  • voiceprints (Voiceprint repeated) Voiceprints of potential speakers that need to be identified in the given audio.

ListModelsRequest

The top-level message sent by the client for the ListModels method.

ListModelsResponse

The message returned to the client by the ListModels method.

Fields

  • models (Model repeated) List of models available for use that match the request.

Model

Description of a VoiceBio model.

Fields

  • id (string ) Unique identifier of the model. This identifier is used to choose the model that should be used for enrollment, verification or identification requests. This ID needs to be specified in the respective config messages for these requests.

  • name (string ) Model name. This is a concise name describing the model, and may be presented to the end-user, for example, to help choose which model to use for their voicebio task.

  • attributes (ModelAttributes ) Model Attributes

ModelAttributes

Attributes of a VoiceBio model

Fields

  • sample_rate (uint32 ) Audio sample rate (native) supported by the model

StreamingEnrollRequest

The top level messages sent by the client for the StreamingEnroll method. In this streaming call, multiple StreamingEnrollRequest messages should be sent. The first message must contain an EnrollmentConfig message, and all subsequent messages must contain Audio only. All Audio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.

Fields

StreamingEnrollResponse

The message returned by the server for the StreamingEnroll method.

Fields

StreamingIdentifyRequest

The top level messages sent by the client for the StreamingIdentify method. In this streaming call, multiple StreamingIdentifyRequest messages should be sent. The first message must contain an IdentificationConfig message, and all subsequent messages must contain Audio only. All Audio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.

Fields

StreamingIdentifyResponse

The message returned by the server for the StreamingIdentify method.

Fields

  • best_match_index (int32 ) Index (0-based) of the best matching voiceprint in the list of input voiceprints provided in the IdentificationConfig message. If none of the voiceprints was a match, a negative value is returned.

  • voiceprint_comparison_results (VoiceprintComparisonResult repeated) Result of comparing the given audio against each of the input voiceprints. The order of this list is the same as the input voiceprint list provided in the IdentificationConfig message.

StreamingVerifyRequest

The top level messages sent by the client for the StreamingVerify method. In this streaming call, multiple StreamingVerifyRequest messages should be sent. The first message must contain a VerificationConfig message, and all subsequent messages must contain Audio only. All Audio messages must contain non-empty audio. If audio content is empty, the server may choose to interpret it as end of stream and stop accepting any further messages.

Fields

StreamingVerifyResponse

The message returned by the server for the StreamingVerify method.

Fields

VectorVoiceprint

Voiceprint represented in vector form. The specific format and dimensionality of this vector may vary depending on the model used. The VectorizeVoiceprints method can be used to convert a Voiceprint into a VectorVoiceprint representation.

Fields

  • data (float repeated) List of floating point values representing the voiceprint in vector form.

VectorizeVoiceprintsRequest

The top level message sent by the client for the VectorizeVoiceprints method.

Fields

  • model_id (string ) ID of the model to use for vectorization. The model used for vectorization must match with the model used for enrollment of the voiceprints. A list of supported IDs can be found using the ListModels call.

  • voiceprints (Voiceprint repeated) Voiceprints to be vectorized.

VectorizeVoiceprintsResponse

The message returned by the server for the VectorizeVoiceprints method.

Fields

  • voiceprints (VectorVoiceprint repeated) Voiceprint data converted into a vector representation, which can be used for various downstream tasks such as clustering, visualization, or as input features for other machine learning models. The specific format and dimensionality of these vectors may vary depending on the model used.

    The order of this list is the same as the input voiceprint list provided in the VectorizeVoiceprintsRequest message.
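
Because the output order matches the input order, returned vectors can be re-associated with user IDs by position. An illustrative sketch (the vectors here are placeholder lists standing in for VectorVoiceprint.data):

```python
# Pair user IDs with returned vectors by position, relying on the order
# guarantee stated above.
user_ids = ["user1", "user2", "user3"]
returned_vectors = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]  # stand-ins

vectors_by_user = dict(zip(user_ids, returned_vectors))
```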

VerificationConfig

Configuration for Verification of a speaker.

Fields

  • model_id (string ) ID of the model to use for verification. A list of supported IDs can be found using the ListModels call. The model used for verification must match with the model used for enrollment.

  • audio_format (AudioFormat ) Format of the audio to be sent for verification.

  • voiceprint (Voiceprint ) Voiceprint with which audio should be compared.

VersionRequest

The top-level message sent by the client for the Version method.

VersionResponse

The message sent by the server for the Version method.

Fields

  • version (string ) Version of the server handling these requests.

Voiceprint

Voiceprint extracted from user’s audio.

Fields

  • data (string ) Voiceprint data serialized to a string.

VoiceprintComparisonResult

Message describing the result of comparing a voiceprint against the given audio.

Fields

  • is_match (bool ) Whether or not the audio successfully matches with the provided voiceprint.

  • similarity_score (float ) Similarity score representing how closely the audio matched against the voiceprint. This score could be any negative or positive number. A lower value suggests that the audio and voiceprint are less similar, whereas a higher value indicates greater similarity. The is_match field should be used to decide whether the result is a valid match.
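
As noted, the raw score is only an ordering signal; is_match carries the server's threshold decision. A sketch of filtering and ranking results accordingly (field names mirror VoiceprintComparisonResult, but the result objects here are plain dicts rather than protobuf messages):

```python
# Rank comparison results by similarity score, but only treat entries whose
# is_match flag is set as valid matches, as recommended above.
results = [
    {"index": 0, "similarity_score": -1.2, "is_match": False},
    {"index": 1, "similarity_score": 3.7, "is_match": True},
    {"index": 2, "similarity_score": 2.1, "is_match": True},
]

matches = sorted(
    (r for r in results if r["is_match"]),
    key=lambda r: r["similarity_score"],
    reverse=True,
)
best = matches[0]["index"] if matches else -1
```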

Enums

AudioEncoding

The encoding of the audio data to be sent for recognition.

Name Number Description
AUDIO_ENCODING_UNSPECIFIED 0 AUDIO_ENCODING_UNSPECIFIED is the default value of this type and will result in an error.
AUDIO_ENCODING_SIGNED 1 PCM signed-integer
AUDIO_ENCODING_UNSIGNED 2 PCM unsigned-integer
AUDIO_ENCODING_IEEE_FLOAT 3 PCM IEEE-Float
AUDIO_ENCODING_ULAW 4 G.711 mu-law
AUDIO_ENCODING_ALAW 5 G.711 a-law

AudioFormatHeadered

Name Number Description
AUDIO_FORMAT_HEADERED_UNSPECIFIED 0 AUDIO_FORMAT_HEADERED_UNSPECIFIED is the default value of this type.
AUDIO_FORMAT_HEADERED_WAV 1 WAV with RIFF headers
AUDIO_FORMAT_HEADERED_MP3 2 MP3 format with a valid frame header at the beginning of data
AUDIO_FORMAT_HEADERED_FLAC 3 FLAC format
AUDIO_FORMAT_HEADERED_OGG_OPUS 4 Opus format with OGG header

ByteOrder

Byte order of multi-byte data

Name Number Description
BYTE_ORDER_UNSPECIFIED 0 BYTE_ORDER_UNSPECIFIED is the default value of this type.
BYTE_ORDER_LITTLE_ENDIAN 1 Little Endian byte order
BYTE_ORDER_BIG_ENDIAN 2 Big Endian byte order

Scalar Value Types

.proto Type C++ Type C# Type Go Type Java Type PHP Type Python Type Ruby Type

double
double double float64 double float float Float

float
float float float32 float float float Float

int32
int32 int int32 int integer int Bignum or Fixnum (as required)

int64
int64 long int64 long integer/string int/long Bignum

uint32
uint32 uint uint32 int integer int/long Bignum or Fixnum (as required)

uint64
uint64 ulong uint64 long integer/string int/long Bignum or Fixnum (as required)

sint32
int32 int int32 int integer int Bignum or Fixnum (as required)

sint64
int64 long int64 long integer/string int/long Bignum

fixed32
uint32 uint uint32 int integer int Bignum or Fixnum (as required)

fixed64
uint64 ulong uint64 long integer/string int/long Bignum

sfixed32
int32 int int32 int integer int Bignum or Fixnum (as required)

sfixed64
int64 long int64 long integer/string int/long Bignum

bool
bool bool bool boolean boolean boolean TrueClass/FalseClass

string
string string string String string str/unicode String (UTF-8)

bytes
string ByteString []byte ByteString string str String (ASCII-8BIT)
