FAQ
System Requirements
Does Cobalt Transcribe run on Linux?
Yes, you can run Cobalt Transcribe on Linux natively or via Docker. Check out the documentation to get started.
Does Cobalt Transcribe run on macOS?
Yes, you can run Cobalt Transcribe on macOS via Docker Desktop for evaluation purposes. Check out the documentation to get started. However, we don’t recommend running Cobalt Transcribe on macOS in production.
Does Cobalt Transcribe run on Windows?
Yes, Windows is supported via Docker Desktop for evaluation. We don’t recommend running Cobalt Transcribe on Windows in production.
Does Cobalt Transcribe run on embedded devices?
Yes, Cobalt Transcribe supports embedded devices such as the Raspberry Pi and NVIDIA Tegra. However, you’ll probably want to contact us for a smaller model due to memory limitations.
Does Cobalt Transcribe run on Android or iOS?
Android and iOS require a specific implementation strategy. Please contact us for support working with Android or iOS.
What are the technical requirements for a scaled on-premise deployment?
Each containerized instance of Cobalt Transcribe should be provisioned with 4 cores and 8 GB of RAM when used for streaming recognition.
Product Features
Which languages does Cobalt Transcribe support?
Cobalt offers speech recognition in English (US & UK), Spanish, French, German, Russian, Brazilian Portuguese, Korean, Japanese, Swahili, and Cambodian. Please contact sales@cobaltspeech.com to learn more. Cobalt is always looking for partners to develop, sell, and/or market speech technology in other languages.
Can I use Cobalt Transcribe in the field of telephony such as contact centers?
Yes, Cobalt Transcribe offers low-latency and 8kHz telephony models for transcribing telephone calls and contact center conversations. Additional insight is retrievable through high-precision timestamps and n-best transcripts. Cobalt technology provides solutions for contact centers including summarization, redaction, and sentiment analysis of conversations.
Can I redact Personally Identifiable Information (PII) from the output transcript?
PII redaction is a separate service that can be integrated with Cobalt Transcribe. Please contact us for details.
Does Cobalt Transcribe support real-time transcription?
Yes, Cobalt Transcribe can accept audio samples as they are recorded and will provide streaming output with relatively low latency. It also supports partial results, which are available almost immediately during decoding. This feature is useful in real-time interfaces where users can see what is being recognized nearly as they speak, though some words in the preliminary output may be corrected in the final result as more audio and context become available. Cobalt Transcribe performs automatic endpointing to determine the end of an utterance.
Recognition Accuracy
How accurate is Cobalt Transcribe?
Cobalt Transcribe is available in two different architectures: Hybrid and End-to-End. We have evaluated the word error rate (WER) of both versions of Cobalt Transcribe on several industry-standard test datasets:
| Dataset | Domain | Hybrid WER | End-to-End WER |
|---|---|---|---|
| CommonVoice-test | Read Speech, Crowd Sourced | 11.5% | 5.0% |
| Librispeech-test | Read Speech, Audiobooks, Crowd Sourced | 6.2% | 2.2% |
| Tedlium-test | Spontaneous Speech, Presentations | 7.5% | 3.9% |
| WallStreetJournal-test | Read Speech, News | 7.4% | 5.8% |
| MultilingualLibriSpeech-test | Read Speech, Audiobooks, Crowd Sourced | 8.8% | 4.0% |
| OHSU-test | Spontaneous Speech, Children’s Speech | 16.9% | 12.4% |
The WER is dependent on a number of factors such as the train-test split, formatting of the decoded transcript, accuracy vs. latency trade-offs, etc. Therefore, these numbers are not directly comparable to the WERs reported by other service providers, even on the same dataset.
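To make the table above concrete, WER is the word-level edit distance between the reference and the decoded transcript, divided by the number of reference words. The sketch below is the standard Levenshtein-distance formulation, not Cobalt's scoring tool:

```python
# Minimal sketch of word error rate (WER):
# WER = (substitutions + deletions + insertions) / reference word count.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words
```

Note that real scoring pipelines also normalize text (casing, punctuation, number formatting) before alignment, which is one reason WERs from different providers are not directly comparable.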
How do I further improve audio transcription accuracy?
Our base models are trained on a large amount of audio and text to ensure robust accuracy on a variety of use cases. The configurable nature of Cobalt Transcribe’s models allows for updates that can improve transcription accuracy specific to your use-case:
- Adding vocabulary and context via the RecognitionContext API: this will help you capture proper names and domain-specific terminology correctly.
- Using end-to-end (E2E) models, which typically have better accuracy and more robust recognition performance across different accents and dialects. However, E2E models are more computationally expensive and tend to have higher latency than hybrid models. If you would like to try one out, please contact us at sales@cobaltspeech.com.
- Continuous adaptation of acoustic models (AMs) using Cobalt Transcribe Tuner. This continuous learning framework automatically updates the acoustic model using your production data. For more information, contact sales@cobaltspeech.com.
- Working with Cobalt’s speech scientists to optimize accuracy for your conditions and application: for speech recognition in a specific acoustic environment or domain-specific use case (e.g. noisy factory floor, airport, surgical lab, patient-doctor conversations, quarterly earnings calls), we can adapt the acoustic and language models using relevant audio and text data.
I am starting a new speech project. How can I get the best transcription accuracy?
The transcription accuracy depends on several factors such as:
- Appropriate sampling rate (8kHz / 16kHz) and matching model
- Audio format: lossless codecs like WAV or FLAC are preferable to MP3 or Ogg
- Microphone selection, placement, and directionality (cardioid, omni)
- Trade-offs between latency and accuracy
- Consider constrained grammar or providing recognition context
- AM and LM adaptation
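The first factor, matching the sampling rate to the model, can be sketched as a simple rule. The model names below are illustrative, not actual Cobalt Transcribe identifiers:

```python
# Sketch: matching an audio source's sampling rate to the appropriate model.
# "telephony-8khz" / "wideband-16khz" are hypothetical names for illustration.

def pick_model(sample_rate_hz: int) -> str:
    if sample_rate_hz <= 8000:
        return "telephony-8khz"   # narrowband telephony audio
    # Higher-rate sources (16kHz, 44.1kHz, 48kHz) are resampled down to
    # 16kHz, so they all map to the wideband model.
    return "wideband-16khz"

print(pick_model(8000))   # telephony-8khz
print(pick_model(44100))  # wideband-16khz
```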
Recognition Speed/Performance
How do I further improve audio transcription latency?
One way to improve the latency is to make the streaming buffer size smaller. We recommend setting the streaming buffer size between 512 bytes and 4096 bytes. We can also work with you to tune model parameters such as beam search width to reduce the latency. Our speech scientists can also make a smaller model for your application, or tune parameters and model size for optimal latency and accuracy trade-offs. If you’re interested in this, contact sales@cobaltspeech.com.
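The buffer-size recommendation above amounts to sending the recognizer small, fixed-size chunks of audio rather than large blocks. A minimal sketch, with the transport call itself omitted:

```python
# Sketch: splitting a raw PCM byte stream into fixed-size chunks before
# sending them to the recognizer. The 512-4096 byte range follows the
# recommendation above; smaller chunks mean lower latency per chunk.

def stream_chunks(audio: bytes, chunk_size: int = 2048):
    if not 512 <= chunk_size <= 4096:
        raise ValueError("recommended chunk size is 512-4096 bytes")
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]

# 1 second of 16kHz 16-bit mono audio = 32000 bytes
# -> 16 chunks of 2048 bytes (the last one partial)
chunks = list(stream_chunks(b"\x00" * 32000, chunk_size=2048))
print(len(chunks))  # 16
```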
How long does it take to transcribe audio?
The processing speed of speech to text conversion is measured by the real time factor (RTF) which is the ratio of time taken to transcribe an audio file to the duration of the audio. Cobalt Transcribe has an RTF of 0.16 and 0.4 using our general purpose hybrid and E2E models, respectively. That means transcribing one hour of speech typically takes approximately 10 minutes for the hybrid model and about 24 minutes for the E2E model.
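The RTF figures above translate into processing time as follows (processing time = RTF × audio duration):

```python
# Worked example of the real-time factor (RTF) figures quoted above.

def processing_minutes(audio_minutes: float, rtf: float) -> float:
    return rtf * audio_minutes

hybrid = processing_minutes(60, 0.16)  # 9.6 minutes for one hour of audio
e2e = processing_minutes(60, 0.4)      # 24.0 minutes for one hour of audio
print(hybrid, e2e)
```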
Are there limits on the number of jobs that can be processed concurrently?
The number of concurrent audio channels depends on the models being used and the CPU. Our general purpose models typically support 6 channels per core for real-time streams when running on a CPU such as a c6i EC2 instance.
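The 6-channels-per-core figure makes capacity planning a simple ceiling division; a sketch, assuming that figure holds for your model and CPU:

```python
# Sketch: estimating CPU cores needed for a target number of concurrent
# real-time streams, using the 6-channels-per-core figure quoted above.
import math

CHANNELS_PER_CORE = 6  # general purpose models on a c6i-class CPU

def cores_needed(concurrent_streams: int) -> int:
    return math.ceil(concurrent_streams / CHANNELS_PER_CORE)

print(cores_needed(100))  # 17 cores for 100 concurrent streams
```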
What does it cost, in terms of CPU resources, to transcribe a million minutes of speech?
With our standard models, Cobalt Transcribe can run 6 channels per core for real-time speech input, assuming a c6i EC2 processor. A 4-core processor can therefore transcribe 24 minutes of audio per minute of wall-clock time. At current AWS pricing, a 4-core c6i instance costs approximately $0.17 per hour, so the cost for a million minutes of speech is approximately $120.
Costs can be significantly lower when using c7g instances on EC2. Contact us for more information.
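The estimate above can be reproduced directly from the figures quoted in the answer (instance size and pricing may of course change):

```python
# Worked example of the cost estimate above: 6 channels per core on a
# 4-core c6i instance priced at $0.17/hour.

CHANNELS_PER_CORE = 6
CORES = 4
INSTANCE_COST_PER_HOUR = 0.17

audio_minutes = 1_000_000
minutes_per_wall_minute = CHANNELS_PER_CORE * CORES        # 24
wall_hours = audio_minutes / minutes_per_wall_minute / 60  # ~694 hours
cost = wall_hours * INSTANCE_COST_PER_HOUR
print(round(cost, 2))  # ~$118, i.e. approximately $120
```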
How scalable is Cobalt Transcribe? How can I carry out large deployments?
Cobalt Transcribe scales from low-power embedded hardware to large-scale servers. For large deployments, you can increase the number of concurrent audio channels for faster decoding; Cobalt Transcribe can decode on separate threads. Our general purpose models typically support 6 channels per core when running on a CPU such as a c6i EC2 instance. Moreover, you can deploy Cobalt Transcribe via Docker and Kubernetes to automatically scale your resources up (or down) according to demand in a cost-effective manner without degrading performance.
Recognition Accuracy vs. Speed/Performance
How do I choose between hybrid and end-to-end models?
End-to-end models are likely the best choice for customers primarily concerned with maximizing accuracy for general transcription. However, hybrid models may be more appropriate, and even more accurate, under some conditions:
- very low latency streaming
- low compute/memory embedded transcription
- highly custom or unique vocabulary
- a very narrow domain (e.g. a small set of device-directed commands)
- vocabularies or expected command sets that change often (even between input audio streams)
For a detailed comparison, take a look at Hybrid vs End-to-End Models.
Supported Audio Formats
What type of media files does Cobalt Transcribe support?
Cobalt Transcribe supports common media formats such as WAV, MP3, FLAC, and Ogg, and audio encodings such as PCM, mu-law, and a-law. Raw audio is also supported.
My audio source is 48kHz/44.1kHz. Does Cobalt Transcribe support that?
Yes, we resample the audio to an appropriate sampling rate automatically. Please note that our default sampling rates are 16kHz for wideband models and 8kHz for telephony models. Accuracy improvements for sampling rates above 16kHz are minimal, and generally not worth the associated increase in data rates and data transfer requirements, or the additional resampling overhead.
API and Integration
How do I test Cobalt Transcribe?
We are happy to offer free trials under a software evaluation license of Cobalt Transcribe with all available features. Typically, our software evaluation licenses are for a period of 30 days. To get started with Cobalt Transcribe, please check the quick start.
Please try our Cobalt Transcribe Speech Recognition Demo for simple evaluation purposes. This demo server is for testing and demonstration purposes only and is not guaranteed to support high availability or high volume.
Which SDKs are available to integrate Cobalt Transcribe into my project?
Cobalt Transcribe uses gRPC to define its APIs. The API is defined as a protobuf schema, and gRPC tools can be used to generate client SDKs in several languages, including Python, Go, C++, Java, and C#.
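For Python, stub generation from a protobuf schema typically goes through `grpcio-tools`. A sketch, in which the proto file name and directory layout are assumptions; substitute the paths of the actual Cobalt Transcribe schema:

```python
# Sketch: generating Python client stubs from a protobuf schema with
# grpcio-tools (pip install grpcio-tools). "protos/transcribe.proto" and
# the output directory are illustrative placeholders, not Cobalt paths.
from grpc_tools import protoc

protoc.main([
    "grpc_tools.protoc",
    "-Iprotos",                     # directory containing the .proto schema
    "--python_out=generated",       # protobuf message classes
    "--grpc_python_out=generated",  # gRPC client stub
    "protos/transcribe.proto",
])
```

Other languages use the corresponding `protoc` plugin (e.g. `protoc-gen-go-grpc` for Go) against the same schema.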
Can I use my own models with Cobalt Transcribe?
Cobalt Transcribe models are trained on thousands of hours of data and produce very accurate transcripts over a wide range of different conditions. We provide tools and services that allow our models to be tailored towards your particular use case if additional accuracy is desired. If customers have their own existing Kaldi or wav2vec 2.0 models, Cobalt Transcribe supports the use of those external models.
Product Comparison
What are the benefits of Cobalt Transcribe over other speech-to-text services?
Compared to other services, Cobalt Transcribe offers the following advantages:
- You can host the Cobalt Transcribe server on your system locally or in your virtual private cloud. This enables you to keep your data private and secure.
- Cobalt Transcribe has low latency. It is particularly useful for embedded devices and real-time applications.
- Cobalt Transcribe is highly customizable. Adapting the language and acoustic models to your specific terminology will improve performance.
- Cobalt’s experienced speech scientists are available to adapt the LM and AM to your target domain for the best recognition results.
- The Cobalt Transcribe API supports several outputs: 1-best results, per-word start times and durations, per-word confidences, n-best transcripts, confusion networks, and lattices.