Hybrid vs End-to-End Models

Differences between gen-1 and gen-2 models, and choosing the right model type

Cobalt’s Transcribe engine supports two types of models:

  • Hybrid (gen-1) - A Hybrid model consists of a sequence of independent models that, when chained together, can convert audio to words. This type of model no longer produces state-of-the-art accuracy but still remains dominant in many commercial ASR applications.
  • End-to-End (gen-2) - An End-to-End (E2E) model is mostly a single large neural network that can convert audio directly to text transcripts (or something very close that requires little additional processing).

A hybrid model can be viewed as several different models glued together in a particular sequence to convert audio to text. The cascade of models will:

  1. Convert audio to features based on the amount of energy in different frequency ranges.
  2. Use a neural network to predict the context-dependent sounds (phones) present in every ~10 milliseconds of audio.
  3. Convert the context-dependent phones to context-independent phones (for example, the 'th' sound in 'the' is a phone).
  4. Convert the sequence of phones into a sequence of words using a lexicon model: a manually curated dictionary of words and the expected sounds/pronunciations for each word.
  5. Convert the candidate word sequences into the most likely sequence of words using a Language Model trained on a large amount of text, which helps resolve ambiguity like "WRECK A NICE BEACH" vs. "RECOGNIZE SPEECH".
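As a rough illustration, the cascade above can be sketched as a chain of function calls. This is a toy, runnable example: real hybrid decoders fuse these stages into a single weighted search (often over a WFST) rather than literally chaining them, and every name, phone symbol, and lexicon entry here is invented for illustration.

```python
# Toy sketch of the hybrid (gen-1) cascade: each stage is a stand-in for a
# real model, but the data flow between stages matches the description above.

TOY_LEXICON = {("dh", "ah"): "the", ("k", "ae", "t"): "cat"}  # pronunciation -> word

def extract_features(audio):
    # (1) per-frame energy in frequency bands (here: the raw frames themselves)
    return audio

def acoustic_model(features):
    # (2) a neural network would score context-dependent phones per ~10 ms frame;
    # here each "frame" is already a (left, phone, right) triphone label
    return list(features)

def to_ci_phones(cd_phones):
    # (3) drop the left/right context to get context-independent phones
    return [p for (_, p, _) in cd_phones]

def lexicon_decode(phones, lexicon=TOY_LEXICON):
    # (4) greedily match phone subsequences against the pronunciation dictionary
    words, i = [], 0
    while i < len(phones):
        for pron, word in lexicon.items():
            if tuple(phones[i:i + len(pron)]) == pron:
                words.append(word)
                i += len(pron)
                break
        else:
            i += 1  # skip an unmatched phone
    return words

def language_model_rescore(words):
    # (5) a real LM would rescore competing word sequences; toy: pass through
    return " ".join(words)

frames = [("sil", "dh", "ah"), ("dh", "ah", "k"), ("ah", "k", "ae"),
          ("k", "ae", "t"), ("ae", "t", "sil")]
phones = to_ci_phones(acoustic_model(extract_features(frames)))
print(language_model_rescore(lexicon_decode(phones)))  # prints: the cat
```

The key point of the sketch is that each stage is a separate, swappable component, which is what makes hybrid models easy to customize but also keeps their accuracy bounded by the weakest link in the chain.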

An End-to-End speech recognition model is fairly straightforward by comparison. Rather than a series of small models, the bulk of the transcription is performed by one large neural network. Depending on the specific E2E architecture, there may be a small amount of light feature generation on the input side of the network and a little processing on the output side to assemble the transcript, but the decoding process is much simpler than the hybrid approach, and most of the work is performed in one large neural network.
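The "little processing on the output side" can be as simple as a greedy CTC-style collapse: the network emits one symbol per frame (including a blank), and the post-processing just merges repeated symbols and drops blanks. This toy sketch assumes a CTC-style model; the blank symbol and the per-frame outputs below are invented for illustration.

```python
# Toy greedy CTC-style collapse: merge repeats, drop the blank symbol "_".
def ctc_greedy_collapse(frame_symbols, blank="_"):
    out, prev = [], None
    for s in frame_symbols:
        if s != blank and s != prev:
            out.append(s)
        prev = s
    return "".join(out)

# Hypothetical per-frame argmax output of the network for the word "hello".
# The blank between the two "l" runs is what preserves the double letter.
print(ctc_greedy_collapse(list("hh_ee_ll_lloo_")))  # prints: hello
```

Other E2E architectures (e.g. attention-based encoder-decoders) emit text tokens directly and need even less output-side processing.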

Selecting between Hybrid and E2E Model Types

Advantages of E2E (gen-2) models:

  • High Accuracy - Our E2E models push the state-of-the-art in speech recognition accuracy in a variety of diverse use cases, and typically produce 30-50% fewer errors than gen-1 models. You can take a look at the word error rates of both hybrid and E2E models on several industry-standard test datasets here.
  • Sample Rate Flexibility - All of our E2E models can transcribe both 8 kHz telephone audio and 16 kHz audio without any loss in accuracy.
  • Out-of-Vocabulary Word Support - Even words never seen during training can very often be recognized correctly.
  • Parallel Processing - The transcription of a single audio file can be easily run in parallel across multiple CPUs or on a GPU.
  • Easier Training - Training models is more straightforward (if adapting or fully re-training a model to be optimized for a particular use case).
  • Low Resource Language Support - Less information about a new language is required to train a model for it. The phones/sounds, pronunciations of words, and vocabulary are not required. Usually, much less training data is also needed to produce a suitable recognition model.

Advantages of Hybrid (gen-1) models:

  • Low Latency - Our hybrid models can achieve very low latency (<100ms). E2E models can be run with settings that reduce latency, but doing so increases the word error rate, and the latency will still not be as low as the hybrid model's.
  • Efficient Transcription - Each CPU core can transcribe several audio streams at the same time.
  • Easy Customization - New words and/or pronunciations can be added at transcription time. We also offer tools that let you re-build models with your own text-only data, making the Hybrid model more accurate on a target domain.
  • More Suitable for Embedded Devices - Hybrid models can be trained to have relatively light CPU/memory/storage requirements if they need to run on an embedded device.
  • Constrained Use Cases - If transcription is being deployed in a use case that has a limited vocabulary and/or grammar (not general transcription), the hybrid model can be trained or adapted to target this use case and achieve extremely high accuracy. Examples would be voice command-and-control of a device, or users speaking from a list of commands. Multiple grammars can even be supported and swapped in/out of the recognizer when it is running.
  • Confidence - Per-word confidence estimates are more accurate for Hybrid models than E2E models.
  • Alternate Words - Hybrid models can return rich results that go beyond a 1-best transcript, containing potential alternate words/sentences for the transcribed audio.
  • Less Compute Required for Training - Training and adapting speech models requires fewer GPU/compute resources.
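The Easy Customization advantage above follows from the pipeline structure: the lexicon is a separate, editable component, so new words can be registered without retraining the neural network. A minimal sketch of that idea, assuming a dictionary-backed lexicon (the names and phone symbols here are invented for illustration and are not Cobalt's actual customization API):

```python
# Hypothetical lexicon customization: each word maps to one or more
# pronunciations (tuples of phones), and entries can be added at runtime.
lexicon = {"cobalt": [("k", "ow", "b", "ao", "l", "t")]}

def add_word(lexicon, word, pronunciation):
    """Register a new word, or an extra pronunciation for an existing word."""
    lexicon.setdefault(word, []).append(tuple(pronunciation))

add_word(lexicon, "transcribe", ["t", "r", "ae", "n", "s", "k", "r", "ay", "b"])
add_word(lexicon, "cobalt", ["k", "ow", "b", "l", "t"])  # alternate pronunciation
print(sorted(lexicon))  # prints: ['cobalt', 'transcribe']
```

An E2E model has no such separable lexicon, which is why equivalent customization typically requires adaptation or retraining of the network itself.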

End-to-End models are likely to be the best choice for customers who are primarily concerned with maximizing accuracy for general transcription. However, the hybrid models may be more appropriate, and even more accurate, under some conditions: very low latency streaming, low compute/memory embedded transcription, a highly custom/unique vocabulary, a very narrow domain (ex: speaking a small number of device-directed commands), or vocabulary and expected command sets that change often (even between audio streams passed as input). By supporting both types of models and offering several different options for model customization, Cobalt is able to satisfy nearly any use case a customer may require. Supporting multiple model types also future-proofs the service: if an improved type of speech recognition model becomes available in the future, users of cobalt-transcribe will be able to adopt it with minimal changes to their API integration.

Our gen-2 E2E models currently do not support word-level confidence, confusion network outputs, recognition context, or GPU decoding. However, these features will be added to the E2E models soon.