Recognition Configurations

Describes how to configure requests to the Transcribe server.
  • An in-depth explanation of the methods, data structures and types in the auto-generated SDKs can be found in the API Reference section. The sub-section on the RecognitionConfig object is particularly important here. This page discusses the common combinations of values set in RecognitionConfig sent to the server.

  • First, here’s a quick overview of the fields in RecognitionConfig.

Field | Required | Default | Description
----- | -------- | ------- | -----------
model_id | Yes | - | Unique ID of the model to use.
audio_format_raw | Yes, for raw audio | - | Specifies the details of raw audio samples, such as those recorded from a microphone stream.
audio_format_headered | No | UNSPECIFIED | Can be used when the audio has a self-describing header, such as WAV, FLAC, MP3, OPUS, etc. If not set, transcribe-server will try to auto-detect the audio encoding from the header.
selected_audio_channels | No | [0] (mono) | Specifies which channels of a multi-channel audio file to transcribe, each as its own individual audio stream.
audio_time_offset_ms | No | 0 | Indicates that the audio being streamed to the recognizer is offset from the original stream by the provided duration in milliseconds. This offset is added to all timestamps in results returned by the recognizer.
enable_confusion_network | No | false | Toggles the inclusion of a confusion network consisting of multiple alternative transcriptions. The specified model must also support confusion networks for this field to be populated.
metadata | No | "" | Carries any custom metadata associated with the audio being sent. The server may record this metadata when processing the request, but does not use it for any other purpose.
context | No | nil | Provides context information that can aid speech recognition, such as probable phrases or words that may appear in the recognition output, or even out-of-vocabulary words for the model being used. Currently, all context information must first be pre-compiled via CompileContext().
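  • Several of these fields can be combined in a single request. As an illustration (the channel, offset, and metadata values below are made up for the example), a config that transcribes channels 0 and 1 of a stereo file as separate streams, shifts all result timestamps forward by five seconds, and attaches caller-defined metadata might look like this:

```json
{
    "model_id": "1",
    "selected_audio_channels": [0, 1],
    "audio_time_offset_ms": 5000,
    "metadata": "segment-2-of-call"
}
```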

Use cases

Transcribing Headered Files

  • The most basic use case is getting a formatted transcript for a headered audio file such as foo.wav. This would simply need a config such as the following:
{
    "model_id": "1"
}
  • Transcribe will return one or more results depending on partial result frequency, endpoints in speech, etc., each of which would look like the following:
{
  "error": null,
  "result": {
    "alternatives": [
      {
        "transcript_formatted": "Tomorrow is a new day.",
        "transcript_raw": "TOMORROW IS A NEW DAY",
        "start_time_ms": 180,
        "duration_ms": 1425,
        "confidence": 0.870
      },
      {
        "transcript_formatted": "Tomorrow is a you day.",
        "transcript_raw": "TOMORROW IS A YOU DAY",
        "start_time_ms": 180,
        "duration_ms": 1425,
        "confidence": 0.130
      }
      // ...
      // Other alternative hypotheses.
      // ...
    ]
  }
}
  • If some sort of non-fatal error was encountered, Transcribe will populate the error field. One such case may be sending audio sampled at a lower sample rate than what the model is configured for (e.g. sending 8 kHz audio to a 16 kHz model):
{
  "error": {
    "message": "potential accuracy loss: input sample rate (8000) is lower than required (16000)"
  },
  "result": {
    // ...
    // Results
    // ...
  }
}
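  • A client typically just wants the highest-confidence alternative, while surfacing any non-fatal error attached to the response. Here is a minimal sketch of that logic over responses shaped like the examples above; the helper name is illustrative and not part of the generated SDK:

```python
def best_transcript(response):
    """Return the highest-confidence formatted transcript from a parsed
    response, logging any non-fatal error the server attached to it."""
    if response.get("error"):
        # Non-fatal errors still come with results, so only warn.
        print("warning:", response["error"]["message"])
    alternatives = response.get("result", {}).get("alternatives", [])
    if not alternatives:
        return None
    # Pick the alternative the recognizer is most confident about.
    top = max(alternatives, key=lambda alt: alt["confidence"])
    return top["transcript_formatted"]

response = {
    "error": None,
    "result": {
        "alternatives": [
            {"transcript_formatted": "Tomorrow is a new day.", "confidence": 0.870},
            {"transcript_formatted": "Tomorrow is a you day.", "confidence": 0.130},
        ]
    },
}
print(best_transcript(response))  # Tomorrow is a new day.
```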

Transcribing Raw Audio Stream

  • For transcribing raw audio streams, such as those coming in from a live microphone, the details of the audio samples, such as their sampling rate and encoding, must be specified in the RecognitionConfig like so:
{
    "model_id": "1",
    "audio_format_raw": {
      "encoding": "SIGNED",
      "bit_depth": 16,
      "byte_order": "LITTLE_ENDIAN",
      "sample_rate": 16000,
      "channels": 1
    }
}
  • For various other encoding formats for raw samples, check AudioFormatRaw in the API specification.
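  • When chunking a raw stream for the recognizer, the buffer sizes follow directly from the format fields above. A quick sanity-check sketch (field names mirror the audio_format_raw config; the helper itself is illustrative):

```python
def bytes_per_second(bit_depth, sample_rate, channels):
    """Byte rate of a raw PCM stream with the given format."""
    return (bit_depth // 8) * sample_rate * channels

# The 16-bit, 16 kHz, mono format from the config above.
fmt = {"bit_depth": 16, "sample_rate": 16000, "channels": 1}
rate = bytes_per_second(**fmt)
print(rate)        # 32000 bytes per second of audio
print(rate // 10)  # 3200-byte chunks carry 100 ms of audio each
```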

Getting Word-level Details

  • If you need word-level details such as word timestamps (to align subtitles with a video, for example), you can use the following config to enable them:
{
    "model_id": "1",
    "enable_word_details": true
}
  • Each alternative hypothesis in the returned results will have a word_details field containing details for both formatted and raw words:
{
  "error": null,
  "result": {
    "alternatives": [
      {
        "transcript_formatted": "Tomorrow is a new day.",
        "transcript_raw": "TOMORROW IS A NEW DAY",
        "start_time_ms": 180,
        "duration_ms": 1425,
        "confidence": 0.870,
        "word_details": {
          "formatted": [
            { "word": "Tomorrow", "confidence": 1.0, "start_time_ms": 180, "duration_ms": 800 },
            { "word": "is", "confidence": 1.0, "start_time_ms": 980, "duration_ms": 120 },
            { "word": "a", "confidence": 1.0, "start_time_ms": 1100, "duration_ms": 120 },
            { "word": "new", "confidence": 0.870, "start_time_ms": 1220, "duration_ms": 210 },
            { "word": "day.", "confidence": 1.0, "start_time_ms": 1450, "duration_ms": 155 }
          ],
          "raw": [
            { "word": "TOMORROW", "confidence": 1.0, "start_time_ms": 180, "duration_ms": 800 },
            { "word": "IS", "confidence": 1.0, "start_time_ms": 980, "duration_ms": 120 },
            { "word": "A", "confidence": 1.0, "start_time_ms": 1100, "duration_ms": 120 },
            { "word": "NEW", "confidence": 0.870, "start_time_ms": 1220, "duration_ms": 210 },
            { "word": "DAY", "confidence": 1.0, "start_time_ms": 1450, "duration_ms": 155 }
          ],
        }
      },
      // ...
      // Other alternative hypotheses.
      // ...
    ]
  }
}
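  • The formatted word timings above map directly onto subtitle cues. As a sketch of the subtitle use case (assuming responses shaped like the example; the helper names are illustrative), here is how word_details could be turned into one SRT entry:

```python
def ms_to_srt(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, milli = divmod(rem, 1_000)
    return f"{h:02}:{m:02}:{s:02},{milli:03}"

def srt_cue(index, words):
    """Build a single SRT cue spanning a list of timed formatted words."""
    start = words[0]["start_time_ms"]
    end = words[-1]["start_time_ms"] + words[-1]["duration_ms"]
    text = " ".join(w["word"] for w in words)
    return f"{index}\n{ms_to_srt(start)} --> {ms_to_srt(end)}\n{text}\n"

# The "formatted" word_details from the example response above.
words = [
    {"word": "Tomorrow", "start_time_ms": 180, "duration_ms": 800},
    {"word": "is", "start_time_ms": 980, "duration_ms": 120},
    {"word": "a", "start_time_ms": 1100, "duration_ms": 120},
    {"word": "new", "start_time_ms": 1220, "duration_ms": 210},
    {"word": "day.", "start_time_ms": 1450, "duration_ms": 155},
]
print(srt_cue(1, words))
```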

Getting Confusion Networks

  • For applications that need more than the one-best transcription, the most comprehensive and detailed results are found in the confusion network. Please refer to the in-depth confusion network documentation to see what is included.

  • To enable the confusion network, the following config can be used:

{
    "model_id": "1",
    "enable_confusion_network": true
}
  • The confusion network will be accessible at the cnet field in the results returned:
{
  "error": null,
  "result": {
    "alternatives": [
        {
          "transcript_formatted": "Tomorrow is a new day.",
          "transcript_raw": "TOMORROW IS A NEW DAY",
          "start_time_ms": 180,
          "duration_ms": 1425,
          "confidence": 0.870
        },
        {
          "transcript_formatted": "Tomorrow is a you day.",
          "transcript_raw": "TOMORROW IS A YOU DAY",
          "start_time_ms": 180,
          "duration_ms": 1425,
          "confidence": 0.130
        }
        // ...
        // Other alternative hypotheses.
        // ...
    ],
    "cnet": {
      "links": [
        { 
          "start_time_ms": 180,
          "duration_ms": 800,
          "arcs": [
            { "word": "TOMORROW", "confidence": 1.0 }
          ]
        },
        { 
          "start_time_ms": 980,
          "duration_ms": 120,
          "arcs": [
            { "word": "IS", "confidence": 1.0 }
          ]
        },
        { 
          "start_time_ms": 1100,
          "duration_ms": 120,
          "arcs": [
            { "word": "A", "confidence": 1.0 }
          ]
        },
        { 
          "start_time_ms": 1220,
          "duration_ms": 210,
          "arcs": [
            { "word": "NEW", "confidence": 0.870 },
            { "word": "YOU", "confidence": 0.130 }
          ]
        },
        { 
          "start_time_ms": 1450,
          "duration_ms": 155,
          "arcs": [
            { "word": "DAY", "confidence": 1.0 }
          ]
        }        
      ]
    }
  }
}
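  • Each cnet link holds competing word arcs with their confidences, so taking the top arc per link recovers the one-best raw transcript, while the remaining arcs expose the alternatives at each position. A minimal sketch over responses shaped like the example above (the helper name is illustrative):

```python
def cnet_best_path(cnet):
    """Join the highest-confidence arc of each link into a transcript."""
    return " ".join(
        max(link["arcs"], key=lambda arc: arc["confidence"])["word"]
        for link in cnet["links"]
    )

# The cnet from the example response above, timing fields omitted.
cnet = {
    "links": [
        {"arcs": [{"word": "TOMORROW", "confidence": 1.0}]},
        {"arcs": [{"word": "IS", "confidence": 1.0}]},
        {"arcs": [{"word": "A", "confidence": 1.0}]},
        {"arcs": [{"word": "NEW", "confidence": 0.870},
                  {"word": "YOU", "confidence": 0.130}]},
        {"arcs": [{"word": "DAY", "confidence": 1.0}]},
    ]
}
print(cnet_best_path(cnet))  # TOMORROW IS A NEW DAY
```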