
Batch Container API Guide

This guide walks you through using the Speechmatics v2.7 API to run the Speechmatics ASR Batch Container.

For information on getting started and accessing the Speechmatics software repository, please refer to the Speechmatics Container Quick Start Guide.

Transcription Output Format

The transcript output will consist of:

  • JSON format version (examples can be seen in the sections below)
    • V2.7 - used when the config.json configuration object is used (only supported approach)
  • Diarization information
    • Channel Diarization - channel labelling with relevant transcription in enclosed block
    • Speaker Diarization - identifying who is currently talking by labelling words in the JSON output with a label for each unique speaker
    • Speaker Change - identifying when a different speaker begins talking as an element in the JSON output, but not attempting to label words with their speaker
    • Speaker Change with Channel Diarization - Channel labelling with relevant transcription in enclosed block, speaker change elements additionally output at relevant sections
    • No diarization
  • Header information to show license expiry date
  • A full stop to delimit sentences, irrespective of the language being transcribed
  • The word, its confidence score and timing information for each transcribed word
  • Transcription output additionally available in TXT or SRT format
  • Notification information that can be used to generate callbacks
  • Metadata about the job that was submitted as part of an optional jobInfo file
  • Additional metadata about entities available when requested

Feature Usage

This section explains how to use additional features beyond plain transcription of speech to text.

As part of the Speechmatics V2.7 API, you must always use the config.json object unless otherwise specified in the examples below.

Please note: the V1 API is no longer maintained. Using environment variables to call speech features is neither recommended nor supported except where this document explicitly states otherwise.

Configuration Object

The configuration object allows you to process a file for transcription and optionally use speech features of the container. It is a JSON structure that is passed as a separate volume-mapped file (mapped to /config.json) when carrying out transcription. Here is an example of a command to run the container:

docker run -i -v ~/Projects/ba-test/data/audio.wav:/input.audio \
  -v ~/tmp/config.json:/config.json \
  batch-asr-transcriber-en:9.1.0

The configuration object is located at ~/tmp/config.json and is volume-mapped to /config.json inside the container. The command requests transcription of audio.wav in English. Below is an example of a config.json file where transcription in English is requested, with no additional speech features.

{
  "type": "transcription",
  "transcription_config": {
    "language": "en"
  }
}

You must always request:

  • the type of request you want. This is always transcription
  • The transcription_config
    • the language of the transcription output you want, within the transcription_config. The language code must be in two-letter ISO 639-1 format (e.g. if you want a file in English, the language code is always "en").

N.B. Each container can only output one language. Requests for a language other than the one supported will result in an error.

The configuration information requested within the config.json file will be shown in the JSON output before any transcript:

{
  "format": "2.7",
  "metadata": {
    "created_at": "2019-03-01T17:21:34.002Z",
    "type": "transcription",
    "transcription_config": {
      "language": "en"
    }
  }

Requesting an enhanced model

Speechmatics supports two different models within each language pack: a standard and an enhanced model. The standard model is the faster of the two, whilst the enhanced model provides higher accuracy at the cost of a slower turnaround time.

The enhanced model is a premium model. Please contact your account manager or Speechmatics if you would like access to this feature. You will require a new license which will provide you access to the enhanced model.

An example of requesting the enhanced model is below:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced"
  }
}

Please note: standard, as well as being the default option, can also be explicitly requested with the operating_point parameter.
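
For completeness, here is an example that explicitly requests the standard model; it is equivalent to omitting operating_point altogether:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "operating_point": "standard"
  }
}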

Enabling Logging for Usage Reporting

The enhanced model is a premium offering. When capturing information on audio duration for billing purposes, ensure that you record separately how many hours were processed with the standard model and how many hours were processed with the enhanced model.

Speaker Separation (Diarization)

Speechmatics offers four different modes for separating out different speakers in the audio:

| Type | Description | Use Case |
|---|---|---|
| speaker diarization | Aggregates all audio channels into a single stream for processing and picks out unique speakers based on acoustic matching. | Used in cases where there are multiple speakers embedded in the same audio recording and it's required to understand what each unique speaker said. |
| channel diarization | Transcribes each audio channel separately and treats each channel as a unique speaker. | Used when it's possible to record each speaker on separate audio channels. |
| speaker change (beta) | Provides the point in transcription when there is believed to be a new speaker. | Used when you just need to know that the speaker has changed, usually in a real-time application. |
| channel diarization & speaker change | Transcribes each audio channel separately and within each channel provides the point when there is believed to be a new speaker. | Used when it's possible to record some speakers on a separate audio channel, but some channels contain multiple speakers. |

Each of these modes can be enabled by setting the diarization property in the config. The following values are valid (the default is none, i.e. the transcript will not be diarized):

| Type | Config Value |
|---|---|
| speaker diarization | speaker |
| channel diarization | channel |
| speaker change | speaker_change |
| channel diarization & speaker change | channel_and_speaker_change |

All of the diarization options are requested through the config.json object.

Speaker Diarization

Speaker diarization aggregates all audio channels into a single stream for processing, and picks out different speakers based on acoustic matching.

By default the feature is disabled. To enable speaker diarization the following must be set when you are using the config object:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  }
}

When enabled, every word and punctuation object in the output results will be given a "speaker" property, which is a label indicating who said that word. There are two kinds of labels you will see:

  • S# - S stands for speaker and the # will be an incrementing integer identifying an individual speaker. S1 will appear first in the results, followed by S2 and S3 etc.
  • UU - Diarization is disabled or individual speakers cannot be identified. UU can appear for example if some background noise is transcribed as speech, but the diarization system does not recognise it as a speaker.

Note: Enabling diarization increases the amount of time taken to transcribe an audio file. In general we expect diarization to take roughly the same amount of time as transcription does, therefore expect the use of diarization to roughly double the overall processing time.

The example below shows relevant parts of a transcript with 3 speakers. The output shows the configuration information passed in the config.json object and relevant segments with the different speakers in the JSON output. Only part of the transcript is shown here to highlight how different speakers are displayed in the output.


"format": "2.7",
"metadata": {
    "created_at": "2020-07-01T13:26:48.467Z",
    "type": "transcription",
    "transcription_config": {
      "language": "en",
      "diarization": "speaker"
      }
  },
 "results": [
    {
      "alternatives": [
        {
          "confidence": 0.93,
          "content": "hello",
          "language": "en",
          "speaker": "S1"
        }
      ],
      "end_time": 0.51,
      "start_time": 0.36,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "hi",
          "language": "en",
          "speaker": "S2"
        }
      ],
      "end_time": 12.6,
      "start_time": 12.27,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "good",
          "language": "en",
          "speaker": "S3"
        }
      ],
      "end_time": 80.63,
      "start_time": 80.48,
      "type": "word"
    }

In our JSON output, start_time identifies when a person starts speaking each utterance and end_time identifies when they finish speaking.

Speaker diarization tuning

The sensitivity of the speaker detection is set to a sensible default that gives optimum performance under most circumstances. However, you can change this value based on your specific requirements by using the speaker_sensitivity setting in the speaker_diarization_config section of the job config object, which takes a value between 0 and 1 (the default is 0.5). A higher sensitivity increases the likelihood of more unique speakers being returned. For example, if you see fewer speakers returned than expected, you can try increasing the sensitivity value; if too many speakers are returned, try reducing it. The number of speakers detected is not guaranteed to change, since several other factors can affect it. Here's an example of how to set the value:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "speaker_diarization_config": {
        "speaker_sensitivity": 0.6
    }
  }
}

Speaker diarization post-processing

To enhance the accuracy of our speaker diarization, we make small corrections to the speaker labels based on the punctuation in the transcript. For example if our system originally thought that 9 words in a sentence were spoken by speaker S1, and only 1 word by speaker S2, we will correct the incongruous S2 label to be S1. This only works if punctuation is enabled in the transcript.

Therefore if you disable punctuation, for example by removing all permitted_marks in the punctuation_overrides section of the config.json, then you can expect the accuracy of speaker diarization to vary slightly.
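
As an illustration, the configuration below enables speaker diarization while removing all permitted punctuation marks; with a config like this, expect slightly less accurate speaker labels:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "punctuation_overrides": {
      "permitted_marks": []
    }
  }
}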

Speaker diarization timeout

Speaker diarization will timeout if it takes too long to run for a particular audio file. Currently the timeout is set to 5 minutes or 0.5 * the audio duration; whichever is longer. For example, with a 2 hour audio file the timeout is 1 hour. If a timeout happens the transcript will still be returned but without the speaker labels set.

If the diarization does timeout you will see an ERROR message in the logs that looks like this:

Speaker diarization took too long and timed out (X seconds).

If a timeout occurs then all speaker labels in the output will be labelled as UU.

Under normal operation we do not expect diarization to time out, but diarization can be affected by a number of factors, including audio quality and the number of speakers. If you do encounter timeouts frequently, please get in contact with Speechmatics support.

Channel Diarization

Channel diarization allows individual channels in an audio file to be labelled. This is ideal for audio files with multiple channels (up to 6) where each channel is a unique speaker.

By default the feature is disabled. To enable channel diarization the following must be set when you are using the config object:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel"
  }
}

The following illustrates an example configuration to enable channel diarization on a 2-channel file that will use labels Customer for channel 1 and Agent for channel 2:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["Customer", "Agent"]
  }
}

For each named channel, the words will be listed in its own labelled block, for example:


  {
  "format": "2.7",
  "metadata": {
    "created_at": "2020-07-01T14:11:43.534Z",
    "type": "transcription",
    "transcription_config": {
      "language": "en",
      "diarization": "channel",
      "channel_diarization_labels": ["Customer", "Agent"]
    }
  },
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.87,
          "content": "Hello",
          "language": "en"
        }
      ],
      "channel": "Customer",
      "end_time": 14.34,
      "start_time": 14.21,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.87,
          "content": "how",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 14.62,
      "start_time": 14.42,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.87,
          "content": "can",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 15.14,
      "start_time": 14.71,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.79,
          "content": "I",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 16.71,
      "start_time": 16.3,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.67,
          "content": "help",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 10.39,
      "start_time": 10.17,
      "type": "word"
    }

Note:

  • Transcript output is provided sequentially by channel. So if you have two channels, all of channel 1 would be output first, followed by all of channel 2, and so on
  • If you specify channel as a diarization option, and do not assign channel_diarization_labels then default labels will be used (channel_1, channel_2 etc)
  • Spaces cannot be used in the channel labels

Speaker Change Detection (beta feature)

This feature allows changes in the speaker to be detected and then marked in the transcript. It does not provide information about whether the speaker is the same as one earlier in the audio.

By default the feature is disabled. The config used to request speaker change detection looks like this:

{
  "type": "transcription",
  "transcription_config": {
    "diarization": "speaker_change",
    "speaker_change_sensitivity": 0.8
  }
}

Note: Speaker change is only visible in the JSON V2 output, so make sure you use the json-v2 format when you retrieve the transcript.

The speaker_change_sensitivity property, if used, must be a numeric value between 0 and 1. It indicates to the algorithm how sensitive to speaker change events you want to make it. A low value will mean that very few changes will be signalled (with higher possibility of false negatives), whilst a high value will mean you will see more changes in the output (with higher possibility of false positives). If this property is not specified, a default of 0.4 is used.

Speaker change elements appear in the results array of the JSON transcript and look like this:

{
  "type": "speaker_change",
  "start_time": 0.55,
  "end_time": 0.55,
  "alternatives": []
}

Note: Although there is an alternatives property in the speaker change element it is always empty, and can be ignored. The start_time and end_time properties are always identical, and provide the time when the change was detected.

A speaker change indicates where we think a different person has started talking. For example, if one person says "Hello James" and the other responds with "Hi", there should be a speaker_change element between "James" and "Hi", for example:

{
  "format": "2.7",
  "job": {
....
  "results": [
    {
      "start_time": 0.1,
      "end_time": 0.22,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hello",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "start_time": 0.22,
      "end_time": 0.55,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "James",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "start_time": 0.55,
      "end_time": 0.55,
      "type": "speaker_change",
      "alternatives": []
    },
    {
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    }
  ]
}

  • Note: You can only choose speaker_change as an alternative to speaker or channel diarization.

Speaker Change Detection With Channel Diarization

Speaker change can be combined with channel diarization. It will transcribe each channel separately and indicate in the output each channel (with labels if set) and the speaker changes on each of the channels. For example, if a two-channel audio contains three people greeting each other (with a single speaker on channel 1 and two speakers on channel 2), the config submitted with the audio to request the speaker change detection is:

{
  "type": "transcription",
  "transcription_config": {
    "diarization": "channel_and_speaker_change",
    "speaker_change_sensitivity": 0.8
  }
}

The output will have special elements in the results array between two words where a different person starts talking on the same channel.

{
    "format": "2.7",
    "job": {
....
    },
    "metadata": {
....
    },
    "results": [
    {
      "channel": "channel_2",
      "start_time": 0.1,
      "end_time": 0.22,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hello",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_2",
      "start_time": 0.22,
      "end_time": 0.55,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "James",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_1",
      "start_time": 0.55,
      "end_time": 0.55,
      "type": "speaker_change",
      "alternatives": []
    },
    {
      "channel": "channel_2",
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_1",
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    }
  ]
}

  • Note: Do not try to request speaker_change and channel diarization as multiple options: only channel_and_speaker_change is an accepted parameter for this configuration.

Custom dictionary

The Custom Dictionary feature allows a list of custom words to be added for each transcription job. This helps when a specific word is not recognised during transcription. It could be that it's not in the vocabulary for that language, for example a company or person's name. Adding custom words can improve the likelihood they will be output.

The sounds_like feature is an extension to this to allow alternative pronunciations to be specified to aid recognition when the pronunciation is not obvious.

The Custom Dictionary feature can be accessed through the additional_vocab property.

Prior to using this feature, consider the following:

  • sounds_like is an optional setting recommended when the pronunciation is not obvious for the word or it can be pronounced in multiple ways; it is valid just to provide the content value
  • sounds_like only works with the main script for that language
    • Japanese (ja) sounds_like only supports full width Hiragana or Katakana
  • You can specify up to 1000 words or phrases (per job) in your custom dictionary

An example of using additional_vocab with sounds_like is shown below:
"transcription_config": {
  "language": "en",
  "additional_vocab": [
    {
      "content": "gnocchi",
      "sounds_like": [
        "nyohki",
        "nokey",
        "nochi"
      ]
    },
    {
      "content": "CEO",
      "sounds_like": [
        "C.E.O."
      ]
    },
    {
      "content": "financial crisis"
    }
  ]
}

In the above example, the words gnocchi and CEO have pronunciations applied to them; the phrase financial crisis does not require a pronunciation. The content property represents how you want the word to be output in the transcript.
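
For reference, the same custom dictionary embedded in a complete config.json object would look like the sketch below (the words themselves are only illustrative):

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "additional_vocab": [
      {
        "content": "gnocchi",
        "sounds_like": ["nyohki", "nokey", "nochi"]
      },
      {
        "content": "financial crisis"
      }
    ]
  }
}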

Using the Shared Custom Dictionary Cache

Processing a large custom dictionary repeatedly can be CPU-intensive and inefficient. The Speechmatics Batch Container includes a cache mechanism for custom dictionaries to limit excessive resource use. By using this cache mechanism, the container can reduce the overall time needed for speech transcription when repeatedly using the same custom dictionaries. You will see performance benefits when re-using the same custom dictionary from the second use onwards.

It is not a requirement to use the shared cache to use the Custom Dictionary.

The cache volume is safe to use from multiple containers concurrently if the operating system and its filesystem support file locking operations. The cache can store multiple custom dictionaries in any language used for batch transcription. It can support multiple custom dictionaries in the same language.

If a custom dictionary is small enough to be stored within the cache volume, this will take place automatically if the shared cache is specified.

For more information about how the shared cache storage management works, please see Maintaining the Shared Cache.

We highly recommend you ensure any location you use for the shared cache has enough space for the number of custom dictionaries you plan to allocate there. How to allocate custom dictionaries to the shared cache is documented below.

How to set up the Shared Cache

The shared cache is enabled by setting the following value when running transcription:

  • Cache Location: You must volume map the directory location you plan to use as the shared cache to /cache when submitting a job
  • SM_CUSTOM_DICTIONARY_CACHE_TYPE: (mandatory if using the shared cache) This environment variable must be set to shared
  • SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE: (optional if using the shared cache). This determines the maximum size of any single custom dictionary that can be stored within the shared cache in bytes
    • E.g. setting SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE to a value of 10000000 would limit the size of any single cached custom dictionary to 10 MB
    • For reference a custom dictionary wordlist with 1000 words produces a cache entry of size around 200 kB, or 200000 bytes
    • A value of -1 will allow every custom dictionary to be stored within the shared cache. This is the default assumed value
    • A custom dictionary cache entry larger than the SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE will still be used in transcription, but will not be cached

Maintaining the Shared Cache

If you specify the shared cache to be used and your custom dictionary is within the permitted size, the Speechmatics Batch Container will always try to cache the custom dictionary. If a new custom dictionary does not fit in the shared cache because of other dictionaries already cached there, older custom dictionaries will be removed from the cache to free up as much space as necessary, starting with the least recently used.

Therefore, you must ensure your cache allocation is large enough to handle the number of custom dictionaries you plan to store. We recommend a relatively large cache (e.g. 50 MB) to avoid this situation if you are processing multiple custom dictionaries using the batch container. If you don't allocate sufficient storage, one or more custom dictionaries may be deleted when you are trying to store a new one.

It is recommended to use a docker volume with a dedicated filesystem of a limited size. If a user decides to use a volume that shares a filesystem with the host, it is the user's responsibility to purge the cache if necessary.
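
One possible way to create such a dedicated, size-limited volume is a tmpfs-backed docker volume, sketched below. Note this is only an illustration: a tmpfs volume lives in memory and does not persist across host reboots, so a disk-backed dedicated filesystem may suit long-lived caches better.

docker volume create --driver local \
  --opt type=tmpfs --opt device=tmpfs --opt o=size=50m \
  speechmatics-cache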

Creating the Shared Cache

In the example below, transcription is run with a local docker volume created for the shared cache. Custom dictionaries of up to 5 MB can be cached.

docker volume create speechmatics-cache

docker run -i -v /home/user/sm_audio.wav:/input.audio \
  -v /home/user/config.json:/config.json:ro \
  -e SM_CUSTOM_DICTIONARY_CACHE_TYPE=shared \
  -e SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE=5000000 \
  -v speechmatics-cache:/cache \
  -e LICENSE_TOKEN=f787b0051e2768bcee3231f619d75faab97f23ee9b7931890c05f97e9f550702 \
  batch-asr-transcriber-en:9.1.0

Viewing the Shared Cache

If everything is set up correctly and the cache has been used for the first time, a single entry should be present in the cache.

The following example shows how to check what Custom Dictionaries are stored within the cache. This will show the language, the sampling rate, and the checksum value of the cached dictionary entries.

ls $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary
en,16kHz,db2dd9c0d10faa8006d8a3fabc86aef6b6e27b3ccbd2a945d3aae791c627f0c5

Reducing the Shared Cache Size

Cache size can be reduced by removing some or all cache entries.

rm -rf $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary/*

Manually purging the cache

Before manually purging the cache, ensure that no containers have the volume mounted, otherwise an error during transcription might occur. Consider creating a new docker volume as a temporary cache while performing purging maintenance on the cache.
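
A possible maintenance sequence, assuming the volume names used elsewhere in this guide, is sketched below:

# create a temporary cache volume to use while the main one is purged
docker volume create speechmatics-cache-tmp

# point any jobs submitted during maintenance at the temporary volume
docker run -i -v /home/user/sm_audio.wav:/input.audio \
  -v /home/user/config.json:/config.json:ro \
  -e SM_CUSTOM_DICTIONARY_CACHE_TYPE=shared \
  -v speechmatics-cache-tmp:/cache \
  batch-asr-transcriber-en:9.1.0

# once no container has the main cache mounted, purge it
rm -rf $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary/*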

Output Locale

You can optionally specify the language locale to be used when generating the transcription output, so that words are spelled correctly, for cases where the model language is generic and doesn't already imply the locale.

The following locales are supported in the Global English language pack:

  • en-AU: supports Australian English
  • en-GB: supports British English
  • en-US: supports American English

The output_locale configuration setting is used for this. As an example, the following configuration uses the Global English (en) language pack with an output locale of British English (en-GB):

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "output_locale": "en-GB"
  }
}

The following locales are supported for Chinese Mandarin; the default is Simplified Mandarin. An example request follows the list below.

  • Simplified Mandarin (cmn-Hans)
  • Traditional Mandarin (cmn-Hant)
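
As a sketch, a request for Traditional Mandarin output with the Mandarin container (assuming the cmn language pack code) would look like this:

{
  "type": "transcription",
  "transcription_config": {
    "language": "cmn",
    "output_locale": "cmn-Hant"
  }
}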

Advanced punctuation

All Speechmatics language packs support Advanced Punctuation. This uses machine learning techniques to add in more naturalistic punctuation, improving the readability of your transcripts.

The following punctuation marks are supported for each language:

| Language(s) | Supported Punctuation | Comment |
|---|---|---|
| Cantonese, Mandarin | , 。 ? ! 、 | Full-width punctuation supported |
| Japanese | 。 、 | Full-width punctuation supported |
| Hindi | । ? , ! | |
| All other languages | . , ! ? | |

If you do not want to see any of the supported punctuation marks in the output, then you can explicitly control this through the punctuation_overrides settings, for example:

"transcription_config": {
   "language": "en",
   "punctuation_overrides": {
      "permitted_marks":[ ".", "," ]
   }
}

This will exclude exclamation and question marks from the returned transcript.

All Speechmatics output formats support Advanced Punctuation. JSON output places punctuation marks in the results list marked with a type of "punctuation".
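
For illustration, a punctuation entry in the results list looks broadly like the sketch below (the field values shown are indicative only):

{
  "type": "punctuation",
  "start_time": 0.51,
  "end_time": 0.51,
  "alternatives": [
    {
      "confidence": 1.0,
      "content": "."
    }
  ]
}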

Note: Disabling punctuation may slightly harm the accuracy of speaker diarization. Please see the "Speaker diarization post-processing" section in these docs for more information.

Notifications

Speechmatics allows customers to receive callbacks to a web service they control. Speechmatics will then make an HTTP POST request once the transcription is available. If you wish to enable notifications, you must add the notification_config, which is only available as part of the config.json object. It is separate from the transcription_config. The following parameters are available:

  • url: (mandatory) The URL to which a notification message will be sent upon completion of the job.
  • contents: (optional) Specifies a list of item(s) to be attached to the notification message. If you only want to receive a simple notification with no transcript or other data attached, ensure that the value here is [] rather than left empty. An example is provided in our Technical Migration Guide. If only one item is listed, it will be sent as the body of the request with Content-Type set to an appropriate value such as application/octet-stream or application/json. If multiple items are listed they will be sent as named file attachments using the multipart content type. Examples of what can be sent include the following:
    • jobinfo: A summary of the job. This will only be provided if you provide a jobinfo.json file when submitting a file for transcription. Please see the relevant section for information
    • transcript: The transcript in json-v2 format
    • transcript.json-v2: The transcript in json-v2 format.
    • transcript.txt: The transcript in txt format.
    • transcript.srt: The transcript in srt format.
  • method: (optional) the method to be used with HTTP and HTTPS URLs. If no option is chosen, the default is POST, but PUT is also supported.
  • auth_headers: (optional) A list of additional headers to be added to the notification request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token.

If you want to upload content directly to an object store, for example Amazon S3, you must ensure that the URL grants the Speechmatics container appropriate permissions when carrying out notifications. Pre-authenticated URLs, generated by an authorised user, allow non-trusted devices to upload to object stores. AWS does this by generating pre-signed URLs. Microsoft Azure allows similar access via Shared Access Signatures.

Please see the URL Fetching section below for details of how to pull files from online storage locations for transcription, and for more information on pre-authenticated URLs.

An example request for transcription in English with notification_config is shown below:

{
     "type": "transcription",
     "transcription_config": { "language": "en" },
     "notification_config": [
       {
         "url": "https://collector.example.org/callback",
         "contents": [ "transcript", "data" ],
         "auth_headers": ["Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhb"]
       }
     ]
   }

If the callback is unsuccessful, it will be retried up to three times in total. If, after three attempts, it is still unsuccessful, the transcript will only be produced via STDOUT.

How to generate multiple transcript formats

In addition to our primary JSON format, the Speechmatics container can output transcripts in plain text (TXT) and SubRip (SRT) subtitle format. This can be done by passing the --all-formats parameter followed by an <$EXAMPLE_DIRECTORY> within the transcription request. The <$EXAMPLE_DIRECTORY> is where all supported transcript formats will be saved. Users can also use --allformats to generate the same response.

This directory must be mounted into the container so the transcripts can be retrieved after the container finishes. You will receive a transcript in all currently supported formats: JSON, TXT, and SRT.

The following example shows how to use the --all-formats parameter. In this scenario, after processing the file, three separate transcripts would be found in the ~/tmp/output directory, one each in JSON, TXT, and SRT format.

docker run \
  -v ~/Projects/ba-test/data/shipping-forecast.wav:/input.audio \
  -v ~/tmp/config.json:/config.json \
  -v ~/tmp/output:/example_output_dir_name \
  batch-asr-transcriber-en:9.1.0 \
  --all-formats /example_output_dir_name

SubRip Subtitles

SubRip (SRT) is a subtitling format that can be used to generate subtitles for video content or other workflows. Our SRT output generates a transcript together with corresponding alignment timestamps. We follow best practice as recommended by major broadcasters in our default line length and number of lines output.

You can change the maximum number of lines supported, and the maximum character space within a line, by using configuration options as part of the output_config, which is part of the overall config.json object described below:

{
  "type": "transcription",
  "transcription_config": {
    ...
  },
  "output_config": {
    "srt_overrides": {
      "max_line_length": 37,
      "max_lines": 2
    }
  }
}

  • max_line_length: sets maximum count of characters per subtitle line including white space (default: 37).
  • max_lines: sets maximum count of lines in a subtitle section (default: 2).

URL Fetching

If you want to access a file stored in cloud storage, for example AWS S3 or Azure Blob Storage, you can use the fetch_data parameter within the config.json object. The fetch_data parameter specifies a cloud storage location.

You must ensure the URL you provide grants Speechmatics appropriate privileges to access the necessary files, otherwise this will result in a transcription error. Cloud providers like AWS and Azure allow temporary access for non-privileged parties to download and upload objects in cloud storage via authenticated URLs generated by an authorised user. AWS recommends using pre-signed URLs to grant access when accessing objects from and uploading to S3. Azure recommends the use of shared access signatures when accessing from and uploading to Azure Storage. Speechmatics supports both of these options.

A pre-generated URL will contain authorization parameters within the URL. These can include information about how long the URL is valid for and what permissions access to the URL grants. More information is available in each cloud provider's documentation.

To successfully fetch data objects stored online using the Speechmatics container you must use the following parameters:

  • url: (mandatory if you want to access an online file) the location of the file
  • auth_headers: (optional) If your cloud storage solution requires authentication. The auth_headers parameter provides the headers necessary to access the resource. This is intended to support authentication or authorization when using http or https, for example by supplying an OAuth2 bearer token

An example is below:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en"
  },
  "fetch_data": {
    "url": "https://example.s3.amazonaws.com/folder/file.mp3?&AWSAccessKeyId=...&Expires=...&Signature=...",
    "auth_headers": ["Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhb"]
  }
}
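
A container run that uses fetch_data is sketched below; this assumes that no local /input.audio volume needs to be mapped when the audio is fetched from the URL specified in config.json:

docker run -i \
  -v ~/tmp/config.json:/config.json \
  batch-asr-transcriber-en:9.1.0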

How to track a file

The jobInfo file

You can optionally submit additional information to the batch container that can then be used as further tracking metadata. To do so you must submit a jobInfo file as a separate JSON object. This file is separate from the config.json object when submitting a request. The jobInfo file must include a unique id, the name and duration of the data file, and the UTC date the job was created. This information is then available in the job results and in callbacks.

When using a jobInfo file you must submit the following mandatory properties:

  • created_at - The UTC time the job was created at. An example is "2019-01-17T17:50:54.113Z"
  • data_name - The name of the file submitted as part of the job. An example is example.wav. This does not need to match the actual file name
  • duration - The length of the audio file. This must be an integer value in seconds and must be at least 0
  • id - A customer-unique ID that is assigned to a job. This is not a value provided by Speechmatics

Optional Metadata

You may also submit the following optional properties as part of metadata tracking. These are properties that are unique to your organisation that you may wish to or are required to track through a company workflow or where you are processing large amounts of files. This information will then be available in the jobInfo output and in notification callbacks:

  • tracking - Parent of the following child properties. If you are submitting metadata for tracking this must be included
    • title - The title of the job
    • reference - External system reference
    • tags - Any tags by which you associate files or data
    • details - Customer-defined JSON structure. These can include information valuable to you about the job

An example jobInfo.json file is below, with optional metadata included:

{
  "created_at": "2020-06-26T12:12:24.625Z",
  "data_name": "example_file",
  "duration": 5,
  "id": "1",
  "tracking": {
    "title": "ACME Q12018 Statement",
    "reference": "/data/clients/ACME/statements/segs/2018Q1-seg8",
    "tags": [
      "quick-review",
      "segment"
    ],
    "details": {
      "client": "ACME Corp",
      "segment": 8,
      "seg_start": 963.201,
      "seg_end": 1091.481
    }
  }
}

Running the JobInfo file

Here is an example of processing a file on the batch container with an example jobInfo file:

docker run -v /PATH/TO/FILE/jobInfo.json:/jobInfo.json \
  -v /PATH/TO/FILE/config.json:/config.json \ 
  -v /PATH/TO/FILE/audio.wav:/input.audio \
  -e LICENSE_KEY=$license batch-asr-transcriber-en:9.1.0

jobInfo Output Example

Here is an example of the json output when using a jobInfo file, with the first word of the transcript. You can see the output is divided into several sections:

  • The license information, including the time of build and number of days remaining
  • The information present in the jobInfo file, including any metadata or tracking information
  • The configuration information presented in the config.json file
  • The results of the transcript, including the word, confidence score, diarization information etc.
{
  "format": "2.7",
  "job": {
    "created_at": "2020-07-01T12:46:34.393Z",
    "data_name": "example.wav",
    "duration": 128,
    "id": "1",
    "tracking": {
      "details": {
        "client": "ACME Corp",
        "segment": 8,
        "seg_start": 963.201,
        "seg_end": 1091.481
      },
      "reference": "/data/clients/ACME/statements/segs/2018Q1-seg8",
      "tags": [
        "quick-review",
        "segment"
      ],
      "title": "ACME Q12018 Statement"
    }
  },
  "metadata": {
    "created_at": "2020-07-01T12:47:28.470Z",
    "type": "transcription",
    "transcription_config": {
      "language": "en",
      "diarization": "speaker"
    }
  },
  "results": [
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "This",
          "language": "en",
          "speaker": "S1"
        }
        ],
      "end_time": 1.98,
      "start_time": 1.86,
      "type": "word"
    }
  ]
}

N.B. When using the jobInfo file, the output will show two created_at parameters. The created_at under job is when the file was submitted for transcription. The created_at under metadata is when the output was produced. The time difference between the two provides the total transcription time, including any system delays as well as the actual time taken to process the job.
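
As a rough worked example using the timestamps above, the total transcription time can be computed on a GNU/Linux host with GNU date (sub-second parts ignored):

# metadata.created_at minus job.created_at from the example output
start=$(date -d "2020-07-01T12:46:34.393Z" +%s)
end=$(date -d "2020-07-01T12:47:28.470Z" +%s)
echo $((end - start))   # prints 54 (seconds)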

Word Tagging

Profanity Tagging

Speechmatics now outputs, in the JSON transcript only, a metadata tag to indicate whether a word is a profanity. This is available for the following languages:

  • English (EN)
  • Italian (IT)
  • Spanish (ES)

The list of profanities is not alterable. Users do not have to take any action to access this feature - it is provided in our JSON output as standard. Customers can use this tag for their own post-processing in order to identify, redact, or obfuscate profanities and integrate this data into their own workflows. An example of how this looks is below.

"results": [
{
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "$PROFANITY",
          "language": "en",
          "speaker": "UU",
          "tags": [
            "profanity"
          ]
        }
      ],
      "end_time": 18.03,
      "start_time": 17.61,
      "type": "word"
    }
]

Disfluency Tagging

Speechmatics now outputs, in the JSON transcript only, a metadata tag to indicate whether a word is a disfluency. This is supported for the English language only. A disfluency here refers to a set list of words in English that imply hesitation or indecision. Please note that while disfluency can cover a range of items like stuttering and interjections, here it is only used to tag words such as 'hmm' or 'umm'. Users do not have to take any action to access this feature - it is provided in our JSON output as standard. Customers can use this tag for their own post-processing workflows. An example of how this looks is below:

"results": [
{
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "hmm",
          "language": "en",
          "speaker": "UU",
          "tags": [
            "disfluency"
          ]
        }
      ],
      "end_time": 18.03,
      "start_time": 17.61,
      "type": "word"
    }
]

Domain Language Packs

Some Speechmatics language packs are optimized for specific domains where high accuracy for specific vocabulary and terminology is required. Using the domain parameter provides additional transcription accuracy, and must be used in conjunction with a standard language pack (this is currently limited to the "finance" domain and supports the "en" language pack). An example of how this looks is below:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "domain": "finance"
  }
}

These domain language packs are built on top of our global language packs, so they give the highest accuracy in the different acoustic environments that our customers have come to expect.

Please note that if you are using the "Finance" domain language pack you will need to use the "en-finance" container image, located at speechmatics-docker-public.jfrog.io/batch-asr-transcriber-en-finance. More details about how to pull container images can be found in the Speechmatics Container Quick Start Guide.
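
For example, pulling the finance image might look like the command below (the 9.1.0 tag is only illustrative; use the release version you have been provided):

docker pull speechmatics-docker-public.jfrog.io/batch-asr-transcriber-en-finance:9.1.0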

It is expected that, whilst there will be improvements for the specific domain, there can be some degradation in accuracy for content outside that domain.

Full API Reference

Below are the full API references for the config.json and the jobInfo.json files.

config.json API Reference

The config.json is constructed of multiple configuration settings, each of which is responsible for a separate section of the transcription output. All configuration settings are passed within the top-level object alongside type. Only type and transcription_config are mandatory.

  • type: (Mandatory) The type of request. This is always transcription
  • transcription_config: (Mandatory) Information about what language and features you want to use in the batch container
  • fetch_data: (Optional) If you wish to transcribe a file stored online, you may pass this within the config.json file
  • notification_config: (Optional) If you want to use callbacks, this documents where and how they are sent
  • output_config: (Optional) Used only if you want to retrieve transcripts in SRT format and alter the default settings for how the SRT output appears.

transcription_config

| Name | Type | Description | Required |
|---|---|---|---|
| language | string | Language model to process the audio input, normally specified as an ISO language code | Yes |
| domain | string | Request a specialized language pack optimized for a particular domain, e.g. "finance". Domain is only supported for selected languages. | No |
| additional_vocab | [object] | List of custom words or phrases that should be recognized. Alternative pronunciations can be specified to aid recognition. | No |
| punctuation_overrides | [object] | Control punctuation settings. Only valid with languages that support advanced punctuation. These are Arabic, Danish, Dutch, English, French, German, Malay, Spanish, Swedish and Turkish. | No |
| diarization | string | The default is none. You may specify options of speaker, channel, speaker_change, channel_and_speaker_change, or none | No |
| speaker_diarization_config | SpeakerDiarizationConfig | Configuration for speaker diarization. Includes speaker_sensitivity: range between 0 and 1. A higher sensitivity will increase the likelihood of more unique speakers being returned. For example, if you see fewer speakers returned than expected, you can try increasing the sensitivity value, or if too many speakers are returned try reducing this value. The default is 0.5. | No |
| speaker_change_sensitivity | float | Used for the speaker change feature. Range between 0 and 1. Controls how responsive the system is to potential speaker changes. A high value indicates high sensitivity. Defaults to 0.4. | No |
| channel_diarization_labels | [string] | Transcript labels to use when collating separate input channels. Only applicable when you have selected channel as a diarization option | No |
| output_locale | string | Only applicable with Global English. Correctly maps words to local spellings. Options are en-AU, en-GB, or en-US | No |
| operating_point | string | Specify whether to use a standard or enhanced model for transcription. By default the model used is standard | No |
| enable_entities | Boolean | Specify whether to enable entity types within JSON output, as well as additional spoken_form and written_form metadata. By default false | No |
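
As a small illustration of the last row, entity metadata can be requested like this:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "enable_entities": true
  }
}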

fetch_data

| Name | Type | Description | Required |
|---|---|---|---|
| url | string | The online location of the file. | Yes |
| auth_headers | [string] | A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token. | No |

speaker_diarization_config

Additional configuration for the Speaker Diarization feature.

| Name | Type | Description | Required |
|---|---|---|---|
| speaker_sensitivity | float | Used for the speaker diarization feature. Range between 0 and 1. A higher sensitivity will increase the likelihood of more unique speakers being returned. For example, if you see fewer speakers returned than expected, you can try increasing the sensitivity value, or if too many speakers are returned try reducing this value. The default is 0.5. | No |

notification_config

| Name | Type | Description | Required |
|---|---|---|---|
| url | string | The URL to which a notification message will be sent upon completion of the job. If only one item is listed, it will be sent as the body of the request with Content-Type set to an appropriate value such as application/octet-stream or application/json. If multiple items are listed they will be sent as named file attachments using the multipart content type. If contents is not specified, the transcript item will be sent as a file attachment named data_file, for backwards compatibility. If the job was rejected or failed during processing, that will be indicated by the status, and any output items that are not available as a result will be omitted. The body formatting rules will still be followed as if all items were available. The user-agent header is set to Speechmatics API V2 in all cases. | Yes |
| contents | [string] | Specifies a list of items to be attached to the notification message. When multiple items are requested, they are included as named file attachments. | No |
| method | string | The method to be used with http and https urls. The default is POST. | No |
| auth_headers | [string] | A list of additional headers to be added to the notification request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token. | No |

output_config

| Name | Type | Description | Required |
|---|---|---|---|
| srt_overrides | object | Parameters to override the defaults for SubRip (SRT) subtitle format. max_line_length: sets maximum count of characters per subtitle line including white space (default: 37). max_lines: sets maximum number of lines per subtitle segment (default: 2). | No |

jobInfo reference

| Name | Type | Description | Required |
|---|---|---|---|
| created_at | dateTime | The UTC date time the job was created. | Yes |
| data_name | string | Name of the data file submitted for the job. | No |
| duration | integer | The file duration (in seconds). | No |
| tracking | object | Additional tracking information | No |

tracking metadata within the jobInfo file

The following information can be passed within the tracking object as part of the jobInfo file.

| Name | Type | Description | Required |
|---|---|---|---|
| title | string | The title of the job. | No |
| reference | string | External system reference. | No |
| tags | [string] | A set of keywords | No |
| details | object | Customer-defined JSON structure. | No |