This guide will walk you through using the Speechmatics V2.4 API to invoke features of the Speechmatics Batch Container.
For information on getting started and accessing the Speechmatics software repository, please refer to the Speechmatics Container Quick Start Guide.
The transcript output will include information from:

- the config.json configuration object, if used (the only supported approach)
- the jobInfo file

This section explains how to use additional features beyond plain transcription of speech to text.
As part of the Speechmatics V2.4 API, you must use the config.json object unless otherwise specified in the examples below.
Please note: the V1 API is no longer maintained. Using environment variables to invoke speech features is neither recommended nor supported except where this document explicitly states otherwise.
The config object, if used, is a JSON structure that is passed as a separate volume-mapped file (mapped to /config.json
) when carrying out transcription like this:
docker run -i -v ~/Projects/ba-test/data/shipping-forecast.wav:/input.audio \
-v ~/tmp/config.json:/config.json \
speechmatics-docker-example.jfrog.io/transcriber-en:7.0.0
Here is a simple example of a config object file (~/tmp/config.json
from the above example). It requests transcription in English and lists additional custom dictionary words as part of the additional_vocab
property:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"additional_vocab": ["Met Office", "Fitzroy", "Forties"]
}
}
The transcript output will also show the configuration information within the config.json
file, as shown below:
{
"format": "2.4",
"license": "productsteam build (Thu May 14 14:33:09 2020): 953 days remaining",
"metadata": {
"created_at": "2019-03-01T17:21:34.002Z",
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "none",
"additional_vocab": [
{
"content": "Met Office"
},
{
"content": "Fitzroy"
},
{
"content": "Forties"
}
]
}
},
Diarization is the ability to identify speakers in an audio file. This identification applies to a single audio file only. For audio files that contain multiple channels or streams, you can use channel diarization and apply custom labels to each channel or stream. If your audio file contains only a single channel or stream, you should choose speaker diarization. By default, containers transcribe a file with diarization disabled. In the JSON output, files transcribed without diarization will always show the speaker as 'UU'. You can also use speaker_change to detect changes in speaker and mark them in the transcript; detection of speaker change is done without identifying which segments were spoken by the same speaker. Finally, you can combine speaker change detection with channel diarization to identify both channels and speaker changes.
Note: Enabling diarization increases the amount of time taken to transcribe an audio file. The amount of time will vary depending on the length of the file.
To enable speaker diarization the following must be set when you are using the config object:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker"
}
}
When enabled, the output will contain speaker identifiers. These are explained below:

- M# - identifies a male speaker. The # will be a number identifying an individual male speaker
- F# - identifies a female speaker. The # will be a number identifying an individual female speaker
- UU - the speaker is not identified (or diarization is disabled)

The example below shows relevant parts of a transcript with 3 male speakers. The output shows the configuration information passed in the config.json object and relevant segments with the different speakers in the JSON output. Only part of the transcript is shown here to highlight how different speakers are displayed in the output.
"format": "2.4",
"license": "productsteam build (Thu May 14 14:33:09 2020): 953 days remaining",
"metadata": {
"created_at": "2020-07-01T13:26:48.467Z",
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker"
}
},
"results": [
{
"alternatives": [
{
"confidence": 0.93,
"content": "You",
"language": "en",
"speaker": "M2"
}
],
"end_time": 0.51,
"start_time": 0.36,
"type": "word"
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "When",
"language": "en",
"speaker": "M1"
}
],
"end_time": 12.6,
"start_time": 12.27,
"type": "word"
},
{
"alternatives": [
{
"confidence": 1.0,
"content": "And",
"language": "en",
"speaker": "M3"
}
],
"end_time": 80.63,
"start_time": 80.48,
"type": "word"
}
In our JSON output, start_time
identifies when a person starts speaking and end_time
identifies when they finish speaking.
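If you want a quick summary of how long each identified speaker talks for, the per-word timings can be aggregated outside the container. The snippet below is a sketch, not part of the product: it assumes the JSON transcript has been saved to a file called transcript.json and that the jq tool is installed.

# Sum per-word speaking time for each speaker label (sketch; transcript.json and jq are assumptions)
jq -r '
  [.results[] | select(.type == "word")]
  | group_by(.alternatives[0].speaker)
  | .[]
  | "\(.[0].alternatives[0].speaker): \(map(.end_time - .start_time) | add) seconds"
' transcript.json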
Channel diarization allows individual channels in an audio file to be labelled. This is ideal for audio files with multiple channels (up to 6). By default the feature is disabled. The following information is required within the config.json
object to enable channel diarization on a 2-channel file that will use labels Customer
and Agent
:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "channel",
"channel_diarization_labels": ["Customer", "Agent"]
}
}
If the config object file is called config.json
then you would start the transcription job like this:
docker run -i -v ~/Projects/ba-test/data/shipping-forecast.wav:/input.audio \
-v ~/tmp/config.json:/config.json \
speechmatics-docker-example.jfrog.io/transcriber-en:7.0.0
For each named channel, the words will be listed in its own labelled block, for example:
{
"format": "2.4",
"license": "productsteam build (Thu May 14 14:33:09 2020): 953 days remaining",
"metadata": {
"created_at": "2020-07-01T14:11:43.534Z",
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "channel"
}
},
"results": [
{
"alternatives": [
{
"confidence": 0.87,
"content": "I",
"language": "en"
}
],
"channel": "channel_1",
"end_time": 14.34,
"start_time": 14.21,
"type": "word"
},
{
"alternatives": [
{
"confidence": 0.87,
"content": "would",
"language": "en"
}
],
"channel": "channel_1",
"end_time": 14.62,
"start_time": 14.42,
"type": "word"
},
{
"alternatives": [
{
"confidence": 0.87,
"content": "love",
"language": "en"
}
],
"channel": "channel_1",
"end_time": 15.14,
"start_time": 14.71,
"type": "word"
},
{
"alternatives": [
{
"confidence": 0.79,
"content": "to",
"language": "en"
}
],
"channel": "channel_1",
"end_time": 16.71,
"start_time": 16.3,
"type": "word"
},
{
"alternatives": [
{
"confidence": 0.67,
"content": "To",
"language": "en"
}
],
"channel": "channel_2",
"end_time": 10.39,
"start_time": 10.17,
"type": "word"
},
{
"alternatives": [
{
"confidence": 0.64,
"content": "the",
"language": "en"
}
],
"channel": "channel_2",
"end_time": 10.68,
"start_time": 10.52,
"type": "word"
},
{
"alternatives": [
{
"confidence": 0.71,
"content": "unknown",
"language": "en"
}
],
"channel": "channel_2",
"end_time": 11.27,
"start_time": 10.75,
"type": "word"
}
Note: If you specify channel as a diarization option and do not assign channel_diarization_labels, then default labels will be used (channel_1, channel_2, etc.).

Speaker change detection allows changes in the speaker to be detected and then marked in the transcript. Typically it is used to make changes in the user interface to indicate to the reader that someone else is talking. Detection of speaker change is done without identifying which segments were spoken by the same speaker. The config used to request speaker change detection looks like this:
{
"type": "transcription",
"transcription_config": {
"diarization": "speaker_change",
"speaker_change_sensitivity": 0.8
}
}
Note: Speaker change is only recorded as JSON V2 output, so make sure you use the json-v2
format when you retrieve the transcript.
The speaker_change_sensitivity
property, if used, must be a numeric value between 0 and 1. It indicates to the algorithm how sensitive to speaker change events you want to make it. A low value will mean that very few changes will be signalled (with higher possibility of false negatives), whilst a high value will mean you will see more changes in the output (with higher possibility of false positives). If this property is not specified, a default of 0.4 is used.
Speaker change elements in the results
array appear like this:
{
"type": "speaker_change",
"start_time": 0.55,
"end_time": 0.55,
"alternatives": []
}
Note: Although there is an alternatives
property in the speaker change element it is always empty, and can be ignored. The start_time
and end_time
properties are always identical, and provide the time when the change was detected.
A speaker change indicates where we think a different person has started talking. For example, if one person says "Hello James" and the other responds with "Hi", there should be a speaker_change
element between "James" and "Hi", for example:
{
"format": "2.4",
"job": {
....
"results": [
{
"start_time": 0.1,
"end_time": 0.22,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hello",
"language": "en",
"speaker": "UU"
}
]
},
{
"start_time": 0.22,
"end_time": 0.55,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "James",
"language": "en",
"speaker": "UU"
}
]
},
{
"start_time": 0.55,
"end_time": 0.55,
"type": "speaker_change",
"alternatives": []
},
{
"start_time": 0.56,
"end_time": 0.61,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hi",
"language": "en",
"speaker": "UU"
}
]
}
]
}
Speaker change can be combined with channel diarization. It will process channels separately and indicate in the output both the channels and the speaker changes. For example, if a two-channel audio contains two people greeting each other (both recorded over the same channel), the config submitted with the audio can request the speaker change detection:
{
"type": "transcription",
"transcription_config": {
"diarization": "channel_and_speaker_change",
"speaker_change_sensitivity": 0.8
}
}
The output will have special elements in the results
array between two words where a different person starts talking. For example, if one person says "Hello James" and the other responds with "Hi", there will be a speaker_change
JSON element between "James" and "Hi".
{
"format": "2.4",
"job": {
....
},
"metadata": {
....
},
"results": [
{
"channel": "channel_1",
"start_time": 0.1,
"end_time": 0.22,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hello",
"language": "en",
"speaker": "UU"
}
]
},
{
"channel": "channel_1",
"start_time": 0.22,
"end_time": 0.55,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "James",
"language": "en",
"speaker": "UU"
}
]
},
{
"channel": "channel_1",
"start_time": 0.55,
"end_time": 0.55,
"type": "speaker_change",
"alternatives": []
},
{
"channel": "channel_1",
"start_time": 0.56,
"end_time": 0.61,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hi",
"language": "en",
"speaker": "UU"
}
]
}
]
}
This feature allows a custom dictionary wordlist to be added to the container at runtime. Adding words increases the likelihood that they will appear in the final transcription. One custom dictionary can be provided for each audio file being transcribed.
Prior to using this feature, note that if a word does not sound the way it is written, you can use the sounds_like feature to increase the likelihood of recognition.
To enable this feature, you use the additional_vocab property of the config object:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"additional_vocab": [
"speechmagic",
"supercalifragilisticexpialidocious",
"Techcrunch",
"Yahoo! Answers"
]
}
}
The Custom Dictionary feature supports the sounds_like extension, which allows you to provide alternative pronunciations for words. For example, the phrases "North Utsire" and "South Utsire" could be added as follows:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"additional_vocab": [
{ "content": "North Utsire", "sounds_like": ["North at Sierra"]},
{ "content": "South Utsire", "sounds_like": ["South at Sierra"]},
"Fitzroy",
"Forties",
{ "content": "CEO", "sounds_like": ["C.E.O."]}
]
}
}
You can see the custom dictionary entries in the transcription output below as well.
Example response:
{
"format": "2.4",
"metadata": {
"created_at": "2020-07-01T14:36:15.297Z",
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "none",
"additional_vocab": [
{
"content": "North Utsire",
"sounds_like": [
"North at Sierra"
]
},
{
"content": "South Utsire",
"sounds_like": [
"South at Sierra"
]
},
{
"content": "Fitzroy"
},
{
"content": "Forties"
},
{
"content": "CEO",
"sounds_like": [
"C.E.O."
]
}
]
}
}
}
Note: additional_vocab
items that are multi-word phrases will be output as a single word (e.g. Yahoo! Answers would be a single content
item rather than two)
Processing a large custom dictionary repeatedly can be CPU-intensive and inefficient. The Speechmatics Batch Container includes a cache mechanism for custom dictionaries to limit excessive resource use. By using this cache mechanism, the container can reduce the overall time needed for speech transcription when repeatedly using the same custom dictionaries. You will see performance benefits when re-using the same custom dictionary from the second time onwards.
It is not a requirement to use the shared cache to use the Custom Dictionary.
The cache volume is safe to use from multiple containers concurrently if the operating system and its filesystem support file locking operations. The cache can store multiple custom dictionaries in any language used for batch transcription. It can support multiple custom dictionaries in the same language.
If a custom dictionary is small enough to be stored within the cache volume, this will take place automatically if the shared cache is specified.
For more information about how the shared cache storage management works, please see Maintaining the Shared Cache.
We highly recommend you ensure any location you use for the shared cache has enough space for the number of custom dictionaries you plan to allocate there. How to allocate custom dictionaries to the shared cache is documented below.
How to set up the Shared Cache
The shared cache is enabled by setting the following values when running transcription:

- Cache location: (mandatory if using the shared cache) the directory you plan to use as the shared cache must be volume-mapped to /cache when submitting a job
- SM_CUSTOM_DICTIONARY_CACHE_TYPE: (mandatory if using the shared cache) this environment variable must be set to shared
- SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE: (optional if using the shared cache) this determines the maximum size, in bytes, of any single custom dictionary that can be stored within the shared cache
  - For example, SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE with a value of 10000000 would set a maximum size of 10MB
  - A value of -1 will allow every custom dictionary to be stored within the shared cache. This is the default assumed value
  - Any custom dictionary larger than SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE will still be used in transcription, but will not be cached

Maintaining the Shared Cache
If you specify the shared cache and your custom dictionary is within the permitted size, the Speechmatics Batch Container will always try to cache the custom dictionary. If a custom dictionary cannot fit in the shared cache because of other cached custom dictionaries, older custom dictionaries will be removed from the cache to free up as much space as necessary for the new custom dictionary, starting with the least recently used.
Therefore, you must ensure your cache allocation is large enough to handle the number of custom dictionaries you plan to store. We recommend a relatively large cache (e.g. 50 MB) if you are processing multiple custom dictionaries with the batch container. If you don't allocate sufficient storage, one or more custom dictionaries may be deleted when you store a new one.
It is recommended to use a Docker volume with a dedicated filesystem of limited size. If you decide to use a volume that shares a filesystem with the host, it is your responsibility to purge the cache if necessary.
Creating the Shared Cache
In the example below, a local Docker volume is created for the shared cache and transcription is then run. It will allow custom dictionaries of up to 5MB to be cached.
docker volume create speechmatics-cache
docker run -i -v /home/user/sm_audio.wav:/input.audio \
-v /home/user/config.json:/config.json:ro \
-e SM_CUSTOM_DICTIONARY_CACHE_TYPE=shared \
-e SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE=5000000 \
-v speechmatics-cache:/cache \
-e LICENSE_KEY=f787b0051e2768bcee3231f619d75faab97f23ee9b7931890c05f97e9f550702 \
speechmatics-docker-example.jfrog.io/transcriber-en:7.0.0
Viewing the Shared Cache
If everything is set up correctly and the cache has been used for the first time, a single entry should be present in the cache.
The following example shows how to check what Custom Dictionaries are stored within the cache. This will show the language, the sampling rate, and the checksum value of the cached dictionary entries.
ls $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary
en,16kHz,db2dd9c0d10faa8006d8a3fabc86aef6b6e27b3ccbd2a945d3aae791c627f0c5
Reducing the Shared Cache Size
Cache size can be reduced by removing some or all cache entries.
rm -rf $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary/*
Before manually purging the cache, ensure that no containers have the volume mounted, otherwise an error during transcription might occur. Consider creating a new docker volume as a temporary cache while performing purging maintenance on the cache.
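For example, a hypothetical maintenance flow (a sketch only; the volume name speechmatics-cache-tmp is an assumption) could create a temporary cache volume for new jobs while the original volume is purged:

# Sketch: create a temporary cache volume for new jobs (mount it with -v speechmatics-cache-tmp:/cache),
# then purge the original volume once no container has it mounted
docker volume create speechmatics-cache-tmp
rm -rf $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary/*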
You can optionally specify the language locale to be used when generating the transcription output, so that words are spelled correctly, for cases where the model language is generic and does not already imply the locale.
Currently, Global English is the only language pack that supports different output locales. The following locales are supported:

- en-GB (British English)
- en-US (US English)
- en-AU (Australian English)
The output_locale
configuration setting is used for this. As an example, the following configuration uses the Global English (en) language pack with an output locale of British English (en-GB):
{
"type": "transcription",
"transcription_config": {
"language": "en",
"output_locale": "en-GB"
}
}
Some language models now support advanced punctuation. This uses machine learning techniques to add in more naturalistic punctuation to make the transcript more readable. As well as putting punctuation marks in more naturalistic positions in the output, additional punctuation marks such as commas (,) and exclamation and question marks (!, ?) will also appear.
There is no need to explicitly enable this in the job configuration; languages that support advanced punctuation will automatically output these marks. If you do not want to see these punctuation marks in the output, then you can explicitly control this through the punctuation_overrides
settings in the config.json file, for example:
{
"type": "transcription",
"transcription_config": {
"language": "en",
"punctuation_overrides": {
"permitted_marks": [".", ","]
}
}
}
Both the plain text and JSON outputs support punctuation. The JSON output places punctuation marks in the results list, marked with a type of "punctuation", so you can also filter on the output if you want to modify or remove punctuation (see the example after the sample output below).
A sample JSON output containing punctuation looks like this:
{
"alternatives": [
{
"confidence": 1,
"content": ",",
"language": "en",
"speaker": "UU"
}
],
"attaches_to": "previous",
"end_time": 10.15,
"is_eos": false,
"start_time": 10.15,
"type": "punctuation"
}
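If you prefer a transcript without punctuation entries at all, they can be filtered out of the JSON after transcription. The one-liner below is a sketch, not part of the product: it assumes the transcript was saved to transcript.json and that jq is installed.

# Remove all entries of type "punctuation" from the results array (sketch; transcript.json and jq are assumptions)
jq '.results |= map(select(.type != "punctuation"))' transcript.json > transcript_no_punctuation.json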
Note: Advanced punctuation is a V2 feature, so only the V2 output format will show advanced punctuation marks.
is_eos is a parameter only passed in the transcription output when advanced punctuation is used. EOS stands for 'end of sentence', and the parameter takes a Boolean value of either true or false.
If you specify the punctuation_overrides element for languages that do not yet support advanced punctuation, it will be ignored.
Speechmatics allows customers to receive callbacks to a web service they control.
Speechmatics will then make an HTTP POST request once the transcription is available.
If you wish to enable notifications, you must add the notification_config
only as part of the
config.json
object. This is separate to the transcription_config. The following
parameters are available:
- url: (mandatory) The URL to which a notification message will be sent upon completion of the job. If contents is empty, then the body of the message will be empty
- contents: (optional) Specifies a list of item(s) to be attached to the notification message. If only one item is listed, it will be sent as the body of the request with Content-Type set to an appropriate value such as application/octet-stream or application/json. If multiple items are listed they will be sent as named file attachments using the multipart content type. Examples of what can be sent include the following:
  - data: The audio file submitted for the job
  - jobinfo: A summary of the job. This will only be provided if you provide a jobInfo.json file when submitting a file for transcription. Please see the relevant section for information
  - transcript.json-v2: The transcript in json-v2 format
  - transcript.txt: The transcript in txt format
  - transcript.srt: The transcript in srt format
- method: (optional) The method to be used with HTTP and HTTPS URLs. If no option is chosen, the default is POST. PUT is now supported to allow uploading of content directly to an object store such as S3
- auth_headers: (optional) A list of additional headers to be added to the notification request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token

If you want to upload content directly to an object store, for example Amazon S3, you must ensure that the URL grants the Speechmatics container appropriate permissions when carrying out notifications. Pre-authenticated URLs, generated by an authorised user, allow non-trusted devices to upload to such stores. AWS does this by generating pre-signed URLs; Microsoft Azure allows similar access via Shared Access Signatures. An example of uploading directly to an object store with PUT is shown after the notification example below.
Please see the section How to transcribe files stored online below for details of how to pull files from online storage locations for transcription, and for more information on pre-authenticated URLs.
An example request for transcription in English with notification_config
is shown below:
{
"type": "transcription",
"transcription_config": { "language": "en" },
"notification_config": [
{
"url": "https://collector.example.org/callback",
"contents": [ "transcript", "data" ],
"auth_headers": ["Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhb"]
}
]
}
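As a further illustration, the sketch below shows a notification_config that uploads the json-v2 transcript directly to an object store using the PUT method. The bucket URL and its query parameters are placeholders for a pre-signed URL, not real values:

{
  "type": "transcription",
  "transcription_config": { "language": "en" },
  "notification_config": [
    {
      "url": "https://example-bucket.s3.amazonaws.com/transcripts/file.json?AWSAccessKeyId=...&Expires=...&Signature=...",
      "contents": [ "transcript.json-v2" ],
      "method": "PUT"
    }
  ]
}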
If the callback is unsuccessful, it will be retried up to three times in total. If it is still unsuccessful after three attempts, the transcript will only be available via STDOUT.
In addition to our primary JSON format, the Speechmatics container can output transcripts in plain text (TXT) and SubRip (SRT) subtitle formats.
This is done by passing the --allformats option followed by an <$EXAMPLE_DIRECTORY> parameter in the transcription request. The <$EXAMPLE_DIRECTORY> is where all supported transcript formats will be saved. You can also use --all-formats to generate the same response.
This directory must be mounted into the container so the transcripts can be retrieved after the container finishes. You will receive a transcript in all currently supported formats: JSON, TXT, and SRT.
The following example shows how to use the --allformats parameter. In this scenario, after processing the file, three separate transcripts would be found in the ~/tmp/output directory, in JSON, TXT, and SRT format.
docker run \
-v ~/Projects/ba-test/data/shipping-forecast.wav:/input.audio \
-v ~/tmp/config.json:/config.json \
-v ~/tmp/output:/example_output_dir_name \
speechmatics-docker-example.jfrog.io/transcriber-en:7.0.0 \
--allformats /example_output_dir_name
SubRip (SRT) is a subtitling format that can be used to generate subtitles for video content or other workflows. Our SRT output will generate a transcript together with corresponding alignment timestamps. We follow best practice as recommended by major broadcasters in our default line length and number of lines output.
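For reference, an SRT file consists of numbered cues, each with a start and end timestamp and one or more lines of text. The fragment below is purely illustrative; the timings and wording are invented for this example and are not taken from the transcripts above:

1
00:00:14,210 --> 00:00:16,710
I would love to

2
00:00:17,040 --> 00:00:19,380
visit the Met Office one day.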
You can change the maximum number of lines supported, and the maximum number of characters per line, by using configuration options as part of the output_config, which is part of the overall config.json object, as described below:
{
"type": "transcription",
"transcription_config": {
...
},
"output_config": {
"srt_overrides": {
"max_line_length": 37,
"max_lines": 2
}
}
}
- max_line_length: sets the maximum count of characters per subtitle line, including white space (default: 37)
- max_lines: sets the maximum count of lines in a subtitle section (default: 2)

If you want to access a file stored in cloud storage, for example AWS S3 or Azure Blob Storage, you can use the fetch_data parameter within the config.json object. The fetch_data parameter specifies a cloud storage location.
You must ensure the URL you provide grants Speechmatics appropriate privileges to access the necessary files, otherwise transcription will fail with an error. Cloud providers like AWS and Azure allow an authorised user to generate authenticated URLs that grant non-privileged parties temporary access to download and upload objects in cloud storage. AWS recommends using pre-signed URLs when accessing objects from, and uploading to, S3. Azure recommends shared access signatures when accessing from, and uploading to, Azure Storage. Speechmatics supports both of these options.
A pre-generated URL will contain authorization parameters within the URL itself. These can include how long the URL is valid for and what permissions it grants. More information is available in each cloud provider's documentation.
To successfully fetch data objects stored online using the Speechmatics container, you must use the following parameters:

- url: (mandatory if you want to access an online file) the location of the file
- auth_headers: (optional) used if your cloud storage solution requires authentication. The auth_headers parameter provides the headers necessary to access the resource. This is intended to support authentication or authorization when using http or https, for example by supplying an OAuth2 bearer token

An example is below:
{
"type": "transcription",
"transcription_config": {
"language": "en"
},
"fetch_data": {
"url": "https://example.s3.amazonaws.com/folder/file.mp3?&AWSAccessKeyId=...&Expires=...&Signature=..."
}
}
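If the storage endpoint expects authentication headers rather than a pre-authenticated URL, the auth_headers parameter can be supplied alongside the url. The example below is a sketch; the URL and bearer token are placeholders:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en"
  },
  "fetch_data": {
    "url": "https://storage.example.com/folder/file.mp3",
    "auth_headers": ["Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhb"]
  }
}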
The jobInfo file
You can optionally submit additional information to the batch container that can then be used as further tracking metadata. To do so you must submit a jobInfo file as a separate JSON object. This file is separate from the config.json object when submitting a request. The jobInfo file must include a unique id, the name and duration of the data file, and the UTC date the job was created. This information is then available in job results and in callbacks.
When using a jobInfo file you must submit the following mandatory properties:

- created_at - The UTC time the job was created at. An example is "2019-01-17T17:50:54.113Z"
- data_name - The name of the file submitted as part of the job. An example is example.wav. This does not need to match the actual file name
- duration - The length of the audio file. This must be an integer value in seconds and must be at least 0
- id - A customer-unique ID that is assigned to a job. This is not a value provided by Speechmatics

Optional Metadata
You may also submit the following optional properties as part of metadata tracking. These are properties unique to your organisation that you may wish to, or are required to, track through a company workflow, or where you are processing large numbers of files. This information will then be available in the jobInfo output and in notification callbacks:

- tracking - Parent of the following child properties. If you are submitting metadata for tracking this must be included
- title - The title of the job
- reference - External system reference
- tags - Any tags by which you associate files or data
- details - Customer-defined JSON structure. These can include information valuable to you about the job

An example jobInfo.json file is below, with optional metadata inserted:
{
"created_at": "2020-06-26T12:12:24.625Z",
"data_name": "example_file",
"duration": 5,
"id": "1",
"tracking": {
"title": "ACME Q12018 Statement",
"reference": "/data/clients/ACME/statements/segs/2018Q1-seg8",
"tags": [
"quick-review",
"segment"
],
"details": {
"client": "ACME Corp",
"segment": 8,
"seg_start": 963.201,
"seg_end": 1091.481
}
}
}
Running the JobInfo file
Here is an example of processing a file on the batch container with an example jobInfo file:
docker run -v /PATH/TO/FILE/jobInfo.json:/jobInfo.json \
-v /PATH/TO/FILE/config.json:/config.json \
-v /PATH/TO/FILE/audio.wav:/input.audio \
-e LICENSE_KEY=$license speechmatics-docker-prod-productsteam.jfrog.io/transcriber-en:7.0.0
jobInfo Output Example
Here is an example of the JSON output when using a jobInfo file, showing the first word of the transcript. You can see the output is divided into several sections:
{
"format": "2.4",
"license": "productsteam build (Thu May 14 14:33:09 2020): 953 days remaining",
"job": {
"created_at": "2020-07-01T12:46:34.393Z",
"data_name": "example.wav",
"duration": 128,
"id": "1",
"tracking": {
"details": {
"client": "ACME Corp",
"segment": 8,
"seg_start": 963.201,
"seg_end": 1091.481
},
"reference": "/data/clients/ACME/statements/segs/2018Q1-seg8",
"tags": [
"quick-review",
"segment"
],
"title": "ACME Q12018 Statement"
}
},
"metadata": {
"created_at": "2020-07-01T12:47:28.470Z",
"type": "transcription",
"transcription_config": {
"language": "en",
"diarization": "speaker"
}
},
"results": [
{
"alternatives": [
{
"confidence": 1.0,
"content": "This",
"language": "en",
"speaker": "M1"
}
],
"end_time": 1.98,
"start_time": 1.86,
"type": "word"
}
]
}
NB: When using the jobInfo file, the output will show two created_at parameters. The created_at under job is when the file was submitted for transcription. The created_at under metadata is when the output was produced. The time difference between the two gives the total transcription time, including any system delays as well as the actual time taken to process the job.
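If you want to calculate this difference automatically, the snippet below is a sketch, not part of the product: it assumes the container output was saved to output.json and that jq is installed.

# Derive the total transcription time in seconds from the two created_at values (sketch; output.json and jq are assumptions)
jq -r '
  def ts: sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601;
  "total transcription time: \((.metadata.created_at | ts) - (.job.created_at | ts)) seconds"
' output.json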
Below are the full API references for the config.json
and the jobInfo.json
files.
The config.json object is constructed of multiple configuration settings, each of which is responsible for a separate section of transcription output. All configuration settings are passed alongside the type property in the config.json object. Only transcription_config is mandatory.
transcription_config
Name | Type | Description | Required |
---|---|---|---|
language | string | Language model to process the audio input, normally specified as an ISO language code | Yes |
additional_vocab | [object] | List of custom words or phrases that should be recognized. Alternative pronunciations can be specified to aid recognition. | No |
punctuation_overrides | [object] | Control punctuation settings. Only valid with languages that support advanced punctuation. These are English, French, German, Spanish, Dutch, Malay, and Turkish. | No |
diarization | string | The default is none. You may specify options of speaker, channel, speaker_change, channel_and_speaker_change, or none | No |
channel_diarization_labels | [string] | Transcript labels to use when collating separate input channels. Only applicable when you have selected channel as a diarization option | No |
output_locale | string | Only applicable with Global English. Correctly maps words to local spellings. Options are en-AU, en-GB, or en-US | No |
fetch_data
Name | Type | Description | Required |
---|---|---|---|
url | string | The online location of the file. | Yes |
auth_headers | [string] | A list of additional headers to be added to the input fetch request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token. | No |
notification_config
Name | Type | Description | Required |
---|---|---|---|
url | string | The url to which a notification message will be sent upon completion of the job. If only one item is listed, it will be sent as the body of the request with Content-Type set to an appropriate value such as application/octet-stream or application/json . If multiple items are listed they will be sent as named file attachments using the multipart content type. If contents is not specified, the transcript item will be sent as a file attachment named data_file , for backwards compatibility. If the job was rejected or failed during processing, that will be indicated by the status, and any output items that are not available as a result will be omitted. The body formatting rules will still be followed as if all items were available. The user-agent header is set to Speechmatics API V2 in all cases. | Yes |
contents | [string] | Specifies a list of items to be attached to the notification message. When multiple items are requested, they are included as named file attachments. | No |
method | string | The method to be used with http and https urls. The default is POST. | No |
auth_headers | [string] | A list of additional headers to be added to the notification request when using http or https. This is intended to support authentication or authorization, for example by supplying an OAuth2 bearer token. | No |
output_config
Name | Type | Description | Required |
---|---|---|---|
srt_overrides | object | Parameters to override the defaults for SubRip (srt) subtitle format. - max_line_length : sets maximum count of characters per subtitle line including white space (default: 37). -max_lines : sets maximum number of lines per subtitle segment (default: 2). | No |
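Putting the configuration settings together, a complete config.json might look like the sketch below. It is assembled from the examples earlier in this guide; the URLs and values are illustrative only:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["Customer", "Agent"],
    "additional_vocab": [
      { "content": "North Utsire", "sounds_like": ["North at Sierra"] },
      "Fitzroy"
    ],
    "punctuation_overrides": {
      "permitted_marks": [".", ","]
    }
  },
  "fetch_data": {
    "url": "https://example.s3.amazonaws.com/folder/file.mp3?&AWSAccessKeyId=...&Expires=...&Signature=..."
  },
  "notification_config": [
    {
      "url": "https://collector.example.org/callback",
      "contents": ["transcript", "data"]
    }
  ],
  "output_config": {
    "srt_overrides": {
      "max_line_length": 37,
      "max_lines": 2
    }
  }
}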
jobInfo.json
Name | Type | Description | Required |
---|---|---|---|
created_at | dateTime | The UTC date time the job was created. | Yes |
data_name | string | Name of the data file submitted for the job. | No |
duration | integer | The file duration (in seconds). | No |
tracking | object | Additional tracking information | No |
The following information can be passed within the tracking object as part of the jobInfo file
Name | Type | Description | Required |
---|---|---|---|
title | string | The title of the job. | No |
reference | string | External system reference. | No |
tags | [string] | A set of keywords | No |
details | object | Customer-defined JSON structure. | No |
For a full jobInfo example, please see the example above in The jobInfo file section.