
How To Use the V2 API

Deprecation Note

The V1 API is now deprecated and will be removed by February 2022. Please follow the instructions below for using the V2 API.

This section will take you through how to send a file to the V2 Speech API in the Batch Virtual Appliance and receive a finished transcript. It will also show you how to configure the transcription to use supported speech features.

Key Differences in the V2 API

As part of the 3.7.0 release, the Batch Virtual Appliance supports the same V2 API that is present in the Speechmatics Cloud Offering, in addition to the existing V1 API.

The table below shows the key differences between the V2 API in the Speechmatics Cloud Offering, the V2 API in the Batch Virtual Appliance, and the V1 API in the Batch Virtual Appliance. If a feature is not mentioned, you should assume it is the same across all offerings.

| Feature | Batch Virtual Appliance - V2 API | Cloud Offering | Batch Virtual Appliance - V1 API |
|---------|----------------------------------|----------------|----------------------------------|
| Job ID | Sequential numeric string (e.g. '1') | Alphanumeric 10-character string (e.g. '4gnixktjxd') | Sequential numeric string (e.g. '1') |
| Submitting jobs | Uses the endpoint /v2/jobs/ | Uses the endpoint /v2/jobs/ | Uses the endpoint /v1/user/$USERID/jobs/ |
| Retrieving the status of one job | Uses the endpoint /v2/jobs/$JOBID | Uses the endpoint /v2/jobs/$JOBID | Uses the endpoint /v1/user/$USERID/jobs/$JOBID |
| Retrieving the submitted media file via API request | Not supported | Supported by a GET request to /v2/jobs/$JOBID/data | Not supported |
| Retrieving the transcript | Uses the endpoint /v2/jobs/$JOBID/transcript | Uses the endpoint /v2/jobs/$JOBID/transcript | Uses the endpoint /v1/user/$USERID/jobs/$JOBID/transcript |
| Retrieving logs for any one job | Supported via the endpoint /v2/jobs/$JOBID/log | Not yet supported | Not supported |
| Cancelling a running job | Supported using a DELETE request and the query parameter force=true | Not supported | Not supported |
| Supported output formats | json-v2, txt, srt | json-v2, txt, srt | json (legacy), json-v2, txt, srt |
| Notifications/callbacks | Supported via the configuration object | Supported via the configuration object | Supported via the callback parameter. Not all features in the V2 API or Cloud Offering are supported |
| Notification errors | Supported via the configuration object | Supported via the configuration object | Not supported |
| Notification user agent | Speechmatics-API/2.0 | Speechmatics-API/2.0 | Speechmatics API v1.0 |
| Supported items via callback | Transcript in all supported output formats, metadata information | Transcript in all supported output formats, metadata information, data file submitted | Transcript in all supported formats |
| Fetch URL | Supported via the configuration object | Supported via the configuration object | Not supported |
| Metadata | Supported only by the tracking object in the configuration object | Supported only by the tracking object in the configuration object | Supported only via the meta parameter |

Quick Start

This quick start guide will show you how to submit a media file for processing and then retrieve a transcript in the format of your choice via the V2 API, the recommended method of using the Batch Virtual Appliance. It will also show you how to optionally check the status of a job and delete it once it has completed.

Prerequisites

  • You have successfully imported, installed, and licensed the appliance of your choice as shown in the Installation Guide

Examples

All examples in this document use curl to make the REST API call from a command line. We recommend using retry parameters so that retry attempts can be made for at least one minute. With the curl command this is done with the --retry 5 --retry-delay 10 parameters. These have been omitted from the examples in this document for brevity.
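For example, a request made with the recommended retry behaviour would look like this (the jobs endpoint itself is described later in this guide):

curl --retry 5 --retry-delay 10 -X GET 'https://${APPLIANCE_HOST}/v2/jobs/'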

Note: If you are using a self-signed certificate (your own, or the Speechmatics certificate that is used by default), then you will see a warning like this when using the curl command to access the Speech API using HTTPS:

curl: (60) SSL certificate problem: self signed certificate

We recommend, if you are going to use the secure Speech API, that you upload your own SSL certificate (signed by a CA) to the appliance, to avoid this problem. See the Installation and Admin Guide for details of how to do this.
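If you do need to work against a self-signed certificate during testing, standard curl options can be used. The examples below are only a sketch; the CA certificate path is a placeholder:

# For testing only: -k / --insecure tells curl to skip certificate verification.
curl -k -X GET 'https://${APPLIANCE_HOST}/v2/jobs/'

# Preferred: verify against the CA certificate that signed the appliance certificate.
curl --cacert /path/to/ca.pem -X GET 'https://${APPLIANCE_HOST}/v2/jobs/'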

Submitting a Job

To successfully submit a job, you must send an HTTP POST request to your chosen endpoint with:

  • The request type. This is always transcription
  • The language you want your transcript in. This is submitted within the configuration object as part of the transcription_config, as a two-letter ISO 639-1 code, and is mandatory
    • For more details about the configuration object, please see the section on Configuring the V2 Speech API Request below
  • A media file in a supported format, or a URL address of a file location the appliance is authorised to fetch

An example is below for a transcript request in English:

curl -X POST 'https://${APPLIANCE_HOST}/v2/jobs/' \
   -F data_file=@example.wav \
   -F config='{
     "type": "transcription",
     "transcription_config": { "language": "en" }
   }'

If you are successful, you will receive an HTTP 201 response and a Job ID. A Job ID is a unique sequential numeric string. You will need this Job ID to retrieve any transcript generated.
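If you are scripting against the API, you can capture the returned Job ID directly. The sketch below assumes the jq tool is installed and that the response body contains the new Job ID in an id field:

JOBID=$(curl -s -X POST 'https://${APPLIANCE_HOST}/v2/jobs/' \
   -F data_file=@example.wav \
   -F config='{
     "type": "transcription",
     "transcription_config": { "language": "en" }
   }' | jq -r .id)
echo "Submitted job $JOBID"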

Checking on a Job Status

If you want to see the progress of an individual job you can make a GET request. You must include the Job ID you want to check in the GET request.

To retrieve the status of a job, run the following request:

curl -X GET 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID'

Here is an example of a successful response for a completed job:

{
    "jobs": [
        {
            "config": {
                "transcription_config": {
                    "language": "en"
                },
                "type": "transcription"
            },
            "created_at": "2020-12-08T09:49:39.907Z",
            "data_name": "Can robots care for us_.mp3",
            "duration": 379,
            "id": "1",
            "status": "done"
        }
    ]
}

In the response you will receive:

  • The configuration information used to submit that job
  • The time the job was created
  • The duration of the audio file measured in seconds
  • The status of the job. If it is finished, the job status should return done. If the job is still being processed it will return running.
  • The ID of the job you requested
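If you are scripting against the API, a simple way to wait for completion is to poll this endpoint until the status changes. The sketch below assumes the jq tool is installed and reads the status field shown in the example response above; for brevity it does not handle failed jobs:

until [ "$(curl -s 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID' | jq -r '.jobs[0].status')" = "done" ]; do
  sleep 10
done
echo "Job $JOBID is done"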

Checking the status of multiple submitted jobs

If you wish, you can retrieve all jobs submitted to the appliance within the last 24 hours by not including the job ID in the GET request. An example is below:

curl -X GET 'https://${APPLIANCE_HOST}/v2/jobs/'

If successful you will receive a 200 response and all available jobs:

{
    "jobs": [
        {
            "created_at": "2021-01-08T11:58:04.124Z",
            "data_name": "IsTheRecyclingSystemBroken.mp3",
            "duration": 377,
            "id": "2",
            "status": "running"
        },
        {
            "created_at": "2021-01-08T11:57:48.945Z",
            "data_name": "Can robots care for us_.mp3",
            "duration": 379,
            "id": "1",
            "status": "running"
        }
    ]
}

Please note that if you request to see all jobs, you will not see the configuration for each job. Configuration information can only be retrieved by requesting an individual job. If you have changed the clean-up job on the appliance to run at more frequent intervals than the default 24 hours, you will only see jobs posted after that clean-up job last ran.

You can now retrieve a transcript from the appliance.

Retrieving a Transcript

Here is an example request to retrieve a transcript from a completed job:

curl -X GET 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID/transcript'

You must include in the URL path the job ID that you received when you successfully submitted the transcription job.

If you request a transcript before it has finished processing, you will receive an HTTP 404 message. To avoid this, you can configure notifications so that you can retrieve transcripts via callback when completed. For details of setting up notifications, please see the 'Notifications' section below.

The default format for any transcript is json-v2. Speechmatics also supports transcripts in plain text (TXT) and SubRip (SRT) formats. To receive these formats, you must explicitly request them.

Here is an example of a request for a transcript in plain text (TXT) format:

curl -X GET 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID/transcript?format=txt' 

Here is an example of a request for a transcript in SRT format:

curl -X GET 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID/transcript?format=srt' 

You can receive transcripts in multiple output formats simultaneously via notifications requested in the initial POST submission.

You should now have been able to submit a file and retrieve a transcript.

Deleting a Job

You can delete a transcript using an HTTP DELETE request, but only once the job has finished processing. The default retention period for a transcript on the Batch Virtual Appliance is 24 hours. You can alter the configuration of the appliance to shorten this retention period via the Management API; how to do so is documented in the Installation Guide.

You must include in the request the Job ID of the job you wish to delete:

curl -X DELETE 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID/'

If you have successfully deleted the transcript, you will receive an HTTP 200 response and a summary of the job you have just deleted. An example is below:

{
    "job": {
        "config": {
            "transcription_config": {
                "language": "en"
            },
            "type": "transcription"
        },
        "created_at": "2020-12-10T15:38:33.866Z",
        "data_name": "Can robots care for us_.mp3",
        "duration": 379,
        "id": "5",
        "status": "deleted"
    }
}

You cannot delete multiple jobs at once.

Cancelling a Job

Via the V2 API, you are now able to cancel a running job. In this case, no transcript will be returned, and any seconds deducted for processing the transcript will be returned to the license.

To cancel a running job, use the query parameter force=true when sending a DELETE request. An example is below:

curl -X DELETE 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID/?force=true'

The response will show the job with a status of deleted. An example is below:

{
    "job": {
        "config": {
            "transcription_config": {
                "language": "en"
            },
            "type": "transcription"
        },
        "created_at": "2021-02-02T13:45:37.074Z",
        "data_name": "6MinuteEnglish-20200528-IsTheRecyclingSystemBroken.mp3",
        "duration": 378,
        "id": "9",
        "status": "deleted"
    }
}

By default the force flag is false. This means that a DELETE request for a running job without the force flag, or with force=false, will return HTTP 423 (Locked) and the transcript will continue to be processed. If the job has already finished, the request will be handled as a normal DELETE request (i.e. the transcript will be deleted, but no time will be returned to the appliance license).

Accessing the Docs on the Appliance

You can access a dashboard on the appliance from any browser by navigating to the following URL:

https://${APPLIANCE_HOST}/help

From here you can get to documentation links, and to Swagger UI pages which let you access the Management API.

Configuring the V2 Speech API Request

The following sections will show how to use the configuration object when submitting a request in order to use various Speechmatics features in the Batch Virtual Appliance. Where features are only supported in the V2 API this will be made explicit.

To configure any transcription request you must alter the relevant part of the configuration object:

  • fetch_data config: If you want to fetch a file stored in an online location
  • transcription_config: For any of the following features:
    • Diarization
    • Custom Dictionary/additional vocabulary
    • Output Locale
    • Advanced Punctuation
  • notification_config: For receiving any notifications from the appliance. You can receive updates as to the job's status, or the transcript once completed
  • output_config: For altering the presentation of transcripts; currently this only applies to SRT output
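For illustration, a single configuration object that combines several of these sections might look like the sketch below. The values are placeholders, each section is described in detail in the rest of this guide, and you only include the sections you need:

{
  "type": "transcription",
  "fetch_data": { "url": "https://example.com/media/recording.mp3" },
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "additional_vocab": [ "speechmatics" ],
    "output_locale": "en-GB"
  },
  "notification_config": [
    {
      "url": "https://collector.example.org/callback",
      "contents": [ "transcript" ]
    }
  ],
  "output_config": {
    "srt_overrides": {
      "max_line_length": 37,
      "max_lines": 2
    }
  }
}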

Fetch URL

The previous example showed how to create a job from a locally uploaded audio file. If you store your digital media in cloud storage (for example AWS S3 or Azure Blob Storage) you can also submit a job by providing the URL of the audio file. The configuration uses a fetch_data section, which looks like this:

curl -X POST 'https://${APPLIANCE_HOST}/v2/jobs' \
   -F config='{
      "type": "transcription",
      "transcription_config": { "language": "en" },
      "fetch_data": { "url": "https://s3.us-east-2.amazonaws.com/bucketname/jqld_/20180804102000/profile.m4v" }
    }'

A note on best practice

If you are using pre-signed URLs, please ensure these have not expired before sending them to the appliance, as the job will fail.

If you need additional authentication or authorization, the appliance supports an optional auth_headers parameter where these can be supplied, e.g. when using an OAuth2 Bearer token.
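A sketch of a fetch request that supplies an authorization header is shown below. This assumes the auth_headers parameter is supplied inside the fetch_data section; the URL and token are placeholders:

curl -X POST 'https://${APPLIANCE_HOST}/v2/jobs' \
   -F config='{
      "type": "transcription",
      "transcription_config": { "language": "en" },
      "fetch_data": {
        "url": "https://example.com/protected/audio.mp3",
        "auth_headers": [ "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhb" ]
      }
    }'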

Speaker Separation (Diarization)

Speechmatics offers four different modes for separating out different speakers in the audio:

| Type | Description | Use Case |
|------|-------------|----------|
| speaker diarization | Aggregates all audio channels into a single stream for processing and picks out unique speakers based on acoustic matching. | Used in cases where there are multiple speakers embedded in the same audio recording and it is required to understand what each unique speaker said. |
| channel diarization | Transcribes each audio channel separately and treats each channel as a unique speaker. | Used when it is possible to record each speaker on a separate audio channel. |
| speaker change (beta) | Provides the point in the transcription when there is believed to be a new speaker. | Used when you just need to know that the speaker has changed, usually in a real-time application. |
| channel diarization & speaker change | Transcribes each audio channel separately and, within each channel, provides the point when there is believed to be a new speaker. | Used when it is possible to record some speakers on separate audio channels, but some channels contain multiple speakers. |

Each of these modes can be enabled by using the diarization config. The default value is none, i.e. the transcript will not be diarized. The valid values are:

| Type | Config Value |
|------|--------------|
| speaker diarization | speaker |
| channel diarization | channel |
| speaker change | speaker_change |
| channel diarization & speaker change | channel_and_speaker_change |

Speaker Diarization

Speaker diarization aggregates all audio channels into a single stream for processing, and picks out different speakers based on acoustic matching.

By default the feature is disabled. To enable speaker diarization the following must be set when you are using the config object:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker"
  }
}

When enabled, every word and punctuation object in the output results will be given a "speaker" property, which is a label indicating who said that word. There are two kinds of labels you will see:

  • S# - S stands for speaker and the # will be an incrementing integer identifying an individual speaker. S1 will appear first in the results, followed by S2 and S3 etc.
  • UU - Diarization is disabled or individual speakers cannot be identified. UU can appear for example if some background noise is transcribed as speech, but the diarization system does not recognise it as a speaker.

Note: Enabling diarization increases the amount of time taken to transcribe an audio file. In general we expect diarization to take roughly the same amount of time as transcription does, therefore expect the use of diarization to roughly double the overall processing time.

The example below shows relevant parts of a transcript with 3 speakers. The output shows the configuration information passed in the config.json object and relevant segments with the different speakers in the JSON output. Only part of the transcript is shown here to highlight how different speakers are displayed in the output.


"format": "2.6",
"metadata": {
    "created_at": "2020-07-01T13:26:48.467Z",
    "type": "transcription",
    "transcription_config": {
      "language": "en",
      "diarization": "speaker"
      }
  },
 "results": [
    {
      "alternatives": [
        {
          "confidence": 0.93,
          "content": "hello",
          "language": "en",
          "speaker": "S1"
        }
      ],
      "end_time": 0.51,
      "start_time": 0.36,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "hi",
          "language": "en",
          "speaker": "S2"
        }
      ],
      "end_time": 12.6,
      "start_time": 12.27,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "good",
          "language": "en",
          "speaker": "S3"
        }
      ],
      "end_time": 80.63,
      "start_time": 80.48,
      "type": "word"
    }

In our JSON output, start_time identifies when a person starts speaking each utterance and end_time identifies when they finish speaking.

Speaker diarization post-processing

To enhance the accuracy of our speaker diarization, we make small corrections to the speaker labels based on the punctuation in the transcript. For example, if our system originally thought that 9 words in a sentence were spoken by speaker S1, and only 1 word by speaker S2, we will correct the incongruous S2 label to be S1. This only works if punctuation is enabled in the transcript.

Therefore, if you disable punctuation, for example by removing all permitted_marks in the punctuation_overrides section of the config.json, then expect the accuracy of speaker diarization to vary slightly.
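For example, a configuration that effectively disables punctuation by permitting no marks might look like this (see the Advanced Punctuation section below for more on punctuation_overrides):

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "speaker",
    "punctuation_overrides": {
      "permitted_marks": []
    }
  }
}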

Speaker diarization timeout

Speaker diarization will time out if it takes too long to run for a particular audio file. Currently the timeout is set to 5 minutes or half the audio duration, whichever is longer. For example, with a 2-hour audio file the timeout is 1 hour. If a timeout happens the transcript will still be returned, but without the speaker labels set.

If the diarization does timeout you will see an ERROR message in the logs that looks like this:

Speaker diarization took too long and timed out (X seconds).

If a timeout occurs then all speaker labels in the output will be labelled as UU.

Under normal operation we do not expect diarization to time out, but diarization can be affected by a number of factors including audio quality and the number of speakers. If you do encounter timeouts frequently, please get in contact with Speechmatics support.

Channel Diarization

Channel diarization allows individual channels in an audio file to be labelled. This is ideal for audio files with multiple channels (up to 6) where each channel is a unique speaker.

By default the feature is disabled. To enable channel diarization the following must be set when you are using the config object:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel"
  }
}

The following illustrates an example configuration to enable channel diarization on a 2-channel file that will use labels Customer for channel 1 and Agent for channel 2:

{
  "type": "transcription",
  "transcription_config": {
    "language": "en",
    "diarization": "channel",
    "channel_diarization_labels": ["Customer", "Agent"]
  }
}

For each named channel, the words will be listed in their own labelled block, for example:


  {
  "format": "2.6",
  "metadata": {
    "created_at": "2020-07-01T14:11:43.534Z",
    "type": "transcription",
    "transcription_config": {
      "language": "en",
      "diarization": "channel",
      "channel_diarization_labels": ["Customer", "Agent"]
    }
  },
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.87,
          "content": "Hello",
          "language": "en"
        }
      ],
      "channel": "Customer",
      "end_time": 14.34,
      "start_time": 14.21,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.87,
          "content": "how",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 14.62,
      "start_time": 14.42,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.87,
          "content": "can",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 15.14,
      "start_time": 14.71,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.79,
          "content": "I",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 16.71,
      "start_time": 16.3,
      "type": "word"
    },
    {
      "alternatives": [
        {
          "confidence": 0.67,
          "content": "help",
          "language": "en"
        }
      ],
      "channel": "Agent",
      "end_time": 10.39,
      "start_time": 10.17,
      "type": "word"
    }

Note:

  • Transcript output is provided sequentially by channel. So if you have two channels, all of channel 1 would be output first, followed by all of channel 2, and so on
  • If you specify channel as a diarization option, and do not assign channel_diarization_labels then default labels will be used (channel_1, channel_2 etc)
  • Spaces cannot be used in the channel labels

Speaker Change Detection (beta feature)

This feature allows changes in the speaker to be detected and then marked in the transcript. It does not provide information about whether the speaker is the same as one earlier in the audio.

By default the feature is disabled. The config used to request speaker change detection looks like this:

{
  "type": "transcription",
  "transcription_config": {
    "diarization": "speaker_change",
    "speaker_change_sensitivity": 0.8
  }
}

Note: Speaker change is only visible in the JSON V2 output, so make sure you use the json-v2 format when you retrieve the transcript.

The speaker_change_sensitivity property, if used, must be a numeric value between 0 and 1. It indicates to the algorithm how sensitive to speaker change events you want to make it. A low value will mean that very few changes will be signalled (with higher possibility of false negatives), whilst a high value will mean you will see more changes in the output (with higher possibility of false positives). If this property is not specified, a default of 0.4 is used.

Speaker change elements appear in the results array of the resulting JSON transcript and look like this:

{
  "type": "speaker_change",
  "start_time": 0.55,
  "end_time": 0.55,
  "alternatives": []
}

Note: Although there is an alternatives property in the speaker change element it is always empty, and can be ignored. The start_time and end_time properties are always identical, and provide the time when the change was detected.

A speaker change indicates where we think a different person has started talking. For example, if one person says "Hello James" and the other responds with "Hi", there should be a speaker_change element between "James" and "Hi", for example:

{
  "format": "2.6",
  "job": {
....
  "results": [
    {
      "start_time": 0.1,
      "end_time": 0.22,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hello",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "start_time": 0.22,
      "end_time": 0.55,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "James",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "start_time": 0.55,
      "end_time": 0.55,
      "type": "speaker_change",
      "alternatives": []
    },
    {
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    }
  ]
}
  • Note: You can only choose speaker_change as an alternative to speaker or channel diarization.

Speaker Change Detection With Channel Diarization

Speaker change can be combined with channel diarization. It will transcribe each channel separately and indicate in the output each channel (with labels if set) and the speaker changes on each of the channels. For example, if a two-channel audio contains three people greeting each other (with a single speaker on channel 1 and two speakers on channel 2), the config submitted with the audio to request the speaker change detection is:

{
  "type": "transcription",
  "transcription_config": {
    "diarization": "channel_and_speaker_change",
    "speaker_change_sensitivity": 0.8
  }
}

The output will have special elements in the results array between two words where a different person starts talking on the same channel.

{
    "format": "2.6",
    "job": {
....
    },
    "metadata": {
....
    },
    "results": [
    {
      "channel": "channel_2",
      "start_time": 0.1,
      "end_time": 0.22,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hello",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_2",
      "start_time": 0.22,
      "end_time": 0.55,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "James",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_1",
      "start_time": 0.55,
      "end_time": 0.55,
      "type": "speaker_change",
      "alternatives": []
    },
    {
      "channel": "channel_2",
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_1",
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    }
  ]
}
  • Note: Do not try to request speaker_change and channel diarization as multiple options: only channel_and_speaker_change is an accepted parameter for this configuration.

Custom Dictionary

The Custom Dictionary feature can be accessed through the additional_vocab property. This is a list of custom words or phrases that should be recognised but might not appear in our standard output, such as names, acronyms, or other technical terms.

Custom Dictionary also allows alternative pronunciations to be specified in order to aid recognition by use of the sounds_like parameter:

"transcription_config": {
    "language": "en",
    "additional_vocab": [
      {
        "content": "gnocchi",
        "sounds_like": [
          "nyohki",
          "nokey",
          "nochi"
        ]
      }
    ]
}

You can specify up to 1000 words or phrases (per job) in your custom dictionary.

Note: For the additional_vocab property you can provide a list of alternative pronunciations as an array of words or phrases using the sounds_like option. In the simple case where there is no difference in pronunciation, you can conflate the content and sounds_like fields into a single bareword or phrase (for example "speechmatics").

If you use the sounds_like property, words or phrases must not contain whitespace characters.
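For example, an additional_vocab list that mixes the simple bareword form with the full content and sounds_like form might look like this:

"transcription_config": {
    "language": "en",
    "additional_vocab": [
      "speechmatics",
      {
        "content": "gnocchi",
        "sounds_like": [
          "nyohki",
          "nokey",
          "nochi"
        ]
      }
    ]
}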

Output Locale

It is possible to specify the spelling rules to be used when generating the transcription, based on locale. The output_locale configuration setting is used for this. As an example, the following configuration uses the Global English (en) language pack with an output locale of British English (en-GB):

{ "type": "transcription",
  "transcription_config": {
    "language": "en",
    "output_locale": "en-GB"
  }
}

Currently, Global English is the only language pack that supports different output locales. The three locales that are available in this release are:

  • British English (en-GB)
  • US English (en-US)
  • Australian English (en-AU)

If no locale is specified then the ASR engine will use whatever spelling it has learnt as part of our language model training (in other words it will be based on the training data used).

Advanced Punctuation

Some language models now support advanced punctuation. This uses machine learning techniques to add in more naturalistic punctuation to make the transcript more readable. As well as putting punctuation marks in more naturalistic positions in the output, additional punctuation marks such as commas (,) and exclamation and question marks (!, ?) will also appear.

There is no need to explicitly enable this in the job configuration; languages that support advanced punctuation will automatically output these marks. If you do not want to see these punctuation marks in the output, then you can explicitly control this through the punctuation_overrides settings in the config.json file, for example:

"transcription_config": {
   "language": "en",
   "punctuation_overrides": {
      "permitted_marks":[ ".", "," ]
   }
}

Both plain text and JSON output support punctuation. JSON output places punctuation marks in the results list, marked with a type of "punctuation", so you can filter on the output if you want to modify or remove punctuation.

Sample JSON output (json-v2 only) containing punctuation looks like this:

{
  "alternatives": [
    {
      "confidence": 1,
      "content": ",",
      "language": "en",
      "speaker": "UU"
    }
  ],
  "attaches_to": "previous",
  "end_time": 10.15,
  "is_eos": false,
  "start_time": 10.15,
  "type": "punctuation"
}
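As an illustration of the filtering mentioned above, the following sketch removes punctuation objects from a saved transcript. It assumes the jq tool is installed and that the json-v2 transcript has been saved to transcript.json:

jq '.results |= map(select(.type != "punctuation"))' transcript.json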

If you specify the punctuation_overrides element for languages that do not yet support advanced punctuation then it is ignored.

The list of languages that support Advanced Punctuation is as follows:

  • Arabic
  • Danish
  • Dutch
  • English
  • French
  • German
  • Malay
  • Spanish
  • Swedish
  • Turkish

Note: Disabling punctuation may slightly harm the accuracy of speaker diarization. Please see the "Speaker diarization post-processing" section in these docs for more information.

Notifications

Customers can poll the appliance to check on the status of a job before making the call to retrieve the transcript. Where many jobs are being processed at scale, this may not be sustainable. A more convenient and recommended approach is to use notifications. This involves a callback to a web service that you control once a job is complete. An HTTP POST request is then made from the Speechmatics appliance once the transcript is available. PUT is also supported where specified.

The notification support offered in V1 has been extended and generalized in V2 to support a wider range of customer integration scenarios:

  • A callback does not need to include any data attachments, and can be just a signal that the transcript is ready to be fetched through the normal API.
  • Multiple pieces of content can be sent as multiple attachments in one request, allowing multiple combinations of the input(s) and output(s) of the job to be forwarded to another processing stage. The exception is the audio file, which is deleted upon completion of the transcript. Formatting options for outputs can be specified per attachment.
  • You can set up notifications to up to 3 different endpoints: for instance, you can send a jobinfo notification to one service and the transcript notification to another (a sketch is shown after this list).
  • Callbacks with a single attachment will send the content item as the HTTP request body, rather than using multipart mode. This allows writing an individual item to an object store like Amazon S3.
  • HTTP PUT methods are now supported to allow uploading of content directly to an object store such as S3.
    • A set of additional HTTP request headers can be specified in order:
      • To satisfy authentication / authorization requirements for systems that do not support auth tokens in query parameters.
      • To control behaviour of an object store or another existing service endpoint.
  • Multiple callbacks can be specified per job.
    • This allows sending individual pieces of content to different URLs.
    • It allows sending combinations of the inputs/outputs to multiple destinations, to support a fanout workflow.
    • Callbacks will be invoked in parallel and so may complete in any order. If a downstream workflow depends on getting several items of content delivered as separate callbacks (e.g. uploaded as separate items to S3), then the downstream processing logic will need to be robust to the ordering of upload content, and the possibility that only some might succeed.
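For illustration, the notification_config section for the two-endpoint scenario mentioned above might look like the following sketch (the endpoint URLs are placeholders; see 'Configuring the Callback' below for a complete request):

"notification_config": [
  {
    "url": "https://status.example.org/callback",
    "contents": [ "jobinfo" ]
  },
  {
    "url": "https://collector.example.org/callback",
    "contents": [ "transcript" ]
  }
]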

Important Notice

In the Batch Virtual Appliance, a user cannot request the audio file that was part of the original job submission.

Configuring the Callback

The callback is specified by using the notification_config within the config object. For example:

curl -X POST 'https://${APPLIANCE_HOST}/v2/jobs' \
   --form data_file=@example.wav \
   --form config='{
     "type": "transcription",
     "transcription_config": { "language": "en" },
     "notification_config": [
       {
         "url": "https://collector.example.org/callback",
         "contents": [ "transcript" ],
         "auth_headers": [
           "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhb"
         ]
       }
     ]
   }'

Accepting the Callback

You need to ensure that the service that you implement to receive the callback notification is capable of processing the Speechmatics transcript using the format that has been specified in the config JSON. When testing your integration you should check the error logs on your web service to ensure that notifications are being accepted and processed correctly.

The callback appends the job ID as a query string parameter with name id, as well as the status of the job. As an example, if the job ID is 100, you'd see the following POST request:

POST /callback?id=100&status=success HTTP/1.1
Host: collector.example.org

The user agent is Speechmatics-API/2.0.

Configuring your webserver to accept the Callback

Once transcription is complete and the transcript file is available, the Speechmatics Batch Virtual Appliance will send the transcript file in an HTTP POST request (unless otherwise specified) to the client web server specified in the notification_config config object. If the appliance does not receive a successful 2xx response, it will keep trying to send the file until it reaches the set timeout threshold.

If the client's web server cannot accept the file(s) because it is not configured with a large enough size limit, it will generate a 413 (Request Entity Too Large) response. If the appliance does not receive a 2xx response it will continue to retry sending the file. We recommend that users check their web server size limits to ensure they are adequate for the files that will be sent.
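One simple way to sanity-check the limit is to send a POST request with a payload at least as large as the biggest transcript you expect, and confirm that it is not rejected with HTTP 413. The sketch below assumes the callback endpoint used in the earlier examples and standard command-line tools:

# Create a 10 MB test payload and POST it, printing only the HTTP status code.
dd if=/dev/zero of=payload.bin bs=1M count=10
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
     --data-binary @payload.bin \
     'https://collector.example.org/callback?id=test&status=success'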

Metadata and Job Tracking

It is now possible to attach richer metadata to a job using the tracking configuration. This metadata can be used to identify transcripts for appropriate data storage and classification, especially where they may have passed through multiple systems, using whatever information is relevant to you. The tracking object contains the following properties:

| Name | Type | Description | Notes |
|------|------|-------------|-------|
| title | str | The title of the job. | [optional] |
| reference | str | External system reference. | [optional] |
| tags | list[str] | Customer-defined tags. | [optional] |
| details | object | Customer-defined JSON structure. | [optional] |

Here is an example:

curl  -X POST 'https://${APPLIANCE_HOST}/v2/jobs' \
   -H 'Authorization: Bearer NDFjOTE3NGEtOWVm' \
   --form data_file=@example.wav \
   --form config='{
     "type": "transcription",
     "transcription_config": { "language": "en" },
     "tracking": {
        "title": "ACME Q12018 Statement",
        "reference": "/data/clients/ACME/statements/segs/2018Q1-seg8",
        "tags": [ "quick-review", "segment" ],
        "details": {
           "client": "ACME Corp",
           "segment": 8,
           "seg_start": 963.201,
           "seg_end": 1091.481
         }
      }
   }'

SubRip Subtitling Format

SubRip (SRT) is a subtitling format that can be used to generate subtitles for video content or other workflows. Our SRT output will generate a transcript together with corresponding alignment timestamps. We follow best practice as recommended by major broadcasters in our default line length and number of lines output.

Speechmatics provides a default configuration for SRT output, for both the number of lines and the line length in characters. You can change these parameters by passing the configuration options described below. To alter the default parameters, you must make the changes within the configuration object:

{
  "type": "transcription",
  "transcription_config": {
    ...
  },
  "output_config": {
    "srt_overrides": {
      "max_line_length": 37,
      "max_lines": 2
    }
  }
}
  • max_line_length: sets maximum count of characters per subtitle line including white space (default: 37).
  • max_lines: sets maximum count of lines in a subtitle section (default: 2).
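For reference, SRT output consists of numbered cues, each with a start and end timestamp and up to max_lines lines of text. A short fragment (with illustrative content) looks like this:

1
00:00:00,360 --> 00:00:02,840
Hello and welcome to this
week's programme.

2
00:00:03,120 --> 00:00:05,480
Today we are asking whether
the recycling system is broken.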

Word Tagging

Speechmatics now outputs, in the JSON transcript only, a metadata tag to indicate whether a word is a profanity or a disfluency. This is for the English language pack only, and the lists of profanities and disfluencies are not alterable. Users do not have to take any action to access these tags; they are provided in our JSON output as standard. Customers can use this tag for their own post-processing in order to identify these types of words.

Profanity Tagging

An example of how a profanity would be tagged is below:

"results": [
{
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "$PROFANITY",
          "language": "en",
          "speaker": "UU",
          "tags": [
            "profanity"
          ]
        }
      ],
      "end_time": 18.03,
      "start_time": 17.61,
      "type": "word"
    }
]

Disfluency Tagging

A disfluency here refers to a set list of words in English that imply hesitation or indecision. Please note that while disfluency can cover a range of items like stuttering and interjections, here it is only used to tag words such as 'hmm' or 'umm'. An example of how this looks is below:

"results": [
{
      "alternatives": [
        {
          "confidence": 1.0,
          "content": "hmm",
          "language": "en",
          "speaker": "UU",
          "tags": [
            "disfluency"
          ]
        }
      ],
      "end_time": 18.03,
      "start_time": 17.61,
      "type": "word"
    }
]

Getting a Job log file

In case something unexpected happens with your transcription job, you can use the V2 API to retrieve logging for any job. This can be used for internal debugging and troubleshooting, or for providing more information to Speechmatics Support in the event of continued failure.

This feature is available only when a Job ID is generated and returned at audio submission time. If the audio upload fails and no Job ID is returned, the log will not be available. For example, if a user submits a job and gets a 401 error back rather than a Job ID, we won't provide logs via this endpoint. The transcription job log is available when the job finishes successfully, when there was an error with the file processing, or when the transcript retrieval failed (e.g. an HTTP 500 error when retrieving the transcript).

You must include the Job ID in the request to retrieve logs for any job. You can only request logs from one job ID at a time. Here is a simple example URL:

curl -X GET 'https://${APPLIANCE_HOST}/v2/jobs/$JOBID/log'