Batch Virtual Appliance API Guide
How to use the V2 API

Example Usage

The Speech API is a REST API that enables you to create and manage transcription jobs by uploading audio files to the Speechmatics Batch Virtual Appliance, and downloading the resulting transcriptions.

If you are familiar with the Speechmatics SaaS, you will find the Speech API for the Batch Virtual Appliance very similar. The main difference is that with the Speech API on the Batch Virtual Appliance there is no need to supply an API Authentication Token.

User ID in request URL

Although you do not need an API auth token, you do need to supply a User ID in the URL. This is a dummy value, kept to maintain compatibility with the V1 SaaS API: it can be any positive integer, and can be used to track transcription requests, since the ID you use is returned in every job response.

The base URI for the Speech API requests looks like this (HTTP):

http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/

or this (HTTPS):

https://${APPLIANCE_HOST}/v1/user/1/jobs/

Where ${APPLIANCE_HOST} is the IP address or hostname of the appliance you want to use. You must use port 8082 if you are using HTTP; if you use HTTPS, the default port 443 is used, so there is no need to specify it. A user ID of 1 is used for all the examples in this document (you can use any positive integer value). This guide shows HTTP URLs, but since appliance version 3.4.0 you can also use HTTPS URLs.

You can access a dashboard on the appliance from any browser by navigating to the following URL:

http://${APPLIANCE_HOST}:8080/help

From here you can get to documentation links, and to Swagger UI pages which let you access the Management API and the Speech API. The Speech API can be used to easily submit and view jobs from a browser:

http://${APPLIANCE_HOST}:8082/v1/docs

Unsupported parameters in Swagger UI

The Batch Virtual Appliance supports many of the same job settings as the V1 SaaS; however, the following fields are not supported and will produce a '501 Not Implemented' error if you attempt to use them: data_url, text_url, text_file and notification_email_address.

All examples in this document use curl to make the REST API call from a command line. We recommend using retry parameters, so that retry attempts can be made for at least one minute. With the curl command this is done with the --retry 5 --retry-delay 10 parameters. These have been omitted from the examples in this document for brevity.
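
For example, the job submission request shown later in this guide would look like this with the recommended retry options included (a sketch only; adjust the values to suit your environment):

curl --retry 5 --retry-delay 10 -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F data_file=@example.wav \
  -F 'config={ "type": "transcription", "transcription_config": { "language": "en" } }'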

Note: If you are using a self-signed certificate (your own, or the Speechmatics certificate that is used by default), then you will see a warning like this when using the curl command to access the Speech API over HTTPS:

curl: (60) SSL certificate problem: self signed certificate

If you are going to use the secure Speech API, we recommend uploading your own SSL certificate (signed by a CA) to the appliance to avoid this problem. See the Install and Admin Guide for details of how to do this.
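
If you need to call the secure Speech API before a CA-signed certificate is in place, curl offers two standard workarounds. This is a sketch only; the CA bundle path shown is a placeholder:

# Trust the CA that signed the appliance certificate (path is a placeholder)
curl --cacert /path/to/appliance_ca.pem "https://${APPLIANCE_HOST}/v1/user/1/jobs/"

# For testing only: skip certificate verification entirely
curl -k "https://${APPLIANCE_HOST}/v1/user/1/jobs/"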

All successful API calls return the HTTP body in JSON format along with status code 200 OK. Users of curl will see this displayed in their terminal; other interfaces may need a JSON parser.
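
If you are working from the command line, piping the response through a JSON processor such as jq (not part of the appliance; this assumes it is installed) makes the body easier to read or to extract fields from:

# Pretty-print the JSON body of a Speech API response (a job status request is shown later in this guide)
curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/" | jq .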

Submitting a Job

Sample Request

The simplest example to get going is to submit an audio file for transcription. This is done by making a POST request with the audio file and the language model you want to use:

curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F data_file=@example.wav \
  -F 'config={ "type": "transcription", "transcription_config": { "language": "en" } }'

The data_file form field is used to submit the audio, and a config object is passed in with details of how to transcribe the file (in this case, using the English language model).

Example Response

On successful submission of the job, the appliance returns a 200 OK status, along with JSON output showing the id of the job and a check_wait property indicating how many seconds to wait before checking whether the transcription is done.

The response headers returned will look like this:

{
  "date": "Wed, 24 Jul 2019 16:35:25 GMT",
  "server": "nginx/1.15.10",
  "connection": "keep-alive",
  "content-length": "1479",
  "content-type": "application/json"
}

And the body will comprise a JSON object like this:

{
  "balance": 0,
  "check_wait": 30,
  "cost": 0,
  "id": 111
}

Note: The balance and cost properties are not used by the Batch Virtual Appliance. They will always return zero values.

Retrieving job status

To check on the progress of a job, make a GET request using the job ID returned when it was submitted (111 in this example):

curl -X GET "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/"

Expected response:

{
  "job": {
    "check_wait": null,
    "created_at": "Thu May 28 13:16:47 2020",
    "duration": 4,
    "id": 111,
    "job_status": "done",
    "job_type": "transcription",
    "lang": "en",
    "meta": null,
    "name": "example.wav",
    "next_check": 0,
    "notification": "none",
    "transcription": "output.json",
    "user_id": 1
  }
}
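
The job_status field changes to done when the transcription is ready. A minimal polling sketch is shown below; it is not part of the official API, assumes jq is installed, and uses the job ID from the earlier submission:

JOB_ID=111
while true; do
  STATUS=$(curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/${JOB_ID}/" | jq -r '.job.job_status')
  [ "$STATUS" = "done" ] && break
  # You may also want to stop on any failure states reported by your appliance
  sleep 30   # or use the check_wait value returned when the job was submitted
done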

Retrieving transcript

The transcript can be requested in any of the formats described in the Output Formats section.

curl -X GET "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=txt"
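
The other formats listed in the Output Formats section are requested in the same way by changing the format query parameter, for example (here saving the SRT output to a file with curl's -o option):

curl -X GET "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=json-v2"

curl -X GET "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=srt" -o example.srt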

Supplying a Job Configuration

The config object parameter is used to pass the job configuration to the appliance, and is the recommended way of supplying it. The simplest config object specifies just the transcription language, for example:

{"type": "transcription",
   "transcription_config": {"language": "en"}
}

Example Configurations

Examples of configurations for specific features are shown in this section. See the complete reference section for more detail.

Speaker Diarization

There are two modes of diarization that can be used. The simplest, speaker diarization, aggregates all input channels into a single stream for processing, and picks out different speakers based on acoustic matching.

{"type": "transcription",
   "transcription_config": {
      "diarization": "speaker",
      "language": "en"
   }
}

Channel Diarization

The other diarization mode, channel diarization, processes multiple input channels or streams individually and applies a label to each (through use of the channel_diarization_labels list).

{"type": "transcription",
   "transcription_config": {
      "diarization": "channel",
      "channel_diarization_labels": ["Agent", "Caller"],
      "language": "en"
   }
}

Speaker Change Detection

This feature allows changes in speaker to be detected and marked in the transcript. Typically it is used to adjust the presentation of a transcript so that the reader can see when someone else starts talking. Speaker change detection only marks the points at which the speaker changes; it does not identify which segments were spoken by the same speaker. The config used to request speaker change detection looks like this:

{"type": "transcription",
   "transcription_config": {
      "diarization": "speaker_change",
      "speaker_change_sensitivity": 0.8
   }
}

Note: Speaker change is only included in json-v2 output, so make sure you use the json-v2 format when you retrieve the transcript.

The speaker_change_sensitivity property, if used, must be a numeric value between 0 and 1. It indicates to the algorithm how sensitive to speaker change events you want to make it. A low value will mean that very few changes will be signalled (with higher possibility of false negatives), whilst a high value will mean you will see more changes in the output (with higher possibility of false positives). If this property is not specified, a default of 0.4 is used.

Speaker change elements in the results array appear like this:

{
  "type": "speaker_change",
  "start_time": 0.55,
  "end_time": 0.55,
  "alternatives": []
}

Note: Although there is an alternatives property in the speaker change element it is always empty, and can be ignored. The start_time and end_time properties are always identical, and provide the time when the change was detected.

A speaker change indicates where we think a different person has started talking. For example, if one person says "Hello James" and the other responds with "Hi", there should be a speaker_change element between "James" and "Hi", for example:

{
    "format": "2.4",
    "job": {
....
    },
  "results": [
    {
      "start_time": 0.1,
      "end_time": 0.22,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hello",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "start_time": 0.22,
      "end_time": 0.55,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "James",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "start_time": 0.55,
      "end_time": 0.55,
      "type": "speaker_change",
      "alternatives": []
    },
    {
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    }
  ]
}
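
If you are post-processing the json-v2 output, the speaker change markers can be pulled out by filtering on the type field. A minimal sketch using jq (not part of the API; assumes jq is installed):

curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=json-v2" \
  | jq '.results[] | select(.type == "speaker_change")'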

Speaker Change Detection With Channel Diarization

Speaker change can be combined with channel diarization. It will process channels separately and indicate in the output both the channels and the speaker changes. For example, if a two-channel audio file contains two people greeting each other, with both speakers recorded on the same channel, the config submitted with the audio can request speaker change detection:

{"type": "transcription",
   "transcription_config": {
      "diarization": "channel_and_speaker_change",
      "speaker_change_sensitivity": 0.8
   }
}

The output will have special elements in the results array between two words where a different person starts talking. For example, if one person says "Hello James" and the other responds with "Hi", there will be a speaker_change JSON element between "James" and "Hi".

{
    "format": "2.4",
    "job": {
....
    },
    "metadata": {
....
    },
  "results": [
    {
      "channel": "channel_1",
      "start_time": 0.1,
      "end_time": 0.22,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hello",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_1",
      "start_time": 0.22,
      "end_time": 0.55,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "James",
            "language": "en",
            "speaker": "UU"
        }
      ]
    },
    {
      "channel": "channel_1",
      "start_time": 0.55,
      "end_time": 0.55,
      "type": "speaker_change",
      "alternatives": []
    },
    {
      "channel": "channel_1",
      "start_time": 0.56,
      "end_time": 0.61,
      "type": "word",
      "alternatives": [
        {
            "confidence": 0.71,
            "content": "Hi",
            "language": "en",
            "speaker": "UU"
        }
      ]
    }
  ]
}

Custom Dictionary

The Custom Dictionary feature can be accessed through the additional_vocab property. This is a list of custom words or phrases that should be recognized. Custom Dictionary Sounds is an extension to this to allow alternative pronunciations to be specified in order to aid recognition, or provide for alternative transcriptions.

{"type": "transcription",
   "transcription_config": {
      "language": "en",
      "additional_vocab": [
         { "content": "Ciarán", "sounds_like": [ "Kieran" ] },
         { "content": "FA", "sounds_like": [ "effeh" ] },
         "speechmatics"
      ]
   }
}

You can specify up to 1000 words or phrases per job in your custom dictionary.

Note: For the additional_vocab property you can provide the list of alternative pronunciations using an array of sounds_like words. In the simple case where there is no difference in pronunciation, you can conflate the content and sounds_like fields into a single bareword or phrase (for example "speechmatics"). The words in sounds_like must not contain whitespace characters. The additional_vocab property should be used for words that do not exist in the dictionary (in other words, they are 'out of vocabulary'). The Custom Dictionary feature is designed to be used in environments where there is contextual information available (proper names, technical terms or other unusual words) that is likely to be out-of-vocabulary.
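
For a large custom dictionary it can be more convenient to keep the config object in a file and have curl read the field value from it. A sketch, assuming the configuration above has been saved as config.json:

curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F data_file=@example.wav \
  -F 'config=<config.json'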

Output Locale

It is possible to specify the spelling rules to be used when generating the transcription, based on locale. The output_locale configuration setting is used for this. As an example, the following configuration uses the Global English (en) language pack with an output locale of British English (en-GB):

{ "type": "transcription",
  "transcription_config": {
    "language": "en",
    "output_locale": "en-GB"
  }
}

Currently, Global English is the only language pack that supports different output locales. The three locales that are available in this release are:

  • British English (en-GB)
  • US English (en-US)
  • Australian English (en-AU)

If no locale is specified then the ASR engine will use whatever spelling it has learnt as part of our language model training (in other words it will be based on the training data used).

Advanced Punctuation

Some language models (currently English, French, German and Spanish) support advanced punctuation. This uses machine learning techniques to add more naturalistic punctuation, making the transcript more readable. As well as placing punctuation marks in more natural positions in the output, it also adds marks such as commas (,), exclamation marks (!) and question marks (?).

There is no need to explicitly enable this in the job configuration; languages that support advanced punctuation will automatically output these marks. If you do not want to see these punctuation marks in the output, then you can explicitly control this through the punctuation_overrides settings in the config.json file, for example:

"transcription_config": {
   "language": "en",
   "punctuation_overrides": {
      "permitted_marks":[ ".", "," ]
   }
}

Both plain text and JSON output support punctuation. JSON output places punctuation marks in the results list, marked with a type of "punctuation", so you can filter on the output if you want to modify or remove punctuation.

Sample JSON output (json-v2 only) containing punctuation looks like this:

{
  "alternatives": [
    {
      "confidence": 1,
      "content": ",",
      "language": "en",
      "speaker": "UU"
    }
  ],
  "attaches_to": "previous",
  "end_time": 10.15,
  "is_eos": false,
  "start_time": 10.15,
  "type": "punctuation"
}
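
As noted above, because punctuation marks are typed, you can strip them from the json-v2 results yourself if you prefer to handle punctuation separately. A minimal sketch using jq (not part of the API; assumes jq is installed):

curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=json-v2" \
  | jq '.results |= map(select(.type != "punctuation"))'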

If you specify the punctuation_overrides element for languages that do not yet support advanced punctuation then it is ignored.

Output Formats

The default job transcription output format is json, which is the same format used by the Speechmatics SaaS. The json-v2 output format is also available and provides a richer set of information: use this format when you want to use newer features such as Channel Diarization. It is also possible to use plain text transcription output if you do not need timing information. The SubRip subtitling (SRT) format provides timing information as well as text; its output follows broadcasting best practice on maximum line length and maximum number of lines per subtitle. If you wish, you can alter these parameters via the transcription config.

Legacy API

If you want to re-use a client implementation that already uses the Speechmatics V1 SaaS API (app.speechmatics.com), then you can use the legacy request parameters to do this. The sample request shown above would look like this using the legacy API parameters:

curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F data_file=@example.wav \
  -F model=en

The following legacy parameters are available with the Batch Virtual Appliance:

  • model: Language model used to process the job. Replaced by the language property in the config object.
  • notification: How you would like to be notified of your job finishing. The email notification type is not supported by the Batch Virtual Appliance.
  • callback: If set, and notification is set to 'callback', the appliance will make a POST request to this URL when the job completes.
  • callback_format: The format to be used by the appliance for the callback POST request. Available formats are: 'srt', 'txt', 'json' and 'json-v2'.
  • meta: Metadata about the job you would like to be able to view later.
  • diarization: Controls whether speaker diarization is used. Replaced by the "diarization": "speaker" property in the config object.
  • diarisation: A synonym for 'diarization'. See above.

It's recommended that you use the config object to pass the job configuration; the other methods of specifying job configuration will be deprecated at some point in the future. However, if you want to pass metadata with the job, or use a notification callback, you can do so using the legacy API parameters.

Passing metadata and using callback notifications

The next sections describe how to pass metadata and use notification callbacks using the legacy API in the event you need to use this functionality. These features will be covered by additional parameters in the config object in a future release.

Passing Metadata

You can use the meta parameter in the legacy API to associate metadata with the job, and use this for tracking the job through your workflow. For instance, you can use this to associate your own asset tag or job number, and retrieve it later on when you process the JSON transcript.

curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F data_file=@example.wav \
  -F 'meta'='asset-id=29309231123' \
  -F 'config={ "type": "transcription", "transcription_config": { "language": "en" } }'

You'll then see meta information when you query the job, or retrieve the (JSON) transcript:

 {
    "format": "2.4",
    "job": {
        "created_at": "Tue Nov 19 17:34:41 2019",
        "duration": 383,
        "id": 4,
        "lang": "en",
        "meta": "asset-id=29309231123",
        "name": "en.mp3",
        "user_id": 1
    },
    "metadata": {
        "created_at": "2019-11-19T17:36:04.525Z",
        "transcription_config": {
            "language": "en"
        },
        "type": "transcription"
    },

Callbacks Usage

If you want to trigger a callback, so that you don't have to keep polling the jobs endpoint, you can do so by using the notification and callback parameters in the legacy API. The Batch Virtual Appliance will then send a POST request to an HTTP server once the job is complete; typically, you would run a service on that HTTP server that listens for these POST events and then processes the transcription (for example by writing it into a database, or copying it to a file for further processing). Here is an example of how to set up a callback:

curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F data_file=@example.wav \
  -F model=en \
  -F notification=callback \
  -F callback=http://www.example.com/transcript_callback \
  -F callback_format=txt

The callback appends the job ID as a query string parameter with name id. As an example, if the job ID is 546, you'd see the following POST request:

POST /transcript_callback?id=546 HTTP/1.1
Host: www.example.com

The user agent is Speechmatics-API/1.0.
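
To see exactly what the appliance sends, you can point the callback at a throwaway listener on a host the appliance can reach. This is a rough sketch only; the host name and port are placeholders, and the exact flags depend on your netcat variant:

# On the receiving host, listen on port 8080 and print the incoming request
nc -l 8080        # some netcat variants require: nc -l -p 8080

# Then submit the job with the callback pointing at that host, for example:
#   -F callback=http://callback-host.example.com:8080/transcript_callback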

Note: For future compatibility we recommend that you use the config object, as new features will require it.

SubRip Subtitling Format

SubRip (SRT) is a subtitling format that can be used to generate subtitles for video content or other workflows. Our SRT output generates a transcript together with corresponding alignment timestamps. The default line length and number of lines follow best practice as recommended by major broadcasters.

Speechmatics provides default values for SRT output for both the number of lines and the line length in characters. You can change these defaults by passing the configuration options described below in the job configuration; a full example follows the parameter list:

{
  "type": "transcription",
  "transcription_config": {
    ...
  },
  "output_config": {
    "srt_overrides": {
      "max_line_length": 37,
      "max_lines": 2
    }
  }
}

  • max_line_length: sets the maximum number of characters per subtitle line, including white space (default: 37).
  • max_lines: sets the maximum number of lines in a subtitle section (default: 2).
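
Putting this together, a job can be submitted with the SRT overrides included in its config object, and the subtitles retrieved once it is done. A sketch using the example values above (job ID 111 is a placeholder):

curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
  -H 'Content-Type: multipart/form-data' \
  -H 'Accept: application/json' \
  -F data_file=@example.wav \
  -F 'config={ "type": "transcription", "transcription_config": { "language": "en" }, "output_config": { "srt_overrides": { "max_line_length": 37, "max_lines": 2 } } }'

curl -X GET "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=srt" -o example.srt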