The Speech API is a REST API that enables you to create and manage transcription jobs by uploading audio files to the Speechmatics Batch Virtual Appliance, and downloading the resulting transcriptions.
If you are familiar with the Speechmatics SaaS then you will find the Speech API for the Batch Virtual Appliance very similar. With the Speech API on the Batch Virtual Appliance, however, there is no need to supply an API Authentication Token.
Although you do not need an API auth token, you do need to supply a User ID in the URL. This is a dummy value, maintained for compatibility with the V1 SaaS API: it can be any positive integer, and it can be used to track transcription requests, since the ID that you use is returned in any job response.
The base URI for the Speech API requests looks like this (HTTP):
http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/
or this (HTTPS):
https://${APPLIANCE_HOST}/v1/user/1/jobs/
Where ${APPLIANCE_HOST} is the IP address or hostname of the appliance you want to use. You must use port 8082 if you are using HTTP. If HTTPS is used then the default port 443 applies, so there is no need to specify it. A user ID of 1 is used for all the examples in this document (you can, however, use any positive integer value). This guide shows HTTP URLs; since appliance version 3.4.0 you can also use HTTPS URLs.
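The examples that follow reference the appliance host as a shell variable. A minimal sketch of setting this up is shown below; the address 203.0.113.10 is just a placeholder, and the assumption that a GET on the jobs collection lists existing jobs follows the V1 SaaS convention:
# Set the appliance host once so it can be reused in all of the examples
export APPLIANCE_HOST=203.0.113.10
# Quick connectivity check: list current jobs over HTTP (port 8082)
curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" -H 'Accept: application/json'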
You can access a dashboard on the appliance from any browser by navigating to the following URL:
http://${APPLIANCE_HOST}:8080/help
From here you can get to documentation links, and to Swagger UI pages which let you access the Management API and the Speech API. The Speech API can be used to easily submit and view jobs from a browser:
http://${APPLIANCE_HOST}:8082/v1
The Batch Virtual Appliance supports many of the same job settings as the V1 SaaS; however, the following fields are not supported and will produce a '501 Not Implemented' error if you attempt to use them: data_url, text_url, text_file and notification_email_address.
All examples in this document use curl to make the REST API call from a command line. We recommend using retry parameters, so that retry attempts can be made for at least one minute. With the curl command this is done with the --retry 5 --retry-delay 10 parameters. These have been omitted from the examples in this document for brevity.
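For reference, the submission example shown later in this guide would look like this with the recommended retry behaviour added:
curl --retry 5 --retry-delay 10 -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-F data_file=@example.wav \
-F 'config={ "type": "transcription", "transcription_config": { "language": "en" } }'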
Note: If you are using a self-signed certificate (your own, or the Speechmatics certificate that is used by default), then you will see a warning like this when using the curl command to access the Speech API using HTTPS:
curl: (60) SSL certificate problem: self signed certificate
We recommend, if you are going to use the secure Speech API, that you upload your own SSL certificate (signed by a CA) to the appliance, to avoid this problem. See the Install and Admin Guide for details of how to do this.
All successful API calls return the HTTP body in JSON format along with status code 200 OK. Users of curl will see this displayed in their terminal; other interfaces may need a JSON parser.
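If you are scripting against the API and want to branch on the status code, one possible sketch with curl is to write the body to a file and print just the code (the response.json filename here is arbitrary):
# Save the JSON body and print only the HTTP status code
curl -s -o response.json -w '%{http_code}\n' \
"http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
-H 'Accept: application/json'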
The simplest example to get going is to submit an audio file for transcription. This is done by making a POST request with the audio file and the language model you want to use:
curl -X POST "http://${APPLIANCE HOST}:8082/v1/user/1/jobs/" \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-F data_file=@example.wav \
-F 'config={ "type": "transcription", "transcription_config": { "language": "en" } }'
The data_file form field is used to submit the audio, and a config object is passed in with details of how to transcribe the file (in this case, using the English language model).
On successful submission of the job a 200 OK status will be returned by the appliance, along with JSON output showing the id of the job and an indication, in the check_wait property, of how many seconds to wait before checking that the transcription is done.
The response headers returned will look like this:
{
"date": "Wed, 24 Jul 2019 16:35:25 GMT",
"server": "nginx/1.15.10",
"connection": "keep-alive",
"content-length": "1479",
"content-type": "application/json"
}
And the body will comprise a JSON object like this:
{
"balance": 0,
"check_wait": 30,
"cost": 0,
"id": 111
}
Note: The balance and cost properties are not used by the Batch Virtual Appliance. They will always return zero values.
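Once check_wait seconds have passed you can poll the job until it is done. A minimal sketch, using the job id returned above (111) and assuming the job status endpoint follows the V1 SaaS convention of appending the job ID to the jobs URL (the Swagger UI on the appliance shows the definitive paths):
# Wait for the suggested interval, then ask for the job status
sleep 30
curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/" \
-H 'Accept: application/json'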
The config object parameter is used to pass information about transcription features into the appliance. This is the recommended approach for passing information about the transcription job to the appliance. The simplest config object is one where just the transcription language is specified, for example:
{"type": "transcription",
"transcription_config": {"language": "en"}
}
Examples of configurations for specific features are shown in this section. See the complete reference section for more detail.
There are two modes of diarization that can be used. The simplest, speaker diarization, aggregates all input channels into a single stream for processing, and picks out different speakers based on acoustic matching.
{"type": "transcription",
"transcription_config": {
"diarization": "speaker",
"language": "en"
}
}
The other diarization mode, channel diarization, processes multiple input channels or streams individually and applies a label to each (through use of the channel_diarization_labels list).
{"type": "transcription",
"transcription_config": {
"diarization": "channel",
"channel_diarization_labels": ["Agent", "Caller"],
"language": "en"
}
}
Speaker change detection allows changes in the speaker to be detected and then marked in the transcript. Typically it is used to make some changes in the user interface to indicate to the reader that someone else is talking. Detection of speaker change is done without detecting which segments were spoken by the same speaker. The config used to request speaker change detection looks like this:
{"type": "transcription",
"transcription_config": {
"diarization": "speaker_change",
"speaker_change_sensitivity": 0.8
}
}
Note: Speaker change is only recorded as JSON V2 output, so make sure you use the json-v2 format when you retrieve the transcript.
The speaker_change_sensitivity property, if used, must be a numeric value between 0 and 1. It indicates to the algorithm how sensitive to speaker change events you want to make it. A low value will mean that very few changes will be signalled (with higher possibility of false negatives), whilst a high value will mean you will see more changes in the output (with higher possibility of false positives). If this property is not specified, a default of 0.4 is used.
Speaker change elements in the results array appear like this:
{
"type": "speaker_change",
"start_time": 0.55,
"end_time": 0.55,
"alternatives": []
}
Note: Although there is an alternatives property in the speaker change element it is always empty, and can be ignored. The start_time and end_time properties are always identical, and provide the time when the change was detected.
A speaker change indicates where we think a different person has started talking. For example, if one person says "Hello James" and the other responds with "Hi", there should be a speaker_change element between "James" and "Hi":
{
"format": "2.4",
"job": {
....
"results": [
{
"start_time": 0.1,
"end_time": 0.22,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hello",
"language": "en",
"speaker": "UU"
}
]
},
{
"start_time": 0.22,
"end_time": 0.55,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "James",
"language": "en",
"speaker": "UU"
}
]
},
{
"start_time": 0.55,
"end_time": 0.55,
"type": "speaker_change",
"alternatives": []
},
{
"start_time": 0.56,
"end_time": 0.61,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hi",
"language": "en",
"speaker": "UU"
}
]
}
]
}
Speaker change can be combined with channel diarization. It will process channels separately and indicate in the output both the channels and the speaker changes. For example, if a two-channel audio file contains two people greeting each other (both recorded on the same channel), the config submitted with the audio can request speaker change detection:
{"type": "transcription",
"transcription_config": {
"diarization": "channel_and_speaker_change",
"speaker_change_sensitivity": 0.8
}
}
The output will have special elements in the results array between two words where a different person starts talking. For example, if one person says "Hello James" and the other responds with "Hi", there will be a speaker_change JSON element between "James" and "Hi".
{
"format": "2.4",
"job": {
....
},
"metadata": {
....
},
"results": [
{
"channel": "channel_1",
"start_time": 0.1,
"end_time": 0.22,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hello",
"language": "en",
"speaker": "UU"
}
]
},
{
"channel": "channel_1",
"start_time": 0.22,
"end_time": 0.55,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "James",
"language": "en",
"speaker": "UU"
}
]
},
{
"channel": "channel_1",
"start_time": 0.55,
"end_time": 0.55,
"type": "speaker_change",
"alternatives": []
},
{
"channel": "channel_1",
"start_time": 0.56,
"end_time": 0.61,
"type": "word",
"alternatives": [
{
"confidence": 0.71,
"content": "Hi",
"language": "en",
"speaker": "UU"
}
]
}
]
}
The Custom Dictionary feature can be accessed through the additional_vocab property. This is a list of custom words or phrases that should be recognized. Custom Dictionary Sounds is an extension to this to allow alternative pronunciations to be specified in order to aid recognition, or provide for alternative transcriptions.
{"type": "transcription",
"transcription_config": {
"language": "en",
"additional_vocab": [
{ "content": "Ciarán", "sounds_like": [ "Kieran" ] },
{ "content": "FA", "sounds_like": [ "effeh" ] },
"speechmatics"
]
}
}
You can specify up to 1000 words or phrases per job in your custom dictionary.
Note: For the additional_vocab property you can provide the list of alternative pronunciations using an array of sounds_like words. In the simple case where there is no difference in pronunciation, you can conflate the content and sounds_like fields into a single bareword or phrase (for example "speechmatics"). The words in sounds_like must not contain whitespace characters. The additional_vocab property should be used for words that do not exist in the dictionary (in other words, they are 'out of vocabulary'). The Custom Dictionary feature is designed to be used in environments where there is contextual information available (proper names, technical terms or other unusual words) that is likely to be out-of-vocabulary.
It is possible to specify the spelling rules to be used when generating the transcription, based on locale. The output_locale configuration setting is used for this. As an example, the following configuration uses the Global English (en) language pack with an output locale of British English (en-GB):
{ "type": "transcription",
"transcription_config": {
"language": "en",
"output_locale": "en-GB"
}
}
Currently, Global English is the only language pack that supports different output locales. The three locales that are available in this release are British English (en-GB), US English (en-US) and Australian English (en-AU).
If no locale is specified then the ASR engine will use whatever spelling it has learnt as part of our language model training (in other words it will be based on the training data used).
Some language models (English, French, German and Spanish currently) now support advanced punctuation. This uses machine learning techniques to add in more naturalistic punctuation to make the transcript more readable. As well as putting punctuation marks in more naturalistic positions in the output, additional punctuation marks such as commas (,) and exclamation and question marks (!, ?) will also appear.
There is no need to explicitly enable this in the job configuration; languages that support advanced punctuation will automatically output these marks. If you do not want to see these punctuation marks in the output, then you can explicitly control this through the punctuation_overrides settings in the config object, for example:
"transcription_config": {
"language": "en",
"punctuation_overrides": {
"permitted_marks":[ ".", "," ]
}
}
Both plain text and JSON output support punctuation. JSON output places punctuation marks in the results list, marked with a type of "punctuation", so you can also filter on the output if you want to modify or remove punctuation.
Sample JSON output (json-v2 only) containing punctuation looks like this:
{
"alternatives": [
{
"confidence": 1,
"content": ",",
"language": "en",
"speaker": "UU"
}
],
"attaches_to": "previous",
"end_time": 10.15,
"is_eos": false,
"start_time": 10.15,
"type": "punctuation"
}
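If you want to strip these punctuation elements out before further processing, one possible sketch uses jq (assuming jq is installed, and that the json-v2 transcript has been saved as transcript.json, a hypothetical filename):
# Remove all elements of type "punctuation" from the results array
jq '.results |= map(select(.type != "punctuation"))' transcript.json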
If you specify the punctuation_overrides element for languages that do not yet support advanced punctuation then it is ignored.
The default job transcription output format is json, which is the same format used by the Speechmatics SaaS. The json-v2 output format is also available with a richer set of information: you should use this format when you want to use newer features such as Channel Diarization. It is also possible to use a plain text transcription output if you do not need timing information.
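A sketch of retrieving a finished transcript in a particular format is shown below. It assumes the transcript endpoint sits under the job URL and accepts a format query parameter, as in the V1 SaaS API; check the Swagger UI on the appliance for the definitive path:
# Retrieve the transcript for job 111 in the richer json-v2 format
curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=json-v2" \
-H 'Accept: application/json'
# Or as plain text, without timing information
curl -s "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/111/transcript?format=txt"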
If you want to re-use a client implementation that already uses the Speechmatics V1 SaaS API (app.speechmatics.com), then you can use the legacy request parameters to do this. The sample request shown above would look like this using the legacy API parameters:
curl -X POST "http://${APPLIANCE HOST}:8082/v1/user/1/jobs/" \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-F data_file=@example.wav \
-F model=en
The following legacy parameters are available with the Batch Virtual Appliance:
Parameter | Description | Notes |
---|---|---|
model | Language model used to process the job. | Replaced by the language property in the config object. |
notification | How you would like to be notified of your job finishing. | The email notification type is not supported by the Batch Virtual Appliance. |
callback | If set, and notification is set to 'callback', the appliance will make a POST request to this URL when the job completes. | |
callback_format | The format to be used by the appliance for the callback POST request. | Available formats are: 'txt', 'json' and 'json-v2'. |
meta | Metadata about the job you would like to be able to view later. | |
diarization | Controls whether speaker diarization is used. | Replaced by the "diarization": "speaker" property in the config object. |
diarisation | A synonym for 'diarization'. | See above. |
It's recommended that you use the config object to pass the job configuration; the other methods of specifying job configuration will be deprecated at some point in the future. However, if you want to pass metadata with the job, or use a notification callback then you can do so using the legacy API parameters.
The next sections describe how to pass metadata and use notification callbacks using the legacy API in the event you need to use this functionality. These features will be covered by additional parameters in the config object in a future release.
You can use the meta parameter in the legacy API to associate metadata with the job, and use this for tracking the job through your workflow. For instance, you can use this to associate your own asset tag or job number, and retrieve it later on when you process the JSON transcript.
curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-F data_file=@example.wav \
-F 'meta'='asset-id=29309231123' \
-F 'config={ "type": "transcription", "transcription_config": { "language": "en" } }'
You'll then see the meta information when you query the job, or retrieve the (JSON) transcript:
{
"format": "2.4",
"job": {
"created_at": "Tue Nov 19 17:34:41 2019",
"duration": 383,
"id": 4,
"lang": "en",
"meta": "asset-id=29309231123",
"name": "en.mp3",
"user_id": 1
},
"metadata": {
"created_at": "2019-11-19T17:36:04.525Z",
"transcription_config": {
"language": "en"
},
"type": "transcription"
},
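If you save the JSON transcript to a file, the metadata can be pulled back out with, for example, jq (assuming it is installed; transcript.json is a hypothetical filename):
# Print the metadata string that was submitted with the job
jq -r '.job.meta' transcript.json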
If you want to trigger a callback, so that you don't have to keep polling the jobs endpoint, you can do so by using the notification and callback parameters in the legacy API. This ensures that the Batch Appliance will send a POST to an HTTP server once the job is complete; typically, you would maintain a service running on that HTTP server that listens for these POST events and then performs some action to process the transcription (for example by writing it into a database, or copying the transcription to a file for further processing). Here is an example of how to set up a callback:
curl -X POST "http://${APPLIANCE_HOST}:8082/v1/user/1/jobs/" \
-H 'Content-Type: multipart/form-data' \
-H 'Accept: application/json' \
-F data_file=@example.wav \
-F model=en \
-F notification=callback \
-F callback=http://www.example.com/transcript_callback \
-F callback_format=txt
The callback appends the job ID as a query string parameter with name id. As an example, if the job ID is 546, you'd see the following POST request:
POST /transcript_callback?id=546 HTTP/1.1
Host: www.example.com
The user agent is Speechmatics-API/1.0.
Note: For future compatibility we recommend that you use the config object, as new features will require it.