The V2 WebSocket Speech API aligns with other Speechmatics platforms such as the Batch Virtual Appliance and Speechmatics Cloud Offering.
To use the V2 API, specify the '/v2' endpoint in the URI, for example:
ws://rt-asr.example.com:9000/v2
If you are using the Real-time Container then you will need to use the ws:// scheme, as in the example above. If you need to access the Real-time Container over a secure WebSocket connection from your client, you will need to consider SSL offload using a load balancer or similar.
The V2 API is configured by sending a StartRecognition message when the WebSocket connection is first opened. We have designed the format of this message to be very similar to the config.json object used with the Speechmatics batch mode platforms (Batch Virtual Appliance, Batch Container and Cloud Offering). The transcription_config section of the message should be almost identical between the two modes. There are some minor differences: for example, batch features a different set of diarization options, and real-time features some settings which don't apply to batch, such as max_delay.
A transcription_config structure is used to specify various configuration values for the recognition engine when the StartRecognition message is sent to the server. All values apart from language are optional. Here's an example of a StartRecognition message using this structure:
{
  "message": "StartRecognition",
  "transcription_config": {
    "language": "en"
  },
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 16000
  }
}
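The real-time-only settings mentioned earlier go in the same transcription_config section. The sketch below is illustrative rather than definitive: the max_delay and enable_partials values shown are example choices, so check the API reference for the options supported by your engine version.
// Sketch only: max_delay and enable_partials are real-time settings;
// the values shown here are illustrative, not recommendations.
var msg = {
  "message": "StartRecognition",
  "transcription_config": {
    "language": "en",
    "max_delay": 5,          // maximum delay, in seconds, before a final transcript is emitted
    "enable_partials": true  // also send AddPartialTranscript messages
  },
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 16000
  }
};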
Once the WebSocket session is set up and you've successfully called StartRecognition, you'll receive a RecognitionStarted message from the server. You can then send the binary audio chunks, which we refer to as AddAudio messages.
In the V2 API, sending audio requires only a simple call:
// NEW V2 EXAMPLE
function addAudio(audioData) {
  ws.send(audioData);
  seqNoIn++;
}
We recommend that you do not send more than 10 seconds of audio data or 500 individual AddAudio messages ahead of time.
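One way to stay within these limits is to track how many AddAudio messages are still unacknowledged. The sketch below assumes that each AudioAdded acknowledgement from the server carries a seq_no field (check the API reference for the exact schema):
// Sketch of simple flow control. seqNoIn counts AddAudio messages sent;
// seqNoOut records the last seq_no acknowledged by an AudioAdded message.
var seqNoIn = 0;
var seqNoOut = 0;
var MAX_OUTSTANDING = 500; // matches the recommendation above

function canSendMoreAudio() {
  return (seqNoIn - seqNoOut) < MAX_OUTSTANDING;
}

// Call this whenever an AudioAdded message is received from the server.
function audioAdded(msg) {
  seqNoOut = msg.seq_no; // assumes AudioAdded includes a seq_no field
}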
The AddTranscript and AddPartialTranscript messages from the server output a JSON format which aligns with the JSON output format used by other Speechmatics products. There is now a results list which contains the transcribed words and punctuation marks, along with timings and confidence scores. Here's an example of a final transcript output:
{
  "message": "AddTranscript",
  "results": [
    {
      "start_time": 0.11670026928186417,
      "end_time": 0.4049381613731384,
      "alternatives": [
        {
          "content": "gale",
          "confidence": 0.7034434080123901
        }
      ],
      "type": "word"
    },
    {
      "start_time": 0.410246878862381,
      "end_time": 0.6299981474876404,
      "alternatives": [
        {
          "content": "eight",
          "confidence": 0.670033872127533
        }
      ],
      "type": "word"
    },
    {
      "start_time": 0.6599999666213989,
      "end_time": 1.0799999237060547,
      "alternatives": [
        {
          "content": "becoming",
          "confidence": 1.0
        }
      ],
      "type": "word"
    },
    {
      "start_time": 1.0799999237060547,
      "end_time": 1.6154180765151978,
      "alternatives": [
        {
          "content": "cyclonic",
          "confidence": 1.0
        }
      ],
      "type": "word"
    },
    {
      "start_time": 1.6154180765151978,
      "is_eos": true,
      "end_time": 1.6154180765151978,
      "alternatives": [
        {
          "content": ".",
          "confidence": 1.0
        }
      ],
      "type": "punctuation"
    }
  ],
  "metadata": {
    "transcript": "gale eight becoming cyclonic.",
    "start_time": 190.65994262695312,
    "end_time": 194.46994256973267
  },
  "format": "2.6"
}
You can use the metadata.transcript property to get the complete final transcript as a chunk of plain text. The format property describes the exact version of the transcription output format, which is currently 2.6. This may change in future releases if the output format is updated.
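As a hedged sketch, a client could build a running transcript from metadata.transcript, appending finals as they arrive and displaying the latest partial separately (the function and variable names here are illustrative, and the space separator is a simplification):
// Sketch: accumulate final transcript text and keep the most recent partial.
// Names are illustrative; the space separator is a simplification.
var finalTranscript = "";
var latestPartial = "";

function handleTranscript(msg) {
  if (msg.message === "AddTranscript") {
    finalTranscript += msg.metadata.transcript + " ";
    latestPartial = "";
  } else if (msg.message === "AddPartialTranscript") {
    latestPartial = msg.metadata.transcript;
  }
  console.log(finalTranscript + latestPartial);
}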
Speechmatics supports two different models within each language pack: a standard and an enhanced model. The standard model is the faster of the two, whilst the enhanced model provides higher accuracy at the cost of a slower turnaround time.
The enhanced model is a premium feature. Please contact your account manager or Speechmatics if you would like access to it.
An example of requesting the enhanced model is shown below:
{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 16000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced"
  }
}
Please note: standard, as well as being the default option, can also be explicitly requested with the operating_point parameter.
Some language models (Arabic, Danish, Dutch, English, French, German, Malay, Spanish, Swedish and Turkish currently) support advanced punctuation. This uses machine learning techniques to add more naturalistic punctuation, improving the readability of your transcripts. As well as placing punctuation marks in more natural positions in the output, additional punctuation marks such as commas (,), exclamation marks (!) and question marks (?) will also appear.
There is no need to explicitly enable this in the configuration; languages that support advanced punctuation will automatically output these marks. If you do not want to see these punctuation marks in the output, then you can explicitly control this through the punctuation_overrides setting within the transcription_config object, for example:
"transcription_config": {
"language": "en",
"punctuation_overrides": {
"permitted_marks": [ "." ]
}
}
Note that changing the punctuation settings from their defaults can take a couple of seconds to apply. This means that if you send a StartRecognition message with non-default punctuation settings (for example a non-default sensitivity), there will be a slight delay (2-3 seconds) before the RecognitionStarted message is sent back.
The JSON output places punctuation marks in the results list, marked with a type of "punctuation", so you can filter the output if you want to modify or remove punctuation.
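As a hedged sketch, you could rebuild the text of an AddTranscript message without punctuation by keeping only results whose type is "word" (joining with spaces is a simplification):
// Sketch: extract the words from an AddTranscript message, skipping
// punctuation results. Joining with spaces is a simplification.
function wordsOnly(msg) {
  return msg.results
    .filter(function(r) { return r.type === "word"; })
    .map(function(r) { return r.alternatives[0].content; })
    .join(" ");
}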
This section provides some client code samples that show simple usage of the V2 WebSocket Speech API, and how you can test your Real-Time Appliance or Container using a minimal WebSocket client.
The basic usage of the WebSocket interface from a JavaScript client is shown below. First, you set up the connection to the server and define the various event handlers that are required:
var ws = new WebSocket('ws://rtc:9000/v2');
ws.binaryType = "arraybuffer";
ws.onopen = function(event) { onOpen(event) };
ws.onmessage = function(event) { onMessage(event) };
ws.onclose = function(event) { onClose(event) };
ws.onerror = function(event) { onError(event) };
Change the hostname in the above example to match the IP address or hostname of your Real-Time Appliance or Container. The port used is 9000, and you need to make sure that you add '/v2' to the WebSocket URI. Note that the Real-time Container only supports the WebSocket (ws) protocol. You should also ensure that the binaryType property of the WebSocket object is set to "arraybuffer".
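The onclose and onerror handlers registered above are not defined elsewhere in this example; a minimal sketch could be:
// Minimal sketches of the close and error handlers registered above.
function onClose(evt) {
  // evt.code and evt.reason can help diagnose why the session ended.
  console.log("WebSocket closed: " + evt.code + " " + evt.reason);
}

function onError(evt) {
  console.log("WebSocket error", evt);
}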
In the onopen handler you initiate the session by sending the StartRecognition message to the server, for example:
function onOpen(evt) {
  var msg = {
    "message": "StartRecognition",
    "transcription_config": {
      "language": "en",
      "output_locale": "en-GB"
    },
    "audio_format": {
      "type": "raw",
      "encoding": "pcm_s16le",
      "sample_rate": 16000
    }
  };
  ws.send(JSON.stringify(msg));
}
An onmessage handler is where you respond to the server-initiated messages sent by the appliance or container, and decide how to handle them. Typically, this involves implementing functions to display or process the data that you get back from the server.
function onMessage(evt) {
  var objMsg = JSON.parse(evt.data);
  switch (objMsg.message) {
    case "RecognitionStarted":
      recognitionStarted(objMsg); // TODO
      break;
    case "AudioAdded":
      audioAdded(objMsg); // TODO
      break;
    case "AddPartialTranscript":
    case "AddTranscript":
      transcriptOutput(objMsg); // TODO
      break;
    case "EndOfTranscript":
      endTranscript(); // TODO
      break;
    case "Info":
    case "Warning":
    case "Error":
      showMessage(objMsg); // TODO
      break;
    default:
      console.log("UNKNOWN MESSAGE: " + objMsg.message);
  }
}
Once the WebSocket is initialized, the StartRecognition message is sent to the appliance or container to set up the audio input. It is then a matter of sending audio data periodically using the AddAudio message.
Your AddAudio message will take audio from a source (for example microphone input, or an audio stream) and pass it to the Real-Time Appliance or Container.
// Send audio data to the API using the AddAudio message.
function addAudio(audioData) {
  ws.send(audioData);
  seqNoIn++;
}
In this example we use a counter, seqNoIn, to keep track of the number of AddAudio messages we've sent.
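As a hedged sketch of feeding microphone input into addAudio, the example below uses the Web Audio API to capture audio and convert it to 16-bit PCM, to match the pcm_s16le / 16000 Hz audio_format requested earlier. It uses the deprecated ScriptProcessorNode for brevity, and the sample-rate handling is simplified; a production client should resample if the AudioContext does not run at 16 kHz.
// Sketch: capture microphone audio and send it as AddAudio messages.
// Assumes the audio_format in StartRecognition was pcm_s16le at 16000 Hz.
navigator.mediaDevices.getUserMedia({ audio: true }).then(function(stream) {
  var audioCtx = new AudioContext({ sampleRate: 16000 }); // not all browsers honour this rate
  var source = audioCtx.createMediaStreamSource(stream);
  var processor = audioCtx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = function(event) {
    var float32 = event.inputBuffer.getChannelData(0);
    var int16 = new Int16Array(float32.length);
    for (var i = 0; i < float32.length; i++) {
      var s = Math.max(-1, Math.min(1, float32[i]));
      int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    addAudio(int16.buffer); // send the raw PCM bytes as an AddAudio message
  };
  source.connect(processor);
  processor.connect(audioCtx.destination);
});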
A set of server-initiated transcript messages are triggered to indicate the availability of transcribed text:
AddTranscript
AddPartialTranscript
See above for changes to the JSON output schema in the V2 API. For full details of the output schema refer to the AddTranscript section in the API reference.
Finally, the client should send an EndOfStream message and close the WebSocket when it terminates. This should be done in order to release resources on the appliance or container and allow other clients to connect and use resources.
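As a hedged sketch, and assuming the EndOfStream message's last_seq_no field carries the number of AddAudio messages sent (check the API reference for the exact schema), the end of a session could look like this:
// Sketch: finish the session cleanly. last_seq_no is the number of
// AddAudio messages sent so far (the seqNoIn counter used above).
function endOfStream() {
  ws.send(JSON.stringify({ "message": "EndOfStream", "last_seq_no": seqNoIn }));
}

// Close the WebSocket once the server confirms with EndOfTranscript.
function endTranscript() {
  ws.close();
}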
The Mozilla developer network provides a useful reference to the WebSocket API.
If you are using the Real-Time Container, you can use a Python library called speechmatics-python. Please contact support@speechmatics.com if you require this library. You can also use this library with the Real-Time Virtual Appliance.
The speechmatics-python library can be incorporated into your own applications, used as a reference for your own client library, or called directly from the command line (CLI) like this (to pass a test audio file to the appliance or container):
speechmatics transcribe --url ws://rtc:9000/v2 --lang en --ssl-mode none test.mp3
Note that configuration options are specified on the command line as parameters, with any '_' character in the configuration option name replaced by '-'. The CLI also accepts an audio stream on standard input, meaning that you can stream in a live microphone feed. To get help on the CLI use the following command:
speechmatics transcribe --help
The library depends on Python 3.7 or above, since it makes use of some of the newer asyncio features introduced with Python 3.7.