Example Connection to the API

The WebSocket Speech API aligns with other Speechmatics platforms such as the Batch Virtual Appliance and Speechmatics SaaS.

WebSocket URI

To use the V2.7 API, connect to the '/v2' endpoint of the URI, for example:

wss://rt-asr.example.com:9000/v2

Session Configuration

The V2 API is configured by sending a StartRecognition message as soon as the WebSocket connection is established. The format of this message is designed to be very similar to the config.json object already used by the Speechmatics batch platforms (Batch Virtual Appliance, Batch Container and SaaS). The transcription_config section of the message should be almost identical between the two modes. There are some minor differences (for example, batch offers a different set of diarization options, and real-time has some settings, such as max_delay, which do not apply to batch).

TranscriptionConfig

A transcription_config structure is used to specify configuration values for the recognition engine when the StartRecognition message is sent to the server. All values apart from language are optional. Here's an example of the StartRecognition message with this structure:

{
    "message": "StartRecognition",
    "transcription_config": {
        "language": "en"
    },
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_f32le",
        "sample_rate": 16000
    }
}
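
As noted above, real-time-only settings such as max_delay can also be included in transcription_config. A possible variant of the same message, assuming your appliance or container version supports the max_delay and enable_partials settings, might look like this:

{
    "message": "StartRecognition",
    "transcription_config": {
        "language": "en",
        "max_delay": 4,
        "enable_partials": true
    },
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_f32le",
        "sample_rate": 16000
    }
}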

AddAudio

Once the WebSocket session is set up and you've successfully sent StartRecognition, you'll receive a RecognitionStarted message from the server. You can then send the binary audio chunks, which we refer to as AddAudio messages.

In the V2 API, sending audio requires only very simple code:

// Send an audio chunk to the server as a binary WebSocket message (an AddAudio message).
// ws is the open WebSocket connection and seqNoIn counts the chunks sent so far.
function addAudio(audioData) {
    ws.send(audioData);
    seqNoIn++;
}

The Speechmatics Real-time Speech API will tolerate no more than 10 seconds of audio data or 500 individual AddAudio messages sent ahead of time. If you send more than this, you will not receive an AudioAdded response until there is capacity in the buffer. This prevents any degradation in latency and system performance.

If you have implemented your own client-side solution and/or wrapper, one possible blocking implementation of the rate-limiting is a semaphore of size 500, acquired before sending each AddAudio message, and released after receiving any AudioAdded message. Make sure receiving messages runs in another thread or uses some other mechanism to avoid getting blocked by the semaphore.
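
For example, in a JavaScript client the same idea can be expressed with a promise-based counting semaphore. The following is a minimal sketch (the names MAX_IN_FLIGHT, acquireSlot, releaseSlot and addAudioRateLimited are illustrative); releaseSlot() would be called from the handler that processes AudioAdded messages:

// Counting semaphore sized to the 500-message limit described above.
const MAX_IN_FLIGHT = 500;
let slots = MAX_IN_FLIGHT;
const waiters = [];

function acquireSlot() {
    if (slots > 0) {
        slots--;
        return Promise.resolve();
    }
    // No capacity left: wait until an AudioAdded acknowledgement frees a slot.
    return new Promise(function (resolve) { waiters.push(resolve); });
}

function releaseSlot() {
    const next = waiters.shift();
    if (next) {
        next();      // hand the freed slot straight to a waiting sender
    } else {
        slots++;
    }
}

async function addAudioRateLimited(audioData) {
    await acquireSlot();
    ws.send(audioData);   // the AddAudio message is just the binary audio chunk
    seqNoIn++;
}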

Final and Partial Transcripts

The AddTranscript and AddPartialTranscript messages from the server output a JSON format which aligns with the JSON output format used by other Speechmatics products. There is now a results list which contains the transcribed words and punctuation marks along with timings and confidence scores. Here's an example of a final transcript output:

{
   "message":"AddTranscript",
   "results":[
      {
         "start_time":0.11670026928186417,
         "end_time":0.4049381613731384,
         "alternatives":[
            {
               "content":"gale",
               "confidence":0.7034434080123901
            }
         ],
         "type":"word"
      },
      {
         "start_time":0.410246878862381,
         "end_time":0.6299981474876404,
         "alternatives":[
            {
               "content":"eight",
               "confidence":0.670033872127533
            }
         ],
         "type":"word"
      },
      {
         "start_time":0.6599999666213989,
         "end_time":1.0799999237060547,
         "alternatives":[
            {
               "content":"becoming",
               "confidence":1.0
            }
         ],
         "type":"word"
      },
      {
         "start_time":1.0799999237060547,
         "end_time":1.6154180765151978,
         "alternatives":[
            {
               "content":"cyclonic",
               "confidence":1.0
            }
         ],
         "type":"word"
      },
      {
         "start_time":1.6154180765151978,
         "is_eos":true,
         "end_time":1.6154180765151978,
         "alternatives":[
            {
               "content":".",
               "confidence":1.0
            }
         ],
         "type":"punctuation"
      }
   ],
   "metadata":{
      "transcript":"gale eight becoming cyclonic.",
      "start_time":190.65994262695312,
      "end_time":194.46994256973267
   },
   "format":"2.7"
}

You can use the metadata.transcript property to get the complete final transcript as a chunk of plain text. The format property describes the exact version of the transcription output format. This may change in future releases if the output format is updated.
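
For example, a client that has parsed an AddTranscript message (as in the JavaScript example later in this document) could accumulate the plain-text transcript like this; the function and variable names here are illustrative:

var fullTranscript = "";

// msg is a parsed AddTranscript message object.
function appendFinalTranscript(msg) {
    fullTranscript += msg.metadata.transcript;
    console.log("Transcript so far (format " + msg.format + "): " + fullTranscript);
}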

Example Usage

This section provides some client code samples that show simple usage of the V2 WebSockets Speech API. It shows how you can test your Real-time Appliance or Container using a minimal WebSocket client.

JavaScript

This section shows the basic usage of the WebSockets interface from a JavaScript client. First, you set up the connection to the server and define the event handlers that are required:

var ws = new WebSocket('wss://rta:9000/v2');
ws.binaryType = "arraybuffer";
ws.onopen = function(event) { onOpen(event) };
ws.onmessage = function(event) { onMessage(event) };
ws.onclose = function(event) { onClose(event) };
ws.onerror = function(event) { onError(event) };

In the above example, the hostname of the Real-time Appliance or Container is rta – change this to match the IP address or hostname of your Real-time Appliance or Container. The port used is 9000 and you need to make sure that you add '/v2' to the WebSocket URI. Note that the Real-time Appliance only supports the secure WebSocket (wss) protocol, whereas the Real-time Container only supports the unencrypted WebSocket (ws) protocol. You should also ensure that the binaryType property of the WebSocket object is set to "arraybuffer".
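
For a Real-time Container the connection would therefore use the ws scheme instead. A possible variant, assuming a container reachable at the hostname rtc (as in the CLI example later in this document):

// Connecting to a Real-time Container: note ws:// rather than wss://.
var ws = new WebSocket('ws://rtc:9000/v2');
ws.binaryType = "arraybuffer";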

In the onopen handler you initiate the session by sending the StartRecognition message to the server, for example:

function onOpen(evt) {
    var msg = {
        "message": "StartRecognition",
        "transcription_config": {
            "language": "en",
            "output_locale": "en-GB"
        },
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": 16000
        }
    };

    ws.send(JSON.stringify(msg));
}

An onmessage handler is where you will respond to the server-initiated messages sent by the appliance or container, and decide how to handle them. Typically, this involves implementing functions to display or process data that you get back from the server.

function onMessage(evt) {
    var objMsg = JSON.parse(evt.data);

    switch (objMsg.message) {
        case "RecognitionStarted":
            recognitionStarted(objMsg); // TODO
            break;

        case "AudioAdded":
            audioAdded(objMsg);  // TODO
            break;

        case "AddPartialTranscript":
        case "AddTranscript":
            transcriptOutput(objMsg);  // TODO
            break;

        case "EndOfTranscript":
            endTranscript();  // TODO
            break;

        case "Info":
        case "Warning":
        case "Error":
            showMessage(objMsg);  // TODO
            break;

        default:
            console.log("UNKNOWN MESSAGE: " + objMsg.message);
    }
}

Once the WebSocket is initialized, the StartRecognition message is sent to the appliance or container to set up the audio input. It is then a matter of sending audio data periodically using AddAudio messages.

Your AddAudio message will take audio from a source (for example microphone input, or an audio stream) and pass it to the Real-time Appliance or Container.

// Send audio data to the API as a binary AddAudio message.
function addAudio(audioData) {
   ws.send(audioData);
   seqNoIn++;
}

In this example we use a counter, seqNoIn, to keep track of the AddAudio messages we have sent.
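
As a minimal sketch, the audioAdded handler referenced in onMessage above could compare this counter against the acknowledgements coming back from the server. This assumes the AudioAdded message carries a seq_no field, as described in the API reference; seqNoAcked is an illustrative name:

var seqNoAcked = 0;   // seqNoIn is the counter incremented in addAudio() above

function audioAdded(msg) {
    seqNoAcked = msg.seq_no;
    // The difference is the number of AddAudio messages still unacknowledged,
    // which is useful for client-side rate limiting.
    console.log((seqNoIn - seqNoAcked) + " AddAudio messages in flight");
}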

A set of server-initiated transcript messages is sent to indicate the availability of transcribed text:

  • AddTranscript
  • AddPartialTranscript

See above for changes to the JSON output schema in the V2 API. For full details of the output schema refer to the AddTranscript section in the API reference.

Finally, the client should send an EndOfStream message and close the WebSocket when it terminates. This releases resources on the appliance or container and allows other clients to connect.
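
A minimal sketch of this shutdown sequence, assuming the last_seq_no field follows the schema in the API reference (endSession is an illustrative name; endTranscript is the handler referenced in onMessage above):

// Tell the server that no more audio will be sent; seqNoIn is the counter from addAudio().
function endSession() {
    ws.send(JSON.stringify({
        "message": "EndOfStream",
        "last_seq_no": seqNoIn
    }));
}

// Called when the EndOfTranscript message arrives, after the last final transcript.
function endTranscript() {
    ws.close();
}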

The Mozilla Developer Network provides a useful reference for the WebSocket API.

Python Libraries

For all Speechmatics' supported Real-time products, you can use a Python library called speechmatics-python, which is available from Speechmatics if you require it.

The speechmatics-python library can be incorporated into your own applications, used as a reference for your own client library, or called directly from the command line (CLI) like this (to pass a test audio file to the appliance or container):

speechmatics transcribe --url ws://rtc:9000/v2 --lang en --operating-point enhanced --ssl-mode none test.mp3

Note that configuration options are specified on the command line as parameters, with each '_' character in the configuration option name replaced by a '-' (see the example after the help command below). The CLI also accepts an audio stream on standard input, which means that you can stream in a live microphone feed. To get help on the CLI use the following command:

speechmatics transcribe --help
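
As an example of the option mapping mentioned above, if your version of the CLI supports the max_delay setting, it is passed as --max-delay:

speechmatics transcribe --url ws://rtc:9000/v2 --lang en --max-delay 2 --ssl-mode none test.mp3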

The library depends on Python 3.7 or above, since it makes use of some of the newer asyncio features introduced with Python 3.7.