
Examples of how to use the V2 API

The V2 WebSocket Speech API aligns with other Speechmatics platforms such as the Batch Virtual Appliance and Speechmatics SaaS.

WebSocket URI

To use the V2 API, connect to the '/v2' endpoint of the URI, for example:

wss://rt-asr.example.com:9000/v2
WebSocket Schemes

If you are using the Real-Time Container then you will need to use the ws:// scheme, for example: ws://rt-asr.example.com:9000/v2. If you need to access the Real-Time Container over a secure WebSocket connection from your client, then you will need to consider SSL offload from a load balancer or similar.
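For illustration, a minimal JavaScript sketch of the two schemes (the hostname and port are placeholders):

// Real-Time Appliance: only the secure WebSocket (wss) scheme is supported.
var applianceWs = new WebSocket('wss://rt-asr.example.com:9000/v2');

// Real-Time Container: only the plain WebSocket (ws) scheme is supported;
// use wss only if an SSL-offloading load balancer sits in front of it.
var containerWs = new WebSocket('ws://rt-asr.example.com:9000/v2');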

Session Configuration

The V2 API is configured by sending a StartRecognition message when the WebSocket connection is first established. We have designed the format of this message to be very similar to the config.json object used with the Speechmatics batch platforms (Batch Virtual Appliance, Batch Container and SaaS). The transcription_config section of the message should be almost identical between the two modes. There are some minor differences (for example, batch offers a different set of diarization options, and real-time has some settings which don't apply to batch, such as max_delay).

TranscriptionConfig

A transcription_config structure is used to specify various configuration values for the recognition engine when the StartRecognition message is sent to the server. All values apart from language are optional. Here's an example of the StartRecognition message with this structure:

{
    "message": "StartRecognition",
    "transcription_config": {
        "language": "en"
    },
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_f32le",
        "sample_rate": 16000
    }
}
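Real-time-only settings mentioned above, such as max_delay, go in the same transcription_config block. A sketch of a StartRecognition message that sets it (the values shown are illustrative):

{
    "message": "StartRecognition",
    "transcription_config": {
        "language": "en",
        "output_locale": "en-GB",
        "max_delay": 3
    },
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 16000
    }
}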

AddAudio

Once the WebSocket session is set up and you have successfully sent the StartRecognition message, you will receive a RecognitionStarted message from the server. You can then simply send the binary audio chunks, which we refer to as AddAudio messages.

In the V2 API this is done with much simpler code than before:

// Send a chunk of binary audio data as an AddAudio message.
// seqNoIn is a counter of the AddAudio messages sent so far (see below).
function addAudio(audioData) {
    ws.send(audioData);
    seqNoIn++;
}

We recommend that you do not send more than 10 seconds of audio data or 500 individual AddAudio messages ahead of time.
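One way to respect this limit is to compare the number of AddAudio messages you have sent with the acknowledgements received from the server. A minimal sketch, assuming the AudioAdded acknowledgement carries a seq_no field (check the API reference for the exact schema):

// Sketch of simple client-side flow control.
var seqNoIn = 0;   // number of AddAudio messages sent
var seqNoAck = 0;  // last seq_no acknowledged by the server (assumed field)

function audioAdded(msg) {
    seqNoAck = msg.seq_no;
}

function trySendAudio(audioData) {
    // Stay under the recommended limit of 500 unacknowledged messages.
    if (seqNoIn - seqNoAck < 500) {
        ws.send(audioData);
        seqNoIn++;
        return true;
    }
    return false; // caller should buffer and retry later
}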

Final and Partial Transcripts

The AddTranscript and AddPartialTranscript messages from the server output JSON which aligns with the output format used by other Speechmatics products. There is now a results list which contains the transcribed words and punctuation marks, along with timings and confidence scores. Here's an example of a final transcript output:

{
   "message":"AddTranscript",
   "results":[
      {
         "start_time":0.11670026928186417,
         "end_time":0.4049381613731384,
         "alternatives":[
            {
               "content":"gale",
               "confidence":0.7034434080123901
            }
         ],
         "type":"word"
      },
      {
         "start_time":0.410246878862381,
         "end_time":0.6299981474876404,
         "alternatives":[
            {
               "content":"eight",
               "confidence":0.670033872127533
            }
         ],
         "type":"word"
      },
      {
         "start_time":0.6599999666213989,
         "end_time":1.0799999237060547,
         "alternatives":[
            {
               "content":"becoming",
               "confidence":1.0
            }
         ],
         "type":"word"
      },
      {
         "start_time":1.0799999237060547,
         "end_time":1.6154180765151978,
         "alternatives":[
            {
               "content":"cyclonic",
               "confidence":1.0
            }
         ],
         "type":"word"
      },
      {
         "start_time":1.6154180765151978,
         "is_eos":true,
         "end_time":1.6154180765151978,
         "alternatives":[
            {
               "content":".",
               "confidence":1.0
            }
         ],
         "type":"punctuation"
      }
   ],
   "metadata":{
      "transcript":"gale eight becoming cyclonic.",
      "start_time":190.65994262695312,
      "end_time":194.46994256973267
   },
   "format":"2.4"
}

You can use the metadata.transcript property to get the complete final transcript as a chunk of plain text. The format property describes the exact version of the transcription output format, which is currently 2.4. This may change in future releases if the output format is updated.
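For illustration, a short sketch of reading these fields from a parsed AddTranscript message (the variable name is illustrative):

// objMsg is an AddTranscript message already parsed with JSON.parse().
console.log(objMsg.metadata.transcript);   // "gale eight becoming cyclonic."

objMsg.results.forEach(function(result) {
    var best = result.alternatives[0];
    console.log(result.type, best.content,
                result.start_time, result.end_time, best.confidence);
});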

Advanced Punctuation

Some language models (currently English, French, German and Spanish) support advanced punctuation. This uses machine learning techniques to add more naturalistic punctuation, improving the readability of your transcripts. As well as placing punctuation marks in more natural positions in the output, additional punctuation marks such as commas (,), exclamation marks (!) and question marks (?) will also appear.

There is no need to explicitly enable this in the configuration; languages that support advanced punctuation will automatically output these marks. If you do not want to see these punctuation marks in the output, then you can explicitly control this through the punctuation_overrides setting within the transcription_config object, for example:

"transcription_config": {
   "language": "en",
   "punctuation_overrides": {
      "permitted_marks":[ "." ]
   }
}

Note that changing the punctuation settings from the default can take a couple of seconds to apply. This means that if you use a non-default punctuation configuration (for example a different sensitivity), there will be a slight delay (2-3 seconds) after the StartRecognition message is sent before the RecognitionStarted message is returned.

The JSON output places punctuation marks in the results list marked with a type of "punctuation", so you can also filter on the output if you want to modify or remove punctuation.
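For example, a client-side sketch that removes punctuation entries and rebuilds a plain word string:

// Filter punctuation entries out of the results list and join the
// remaining word contents.
var unpunctuated = objMsg.results
    .filter(function(result) { return result.type !== "punctuation"; })
    .map(function(result) { return result.alternatives[0].content; })
    .join(" ");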

Example Usage

This section provides some client code samples that show simple usage of the V2 WebSockets Speech API. It shows how you can test your Real-Time Appliance or Container using a minimal WebSocket client.

JavaScript

This section shows the basic usage of the WebSockets interface from a JavaScript client. First, you set up the connection to the server and define the various event handlers that are required:

var ws = new WebSocket('wss://rta:9000/v2');
ws.binaryType = "arraybuffer";
ws.onopen = function(event) { onOpen(event) };
ws.onmessage = function(event) { onMessage(event) };
ws.onclose = function(event) { onClose(event) };
ws.onerror = function(event) { onError(event) };

In the above example, the hostname of the Real-Time Appliance or Container is rta – change this to match the IP address or hostname of your Real-Time Appliance or Container. The port used is 9000 and you need to make sure that you add '/v2' to the WebSocket URI. Note that the Real-Time Appliance only supports the secure WebSocket (wss) protocol, whereas the Real-Time Container only supports the plain WebSocket (ws) protocol. You should also ensure that the binaryType property of the WebSocket object is set to "arraybuffer".

In the onopen handler you initiate the session by sending the StartRecognition message to the server, for example:

function onOpen(evt) {
    var msg = {
        "message": "StartRecognition",
        "transcription_config": {
            "language": "en",
            "output_locale": "en-GB"
        },
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": 16000
        }
    };

    ws.send(JSON.stringify(msg));
}

An onmessage handler is where you will respond to the server-initiated messages sent by the appliance or container, and decide how to handle them. Typically, this involves implementing functions to display or process data that you get back from the server.

function onMessage(evt) {
    var objMsg = JSON.parse(evt.data);

    switch (objMsg.message) {
        case "RecognitionStarted":
            recognitionStarted(objMsg); 
            break;

        case "AudioAdded":
            audioAdded(objMsg);  
            break;

        case "AddPartialTranscript":
        case "AddTranscript":
            transcriptOutput(objMsg);  
            break;

        case "EndOfTranscript":
            endTranscript();  
            break;

        case "Info":
        case "Warning":
        case "Error":
            showMessage(objMsg);  
            break;

        default:
            console.log("UNKNOWN MESSAGE: " + objMsg.message);
    }
}

Once the WebSocket is initialized, the StartRecognition message is sent to the appliance or container to set up the audio input. It is then a matter of sending audio data periodically using the AddAudio message.

Your AddAudio message will take audio from a source (for example microphone input, or an audio stream) and pass it to the Real-Time Appliance or Container.

// Send audio data to the API as an AddAudio message.
function addAudio(audioData) {
   ws.send(audioData);
   seqNoIn++;
}

In this example we use a counter, seqNoIn, to keep track of the AddAudio messages we've sent.
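If your audio source provides Float32 samples (as the Web Audio API does) and you declared pcm_s16le in the StartRecognition message, you will need to convert each buffer before passing it to addAudio. A minimal sketch (the usage line is illustrative):

// Convert Float32 samples to 16-bit little-endian PCM to match the
// pcm_s16le audio_format declared in StartRecognition.
function floatTo16BitPCM(float32Array) {
    var int16 = new Int16Array(float32Array.length);
    for (var i = 0; i < float32Array.length; i++) {
        var s = Math.max(-1, Math.min(1, float32Array[i]));
        int16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }
    return int16.buffer;
}

// Example usage with a Web Audio API buffer:
// addAudio(floatTo16BitPCM(audioBuffer.getChannelData(0)));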

A set of server-initiated transcript messages are triggered to indicate the availability of transcribed text:

  • AddTranscript
  • AddPartialTranscript

See above for changes to the JSON output schema in the V2 API. For full details of the output schema refer to the AddTranscript section in the API reference.
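As a sketch, one way to implement the transcriptOutput handler used in the onMessage example above is to treat finals as stable text to accumulate and partials as provisional text that may be revised (this assumes a partial covers only the audio since the last final; check the API reference for the exact semantics):

var finalText = "";
var partialText = "";

function transcriptOutput(msg) {
    if (msg.message === "AddTranscript") {
        finalText += msg.metadata.transcript;   // finals will not change
        partialText = "";
    } else {
        partialText = msg.metadata.transcript;  // partials may be revised
    }
    console.log(finalText + partialText);
}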

Finally, the client should send an EndOfStream message and close the WebSocket when it terminates. This releases resources on the appliance or container and allows other clients to connect.
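For example, a sketch of a clean shutdown, assuming the EndOfStream message carries a last_seq_no field (see the API reference for the exact schema):

// Signal the end of the audio stream, then close the socket once the
// server confirms with EndOfTranscript.
function endSession() {
    ws.send(JSON.stringify({
        "message": "EndOfStream",
        "last_seq_no": seqNoIn   // assumed field; see the API reference
    }));
}

function endTranscript() {
    // Called from onMessage when EndOfTranscript arrives.
    ws.close();
}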

The Mozilla developer network provides a useful reference to the WebSocket API.

Python

Real-Time Virtual Appliance Usage

Speechmatics provides a Python library called smwebsocket-py, which is a wrapper around the WebSocket API for use with the Real-Time Virtual Appliance, making it easy to incorporate Speechmatics real-time transcription into your Python program. Please contact support@speechmatics.com if you require this library.

The smwebsocket-py library can be incorporated into your own applications, used as a reference for your own client library, or called directly from the command line (CLI) like this (to pass a test audio file to the appliance or container):

python -m smwebsocket.cli --url wss://rta:9000/v2 --max-delay 3 --lang en test.mp3

Note that configuration options are specified on the command line as parameters, with a '_' character in the configuration option being replaced by a '-'. The CLI also accepts an audio stream on standard input, meaning that you can stream in a live microphone feed. To get help on the CLI use the following command:

python -m smwebsocket.cli --help

The library depends on Python 3.7 or above, since it makes use of some of the newer asyncio features introduced with Python 3.7.

Standalone Real-Time Container Usage

If you are using the Real-Time Container, you can use a Python library called speechmatics-python. Please contact support@speechmatics.com if you require this library. You can also use this library for the Real-Time Virtual Appliance.

The speechmatics-python library can be incorporated into your own applications, used as a reference for your own client library, or called directly from the command line (CLI) like this (to pass a test audio file to the appliance or container):

speechmatics transcribe --url ws://rtc:9000/v2 --lang en --ssl-mode none test.mp3

Note that configuration options are specified on the command line as parameters, with a '_' character in the configuration option being replaced by a '-'. The CLI also accepts an audio stream on standard input, meaning that you can stream in a live microphone feed. To get help on the CLI use the following command:

speechmatics transcribe --help

The library depends on Python 3.7 or above, since it makes use of some of the newer asyncio features introduced with Python 3.7.