API How-to Guide

Examples of how to use the V2 API

The V2 WebSocket Speech API aligns with other Speechmatics platforms such as the Batch Virtual Appliance and Speechmatics Cloud Offering.

WebSocket URI

To use the V2 API, connect to the '/v2' endpoint of the WebSocket URI, for example:

ws://rt-asr.example.com:9000/v2

If you are using the Real-time Container you will need to use the ws:// scheme, as shown above. If you need to access the Real-time Container over a secure WebSocket connection from your client, you will need to consider SSL offload from a load balancer or similar.

Session Configuration

The V2 API is configured by sending a StartRecognition message when the WebSocket connection is first established. The format of this message is designed to be very similar to the config.json object used with the Speechmatics batch platforms (Batch Virtual Appliance, Batch Container and Cloud Offering). The transcription_config section of the message should be almost identical between the two modes. There are some minor differences: for example, batch offers a different set of diarization options, and real-time has some settings which do not apply to batch, such as max_delay.

TranscriptionConfig

A transcription_config structure is used to specify various configuration values for the recognition engine when the StartRecognition message is sent to the server. All values apart from language are optional. Here's an example of a StartRecognition message containing this structure:

{
   "message": "StartRecognition",
   "transcription_config": {
      "language": "en"
   },
   "audio_format": {
      "type": "raw",
      "encoding": "pcm_f32le",
      "sample_rate": 16000
   }
}
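
Real-time-only settings such as max_delay are added to the same transcription_config structure. The following is an illustrative sketch only; the value of 4 seconds is an example rather than a recommended setting:

{
  "transcription_config": {
    "language": "en",
    "max_delay": 4
  }
}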

AddAudio

Once the WebSocket session is set up and you have successfully sent StartRecognition, you will receive a RecognitionStarted message from the server. You can then start sending the binary audio chunks, which we refer to as AddAudio messages.

Sending audio in the V2 API requires only very simple code:

// Send a binary AddAudio message; seqNoIn counts the chunks sent so far.
function addAudio(audioData) {
    ws.send(audioData);
    seqNoIn++;
}

We recommend that you do not send more than 10 seconds of audio data or 500 individual AddAudio messages ahead of time.
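
One way to stay within these limits is to compare the number of AddAudio messages sent with the AudioAdded acknowledgements received from the server, and pause sending when too many are outstanding. The sketch below assumes the AudioAdded message carries a seq_no field (see the API reference for the exact schema); the names seqNoAcked, MAX_OUTSTANDING and canSendMore are illustrative:

// Counters for AddAudio messages sent and AudioAdded acknowledgements received.
var seqNoIn = 0;
var seqNoAcked = 0;
var MAX_OUTSTANDING = 500;   // recommended ceiling for unacknowledged AddAudio messages

function audioAdded(msg) {
    // Called from your onmessage handler when an AudioAdded message arrives.
    seqNoAcked = msg.seq_no;
}

function canSendMore() {
    return (seqNoIn - seqNoAcked) < MAX_OUTSTANDING;
}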

Final and Partial Transcripts

The AddTranscript and AddPartialTranscript messages from the server use a JSON format which aligns with the JSON output format used by other Speechmatics products. There is now a results list which contains the transcribed words and punctuation marks, along with timings and confidence scores. Here's an example of a final transcript output:

{
   "message":"AddTranscript",
   "results":[
      {
         "start_time":0.11670026928186417,
         "end_time":0.4049381613731384,
         "alternatives":[
            {
               "content":"gale",
               "confidence":0.7034434080123901
            }
         ],
         "type":"word"
      },
      {
         "start_time":0.410246878862381,
         "end_time":0.6299981474876404,
         "alternatives":[
            {
               "content":"eight",
               "confidence":0.670033872127533
            }
         ],
         "type":"word"
      },
      {
         "start_time":0.6599999666213989,
         "end_time":1.0799999237060547,
         "alternatives":[
            {
               "content":"becoming",
               "confidence":1.0
            }
         ],
         "type":"word"
      },
      {
         "start_time":1.0799999237060547,
         "end_time":1.6154180765151978,
         "alternatives":[
            {
               "content":"cyclonic",
               "confidence":1.0
            }
         ],
         "type":"word"
      },
      {
         "start_time":1.6154180765151978,
         "is_eos":true,
         "end_time":1.6154180765151978,
         "alternatives":[
            {
               "content":".",
               "confidence":1.0
            }
         ],
         "type":"punctuation"
      }
   ],
   "metadata":{
      "transcript":"gale eight becoming cyclonic.",
      "start_time":190.65994262695312,
      "end_time":194.46994256973267
   },
   "format":"2.7"
}

You can use the metadata.transcript property to get the complete final transcript as a chunk of plain text. The format property describes the exact version of the transcription output format, which is currently 2.7. This may change in future releases if the output format is updated.
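
If you need more than the plain text, you can walk the results list yourself. The sketch below rebuilds the transcript from the individual items, attaching punctuation marks directly to the preceding word; it assumes each item has at least one entry in alternatives, as in the example above:

function buildTranscript(results) {
    var text = "";
    results.forEach(function (item) {
        var content = item.alternatives[0].content;
        if (item.type === "punctuation") {
            text += content;                     // no space before punctuation marks
        } else {
            text += (text ? " " : "") + content; // separate words with spaces
        }
    });
    return text;
}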

Requesting an enhanced model

Speechmatics supports two different models within each language pack: a standard model and an enhanced model. The standard model is the faster of the two, whilst the enhanced model provides higher accuracy at the cost of a slower turnaround time.

The enhanced model is a premium model. Please contact your account manager or Speechmatics if you would like access to this feature.

An example of requesting the enhanced model is below:

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_f32le",
    "sample_rate": 16000
  },
  "transcription_config": {
    "language": "en",
    "operating_point": "enhanced"
  }
}

Please note: standard, as well as being the default option, can also be explicitly requested with the operating_point parameter.
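
For example, to request the standard model explicitly:

{
  "transcription_config": {
    "language": "en",
    "operating_point": "standard"
  }
}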

Domain Language Packs

Some Speechmatics language packs are optimized for specific domains where high accuracy for particular vocabulary and terminology is required. The domain parameter provides additional transcription accuracy and must be used in conjunction with a standard language pack (currently this is limited to the "finance" domain, which supports the "en" language pack). An example of how this looks is below:

{
  "transcription_config": {
    "language": "en",
    "domain": "finance"
  }
}

These domain language packs are built on top of our global language packs, so they retain the high accuracy across different acoustic environments that our customers have come to expect.

Please note that if you are using the "Finance" domain language pack you will need to use the "en-finance" container image, located at speechmatics-docker-public.jfrog.io/batch-asr-transcriber-en-finance. More details about how to pull container images can be found here.

It is expected that whilst there will be improvements for the specific domain, there can be some degradation in accuracy for content outside that domain.

Advanced punctuation

All Speechmatics language packs support Advanced Punctuation. This uses machine learning techniques to add in more naturalistic punctuation, improving the readability of your transcripts.

The following punctuation marks are supported for each language:

Language(s)          | Supported Punctuation | Comment
Cantonese, Mandarin  | 。 ? ! 、              | Full-width punctuation supported
Japanese             | 。 、                  | Full-width punctuation supported
Hindi                | । ? , !               |
All other languages  | . , ! ?               |

If you do not want to see some of the supported punctuation marks in the output, you can explicitly control this through the punctuation_overrides settings, for example:

"transcription_config": {
   "language": "en",
   "punctuation_overrides": {
      "permitted_marks":[ ".", "," ]
   }
}

This will exclude exclamation and question marks from the returned transcript.

All Speechmatics output formats support Advanced Punctuation. JSON output places punctuation marks in the results list marked with a type of "punctuation".

Example Usage

This section provides some client code samples that show simple usage of the V2 WebSockets Speech API. It shows how you can test your Real-Time Appliance or Container using a minimal WebSocket client.

JavaScript

The basic usage of the WebSockets interface from a JavaScript client is shown in this section. In the first instance you set up the connection to the server and define the various event handlers that are required:

var ws = new WebSocket('ws://rtc:9000/v2');
ws.binaryType = "arraybuffer";
ws.onopen = function(event) { onOpen(event) };
ws.onmessage = function(event) { onMessage(event) };
ws.onclose = function(event) { onClose(event) };
ws.onerror = function(event) { onError(event) };

Change the hostname in the above example to match the IP address or hostname of your Real-Time Appliance or Container. The port used is 9000, and you need to make sure that you add '/v2' to the WebSocket URI. Note that the Real-time Container only supports the WebSocket (ws) protocol. You should also ensure that the binaryType property of the WebSocket object is set to "arraybuffer".

In the onopen handler you initiate the session by sending the StartRecognition message to the server, for example:

function onOpen(evt) {
    var msg = {
        "message": "StartRecognition",
        "transcription_config": {
            "language": "en",
            "output_locale": "en-GB"
        },
        "audio_format": {
            "type": "raw",
            "encoding": "pcm_s16le",
            "sample_rate": 16000
        }
    };

    ws.send(JSON.stringify(msg));
}

An onmessage handler is where you will respond to the server-initiated messages sent by the appliance or container, and decide how to handle them. Typically, this involves implementing functions to display or process data that you get back from the server.

function onMessage(evt) {
    var objMsg = JSON.parse(evt.data);

    switch (objMsg.message) {
        case "RecognitionStarted":
            recognitionStarted(objMsg); // TODO
            break;

        case "AudioAdded":
            audioAdded(objMsg);  // TODO
            break;

        case "AddPartialTranscript":
        case "AddTranscript":
            transcriptOutput(objMsg);  // TODO
            break;

        case "EndOfTranscript":
            endTranscript();  // TODO
            break;

        case "Info":
        case "Warning":
        case "Error":
            showMessage(objMsg);  // TODO
            break;

        default:
            console.log("UNKNOWN MESSAGE: " + objMsg.message);
    }
}

Once the WebSocket is initialized, the StartRecognition message is sent to the appliance or container to setup the audio input. It is then a matter of sending audio data periodically using the AddAudio message.

Your AddAudio message will take audio from a source (for example microphone input, or an audio stream) and pass it to the Real-Time Appliance or Container.

// Send audio data to the API as a binary AddAudio message.
function addAudio(audioData) {
   ws.send(audioData);
   seqNoIn++;
}

In this example we use a counter seqNoIn to keep track of the AddAudio messages we've sent.
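
As an illustration of feeding microphone input to this function, the sketch below uses the Web Audio API to capture audio and pass the raw samples to addAudio. It assumes the session was started with a raw pcm_f32le audio_format at a sample rate of 16000 (getChannelData returns 32-bit float samples, which matches that encoding on little-endian machines); startMicrophoneCapture is a hypothetical helper, not part of the Speechmatics API:

// Capture microphone audio and stream it to the server via addAudio().
// Assumes StartRecognition requested "type": "raw", "encoding": "pcm_f32le", "sample_rate": 16000.
function startMicrophoneCapture() {
    navigator.mediaDevices.getUserMedia({ audio: true }).then(function (stream) {
        var audioContext = new AudioContext({ sampleRate: 16000 });
        var source = audioContext.createMediaStreamSource(stream);
        // ScriptProcessorNode is deprecated but keeps this sketch short;
        // an AudioWorklet is the modern alternative.
        var processor = audioContext.createScriptProcessor(4096, 1, 1);
        processor.onaudioprocess = function (event) {
            var samples = event.inputBuffer.getChannelData(0); // Float32Array of samples
            addAudio(new Float32Array(samples).buffer);        // copy and send as pcm_f32le
        };
        source.connect(processor);
        processor.connect(audioContext.destination);
    });
}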

A set of server-initiated transcript messages are triggered to indicate the availability of transcribed text:

  • AddTranscript
  • AddPartialTranscript

See above for changes to the JSON output schema in the V2 API. For full details of the output schema refer to the AddTranscript section in the API reference.

Finally, the client should send an EndOfStream message and close the WebSocket when it terminates. This should be done in order to release resources on the appliance or container and allow other clients to connect and use resources.
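
A minimal sketch of ending a session cleanly is shown below. It assumes the EndOfStream message takes a last_seq_no field holding the number of AddAudio messages sent, as described in the API reference; endSession is an illustrative name:

function endSession() {
    // Tell the server that no more audio will be sent.
    ws.send(JSON.stringify({
        "message": "EndOfStream",
        "last_seq_no": seqNoIn
    }));
}

function endTranscript() {
    // Called from the onmessage handler once EndOfTranscript is received;
    // it is now safe to close the WebSocket and release server resources.
    ws.close();
}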

The Mozilla developer network provides a useful reference to the WebSocket API.

Python

Standalone Real-Time Container Usage

If you are using the Real-Time Container, you can use a Python library called speechmatics-python, which is available on GitHub. You can also use this library with the Real-Time Virtual Appliance.

The speechmatics-python library can be incorporated into your own applications, used as a reference for your own client library, or called directly from the command line (CLI) like this (to pass a test audio file to the appliance or container):

speechmatics transcribe --url ws://rtc:9000/v2 --lang en --ssl-mode none test.mp3

Note that configuration options are specified on the command line as parameters, with any '_' character in the configuration option name replaced by a '-'. The CLI also accepts an audio stream on standard input, meaning that you can stream in a live microphone feed. To get help on the CLI use the following command:

speechmatics transcribe --help
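
As an illustration of the option mapping described above, and assuming the version of the CLI you have installed exposes the operating_point option, it would be passed as --operating-point:

speechmatics transcribe --url ws://rtc:9000/v2 --lang en --ssl-mode none --operating-point enhanced test.mp3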

The library depends on Python 3.7 or above, since it makes use of some of the newer asyncio features introduced with Python 3.7.