This guide will walk you through the steps needed to deploy the Speechmatics Real-time Container ready for transcription.
After completing these steps, the Docker image can be used to create containers that will transcribe audio files. More information about using the API for real-time transcription is detailed in the Speech API guide.
Speechmatics containerized deployments are built on the Docker platform. At present, a separate Docker image is required for each language to be transcribed. Each Docker image requires about 3GB of storage.
A single image can be used to create and run multiple containers concurrently; each running container requires its own dedicated resources.
The host machine requires a processor with the following minimum specification: Intel® Xeon® CPU E5-2630 v4 (Broadwell) 2.20GHz (or equivalent). This is important because these chipsets (and later ones) support Advanced Vector Extensions (AVX). The machine learning algorithms used by Speechmatics ASR require the performance optimizations that AVX provides. You should also ensure that your hypervisor has AVX enabled.
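As a quick sanity check, on a Linux host you can confirm that the CPU advertises AVX by inspecting /proc/cpuinfo (an illustrative check; it assumes a Linux host):

grep -q avx /proc/cpuinfo && echo "AVX supported" || echo "AVX not supported"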
The Speechmatics Docker images are obtainable from the Speechmatics Docker repository (jfrog.io). If you do not have a Speechmatics software repository account or have lost your details, please contact Speechmatics support at support@speechmatics.com.
The latest information about the containers can be found in the solutions section of the support portal. If a support account is not available or the Containers section is not visible in the support portal, please contact Speechmatics support at support@speechmatics.com for help.
Prior to pulling any Docker images, the following must be known:

- The Speechmatics Docker repository URL, along with your software repository username and password
- The language code of each image required (e.g. en for English or de for German)

After gaining access to the relevant details for the Speechmatics software repository, follow the steps below to log in and pull the Docker images that are required.
Ensure the Speechmatics Docker URL and software repository username and password are available, and that Docker is installed on the machine being used. Then log in to the repository. For example:
docker login https://speechmatics-docker-example.jfrog.io
You will be prompted for a username and password. If the login is successful, you will see the response:
Login Succeeded
If unsuccessful, please verify your credentials and URL. If problems persist, please contact Speechmatics support.
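For scripted or automated deployments, Docker's standard --password-stdin option can be used to avoid the interactive prompt (a sketch; the environment variable names are placeholders):

echo "$SPEECHMATICS_PASSWORD" | docker login https://speechmatics-docker-example.jfrog.io --username "$SPEECHMATICS_USERNAME" --password-stdin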
To pull a Docker image to the local environment, follow the instructions below. Each supported language pack comes as a different Docker image, so the process will need to be repeated for each required language.
Example pulling Global English (en) with the 1.0.0 TAG:
docker pull speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0
Example pulling Spanish (es) with the 1.0.0 TAG:
docker pull speechmatics-docker-example.jfrog.io/rt-asr-transcriber-es:1.0.0
The image will start to download. This could take a while depending on your connection speed.
Note: Speechmatics require all customers to cache a copy of the Docker image(s) within their own environment. Please do not pull directly from the Speechmatics software repository for each deployment.
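For example, a pulled image could be retagged and pushed to an internal registry (a sketch; registry.example.com is a placeholder for your own registry):

docker tag speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0 registry.example.com/rt-asr-transcriber-en:1.0.0
docker push registry.example.com/rt-asr-transcriber-en:1.0.0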
Once the Docker image has been pulled into a local environment, it can be started using the docker run command. More details about operating and managing the container are available in the Docker API documentation.
Here are a couple of examples of how to start the container from the command line:
docker run -p 9000:9000 -p 8001:8001 speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0
docker run -p 9000:9000 -p 8001:8001 speechmatics-docker-example.jfrog.io/rt-asr-transcriber-es:1.0.0
The docker run options used are:
| Name | Description |
|---|---|
| --publish, -p | Expose ports on the container so that they are accessible from the host |
See the Docker docs for a full list of available options.
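As an illustration, the container can also be run detached and given a name using standard Docker options (a sketch; the container name rt-asr-en is arbitrary):

docker run -d --name rt-asr-en -p 9000:9000 -p 8001:8001 speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0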
The supported method for passing audio to a container is a WebSocket. A session is set up with configuration parameters passed in a StartRecognition message; thereafter, audio is sent to the container in binary chunks, and transcripts are returned in AddTranscript messages.
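As an illustration, a StartRecognition message takes the following general shape (a sketch only; the exact fields and supported values are defined in the Speech API guide, and the audio format shown here is an assumption):

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 16000
  },
  "transcription_config": {
    "language": "en"
  }
}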
In the AddTranscript message, individual result segments are returned, corresponding to audio segments defined by pauses (and other latency measurements). The results list in the V2 output format is sorted by increasing start_time, with a supplementary rule to sort by decreasing end_time (so when two results share a start time, the longer one comes first). Confidence is reported to a precision of 6 decimal places. See below for an example:
{
"message": "AddTranscript",
"format": "2.5",
"metadata": {
"transcript": "full tell radar",
"start_time": 0.11,
"end_time": 1.07
},
"results": [
{
"type": "word",
"start_time": 0.11,
"end_time": 0.40,
"alternatives": [
{ "content": "full", "confidence": 0.7 }
]
},
{
"type": "word",
"start_time": 0.41,
"end_time": 0.62,
"alternatives": [
{ "content": "tell", "confidence": 0.6 }
]
},
{
"type": "word",
"start_time": 0.65,
"end_time": 1.07,
"alternatives": [
{ "content":"radar", "confidence": 1.0 }
]
}
]
}