Real-time Container Quick Start Guide

This guide will walk you through the steps needed to deploy the Speechmatics Real-time Container ready for transcription.

  • Check system requirements
  • Pull the Docker Image
  • Run the Container

After these steps, the Docker Image can be used to create containers that will transcribe audio files. More information about using the API for real-time transcription is detailed in the Speech API guide.

System Requirements

Speechmatics containerized deployments are built on the Docker platform. At present, a separate Docker image is required for each language to be transcribed. Each Docker image takes about 3GB of storage.

A single image can be used to create and run multiple containers concurrently. Each running container requires the following resources:

  • 1 vCPU
  • 2-5GB RAM
  • 100MB hard disk space

If you are using the enhanced model, it is recommended to provision towards the upper end of the RAM range.
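
If you want Docker itself to enforce these per-container limits, the standard --cpus and --memory options of docker run can be used. The sketch below uses the English image name introduced later in this guide; the port and licensing options shown later still apply:

docker run --cpus 1 --memory 5g rt-asr-transcriber-en:2.1.0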

Host recommended specs

The host machine requires a processor with the following microarchitecture specifications to run at the expected performance:

  • If using the standard model offering, at least the Broadwell class is required
  • If using the enhanced model offering, at least the Cascade Lake class is required
  • If using the enhanced model, it is also recommended that the hardware supports the AVX512_VNNI flag, as this will greatly improve transcription processing speed
    • Examples of this among popular hosting providers include the Microsoft Azure Dsv4 series and the Amazon EC2 M5n instance class
    • Disabling hyperthreading when running the enhanced model can also improve transcription speed; refer to the Amazon Web Services and Microsoft Azure documentation for instructions on how to disable it

AVX flags

Advanced Vector Extensions (AVX) are necessary to allow Speechmatics to carry out transcription.

  • For the enhanced model, it is recommended to use a processor that supports the AVX512_VNNI flag, as this will substantially improve transcription processing speed.
  • For the standard model, it is necessary to use at least a processor that supports Advanced Vector Extensions 2 (AVX2).
    • You should also ensure that AVX2 is enabled in your hypervisor.
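
On a Linux host, one quick way to confirm which of these flags the processor exposes is to inspect /proc/cpuinfo (a minimal sketch; flag names may vary slightly between kernels):

grep -o -w -e avx2 -e avx512_vnni /proc/cpuinfo | sort -u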

Architecture

Each container:

  • Provides the ability to transcribe speech data in a predefined language from a live stream or a recorded audio file. The container receives audio input over the WebSocket protocol, and provides the following output:
    • Words in the transcript
    • Word confidence
    • Timing information
    • Relevant metadata information
  • Multiple instances of the container can be run on the same Docker host. This enables scaling of a single language or multiple languages as required
  • All data is transitory: once a container completes its transcription, it removes all record of the operation. No data is persisted.

Supported File Formats

Only the following file formats are supported:

  • aac
  • amr
  • flac
  • m4a
  • mp3
  • mp4
  • mpg
  • ogg
  • wav

Accessing the Image

The Speechmatics Docker images are available from the Speechmatics Docker repository (jfrog.io). If you do not have a Speechmatics software repository account or have lost your details, please contact Speechmatics support at support@speechmatics.com.

The latest information about the containers can be found in the solutions section of the support portal. If a support account is not available or the Containers section is not visible in the support portal, please contact Speechmatics support at support@speechmatics.com for help.

Prior to pulling any Docker images, the following must be known:

  • Speechmatics Docker credentials – provided by the Speechmatics team
  • Speechmatics Docker URL - https://speechmatics-docker-public.jfrog.io
  • Image name (which usually includes the language code of the target language, e.g. en for English or de for German)
  • Image tag - which identifies the image version

Getting the Image

After gaining access to the relevant details for the Speechmatics software repository, follow the steps below to log in and pull the required Docker images using a method such as the Docker CLI.

Software Repository Login

Ensure the Speechmatics Docker URL and software repository username and password are available. The machine being used will require Docker to be installed. For example:

docker login https://speechmatics-docker-public.jfrog.io

You will be prompted for username and password. If successful, you will see the response:

Login Succeeded

If unsuccessful, please verify your credentials and URL. If problems persist, please contact Speechmatics Support.
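
If you need a non-interactive login, for example in a CI pipeline, the standard docker login flags can be used. In the sketch below, YOUR_USERNAME and the password file are placeholders:

cat speechmatics_password.txt | docker login --username YOUR_USERNAME --password-stdin https://speechmatics-docker-public.jfrog.io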

Pulling the Image

To pull the Docker image to the local environment follow the instructions below. Each supported language pack comes as a different Docker image, so the process will need to be repeated for each required language.

Example pulling Global English (en) with the 2.1.0 tag:

docker pull speechmatics-docker-public.jfrog.io/rt-asr-transcriber-en:2.1.0

Example pulling Spanish (es) with the 2.1.0 tag:

docker pull speechmatics-docker-public.jfrog.io/rt-asr-transcriber-es:2.1.0

The image will start to download. This could take a while depending on your connection speed.

Docker Image Caching

Speechmatics require all customers to cache a copy of the Docker image(s) within their own environment.

Please do not pull directly from the Speechmatics software repository for each deployment.
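
One way to cache the image is to retag the pulled image and push it to your own registry. In the sketch below, registry.example.com is a placeholder for your internal registry:

docker tag speechmatics-docker-public.jfrog.io/rt-asr-transcriber-en:2.1.0 registry.example.com/rt-asr-transcriber-en:2.1.0
docker push registry.example.com/rt-asr-transcriber-en:2.1.0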

As of February 2021, all Speechmatics containers are built using Docker BuildKit. This should not impact your internal management of the Speechmatics container. If you use JFrog to host the Speechmatics container there may be some cosmetic UI issues, but these should not impact your ability to pull and run the container. If your internal registry uses Nexus and self-signed certificates, please make sure you are on Nexus version 3.15 or above, or you may encounter errors.

Licensing

You should have received a confidential license file from Speechmatics containing a token to use to license your container. The contents of the file received should look similar to this:

{
    "contractid": 1,
    "creationdate": "2020-03-24 17:43:35",
    "customer": "Speechmatics",
    "id": "c18a4eb990b143agadeb384cbj7b04c3",
    "is_trial": true,
    "metadata": {
        "key_pair_id": 1,
        "request": {
            "customer": "Speechmatics",
            "features": [
                "MAPRT",
                "LANY"
            ],
            "isTrial": true,
            "notValidAfter": "2021-01-01",
            "validFrom": "2020-01-01"
        }
    },
    "signedclaimstoken": "example",
}

The validFrom and notValidAfter keys in the license file specify the start and end dates for the validity of your license. The license is valid from 00:00 UTC on the start date to 00:00 UTC on the expiry date. After the expiry date, the container will continue to run but will not transcribe audio. You should apply for a new license before this happens.

Licensing does not require an internet connection.

There are two ways to apply the license to the container.

  • As a volume-mapped file

The license file should be mapped to the path /license.json within the container. For example:

docker run --volume $PWD/my_license.json:/license.json:ro rt-asr-transcriber-en:2.1.0

  • As an environment variable

Setting an environment variable named LICENSE_TOKEN is also a valid way to license the container. The contents of this variable should be set to the value of the signedclaimstoken from within the license file.

For example, copy the signedclaimstoken from the license file (without the quotation marks) and set the environment variable as below:

docker run -e LICENSE_TOKEN='example' rt-asr-transcriber-en:2.1.0

If both a volume-mapped file and an environment variable are provided simultaneously then the volume-mapped file will be ignored.
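
If you prefer to set the environment variable directly from the license file rather than copying the token by hand, something like the following works, assuming the license is saved as my_license.json and the jq utility is available:

TOKEN_VALUE=$(jq -r .signedclaimstoken my_license.json)
docker run -e LICENSE_TOKEN="$TOKEN_VALUE" rt-asr-transcriber-en:2.1.0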

Using the Container

Once the Docker image has been pulled into a local environment, it can be started using the docker run command, either via a wrapper or via the CLI. More details about operating and managing the container are available in the Docker API documentation.

Here's an example of how to start the container from the command-line:

docker run -p 9000:9000 -p 8001:8001 -e LICENSE_TOKEN='example' rt-asr-transcriber-en:2.1.0

The Docker run options used are:

Name        Description
--port, -p  Expose ports on the container so that they are accessible from the host
--env, -e   Set the value of an environment variable

See Docker docs for a full list of the available options.

Input Modes

The supported method for passing audio to a container is to use a WebSocket. A session is set up with configuration parameters passed in using a StartRecognition message, and thereafter audio is sent to the container in binary chunks, with transcripts being returned in AddTranscript messages.

In the AddTranscript message, individual result segments are returned, corresponding to audio segments defined by pauses (and other latency measurements).
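
For illustration, a StartRecognition message might look like the sketch below; the exact schema, supported audio encodings and configuration options are described in the Speech API guide:

{
    "message": "StartRecognition",
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 16000
    },
    "transcription_config": {
        "language": "en"
    }
}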

Output

The results list in the V2 output format is sorted by increasing start_time, with a supplementary rule to sort by decreasing end_time. Confidence precision is to 6 decimal places. See below for an example:

{
    "message": "AddTranscript",
    "format": "2.7",
    "metadata": {
        "transcript": "full tell radar",
        "start_time": 0.11,
        "end_time": 1.07
    },
    "results": [
        {
            "type": "word",
            "start_time": 0.11,
            "end_time": 0.40,
            "alternatives": [
                { "content": "full", "confidence": 0.7 }
            ]
        },
        {
            "type": "word",
            "start_time": 0.41,
            "end_time": 0.62,
            "alternatives": [
                { "content": "tell", "confidence": 0.6 }
            ]
        },
        {
            "type": "word",
            "start_time": 0.65,
            "end_time": 1.07,
            "alternatives": [
                { "content":"radar", "confidence": 1.0 }
            ]
        }
    ]
}

Transcription duration information

The container will output a log message after every transcription session to indicate the duration of speech transcribed during that session. This duration only includes speech, and not any silence or background noise which was present in the audio. It may be useful to parse these log messages if you are asked to report usage back to us, or simply for your own records.

The format of the log messages produced should match the following example:

2020-04-13 22:48:05.312 INFO sentryserver Transcribed 52 seconds of speech

Consider using the following regular expression to extract just the seconds part from the line if you are parsing it:

^.+ .+ INFO sentryserver Transcribed (\d+) seconds of speech$
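
For example, the following sketch extracts and sums the reported seconds from a running container's logs; the container name my-rt-container is a placeholder:

docker logs my-rt-container 2>&1 \
  | sed -nE 's/^.+ .+ INFO sentryserver Transcribed ([0-9]+) seconds of speech$/\1/p' \
  | awk '{ total += $1 } END { print total " seconds of speech transcribed" }'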

Running a Container in Read-Only Mode

Users may wish to run the container in read-only mode. This may be necessary due to their regulatory environment, or a requirement not to write any media file to disk. An example of how to do this is below.

docker run -it --read-only \
-p 9000:9000 \
--tmpfs /tmp \
-e LICENSE_TOKEN=$TOKEN_VALUE \
rt-asr-transcriber-en:2.1.0

The container still requires a temporary directory with write permissions. Users can provide a directory (e.g. /tmp) by using the --tmpfs Docker argument. A tmpfs mount is temporary and persists only in host memory. When the container stops, the tmpfs mount is removed, and files written there won't be persisted.

If customers want to use the shared custom dictionary cache feature, they must also specify the location of the cache and mount it as a volume:

docker run -it --read-only \
-p 9000:9000 \
--tmpfs /tmp \
-v /cachelocation:/cache \
-e LICENSE_TOKEN=$TOKEN_VALUE \
-e SM_CUSTOM_DICTIONARY_CACHE_TYPE=shared \
rt-asr-transcriber-en:2.1.0

Running a Container as a non-root user

A Real-time container can be run as a non-root user with no impact to feature functionality. This may be required if a hosting environment or a company's internal regulations specify that a container must be run as a named user.

Users may run the container as a non-root user by passing the --user $USERNUMBER:$GROUPID option to docker run. The user number and group ID must be non-zero numerical values between 1 and 65535.

An example is below:

docker run -it --user 100:100 \
-p 9000:9000 \
-e LICENSE_TOKEN=$TOKEN_VALUE \
rt-asr-transcriber-en:2.1.0

How to use a Shared Custom Dictionary Cache

For more information on how the Custom Dictionary works, please see the Speech API Guide.

The Speechmatics Real-time Container includes a cache mechanism for custom dictionaries to improve set-up performance for repeated use. By using this cache mechanism, transcription will start more quickly when repeatedly using the same custom dictionaries. You will see performance benefits when re-using the same custom dictionary from the second time onwards.

It is not a requirement to use the shared cache to use the Custom Dictionary.

The cache volume is safe to use from multiple containers concurrently if the operating system and its filesystem support file locking operations. The cache can store multiple custom dictionaries in any language used for transcription. It can support multiple custom dictionaries in the same language.

If the shared cache is specified, any custom dictionary small enough to be stored within the cache volume will be cached automatically.

For more information about how the shared cache storage management works, please see Maintaining the Shared Cache.

We highly recommend you ensure any location you use for the shared cache has enough space for the number of custom dictionaries you plan to allocate there. How to allocate custom dictionaries to the shared cache is documented below.

How to set up the Shared Cache

The shared cache is enabled by setting the following value when running transcription:

  • Cache Location: You must volume map the directory location you plan to use as the shared cache to /cache when starting the container
  • SM_CUSTOM_DICTIONARY_CACHE_TYPE: (mandatory if using the shared cache) This environment variable must be set to shared
  • SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE: (optional if using the shared cache). This determines the maximum size of any single custom dictionary that can be stored within the shared cache in bytes
    • E.g. setting SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE to a value of 10000000 would limit the size of any cached custom dictionary to 10MB
    • For reference, a custom dictionary wordlist with 1000 words produces a cache entry of around 200 kB, or 200000 bytes
    • A value of -1 will allow every custom dictionary to be stored within the shared cache. This is the default value
    • A custom dictionary cache entry larger than the SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE will still be used in transcription, but will not be cached

Maintaining the Shared Cache

If you specify the shared cache and your custom dictionary is within the permitted size, the Speechmatics Real-time Container will always try to cache the custom dictionary. If a new custom dictionary cannot fit in the shared cache because of other custom dictionaries already cached there, older custom dictionaries will be removed from the cache to free up as much space as necessary, starting with the least recently used.

Therefore, you must ensure your cache allocation is large enough to handle the number of custom dictionaries you plan to store. If you are processing multiple custom dictionaries, we recommend a relatively large cache (e.g. 50 MB) to avoid this situation. If you don't allocate sufficient storage, one or more custom dictionaries may be deleted when you are trying to store a new custom dictionary.

It is recommended to use a Docker volume with a dedicated filesystem of limited size. If a user decides to use a volume that shares a filesystem with the host, it is the user's responsibility to purge the cache if necessary.

Creating the Shared Cache

In the example below, transcription is run with a local Docker volume created for the shared cache. It will allow a custom dictionary of up to 5MB to be cached.

docker volume create speechmatics-cache

docker run --rm -d \
  -p 9000:9000 \
  -e LICENSE_TOKEN='example' \
  -e SM_CUSTOM_DICTIONARY_CACHE_TYPE=shared \
  -e SM_CUSTOM_DICTIONARY_CACHE_ENTRY_MAX_SIZE=5000000 \
  -v speechmatics-cache:/cache \
  rt-asr-transcriber-en:2.1.0

speechmatics transcribe --additional-vocab gnocchi --url ws://localhost:9000/v2 --ssl-mode=none test.mp3

Viewing the Shared Cache

If everything is set up correctly and the cache has been used for the first time, a single entry should be present in the cache.

The following example shows how to check what Custom Dictionaries are stored within the cache. This will show the language, the sampling rate, and the checksum value of the cached dictionary entries.

ls $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary
en,16kHz,bef53e5bcca838a39c3707f1134bda6a09ff87aaa09203617528774734455edd

Reducing the Shared Cache Size

Cache size can be reduced by removing some or all cache entries.

rm -rf $(docker inspect -f "{{.Mountpoint}}" speechmatics-cache)/custom_dictionary/*

Manually purging the cache

Before manually purging the cache, ensure that no containers have the volume mounted, otherwise an error might occur during transcription. Consider creating a new Docker volume as a temporary cache while performing purging maintenance on the cache.