This guide will walk you through the steps needed to deploy the Speechmatics Real-time Container ready for transcription.
After completing these steps, the Docker image can be used to create containers that will transcribe audio files. More information about using the API for real-time transcription is detailed in the Speech API guide.
Speechmatics containerized deployments are built on the Docker platform. At present, a separate Docker image is required for each language to be transcribed. Each Docker image requires about 3GB of storage.
A single image can be used to create and run multiple containers concurrently; each running container requires its own dedicated resources.
The host machine requires a processor with the following minimum specification: Intel® Xeon® CPU E5-2630 v4 (Broadwell) 2.20GHz (or equivalent). This is important because these chipsets (and later ones) support Advanced Vector Extensions (AVX). The machine learning algorithms used by Speechmatics ASR require the performance optimizations that AVX provides. You should also ensure that your hypervisor has AVX enabled.
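As a quick sanity check, on a Linux host you can confirm that the CPU advertises AVX by inspecting /proc/cpuinfo (an illustrative check; it assumes a Linux host):

grep -q avx /proc/cpuinfo && echo "AVX supported" || echo "AVX not supported"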
The Speechmatics Docker images are obtainable from the Speechmatics Docker repository (jfrog.io). If you do not have a Speechmatics software repository account or have lost your details, please contact Speechmatics support at support@speechmatics.com.
The latest information about the containers can be found in the solutions section of the support portal. If a support account is not available or the Containers section is not visible in the support portal, please contact Speechmatics support at support@speechmatics.com for help.
Prior to pulling any Docker images, the following must be known:

- The Speechmatics Docker repository URL, along with your software repository username and password
- The language code of each image required (e.g. en for English or de for German)

After gaining access to the relevant details for the Speechmatics software repository, follow the steps below to log in and pull the Docker images that are required.
Ensure the Speechmatics Docker URL and software repository username and password are available, and that Docker is installed on the machine being used. Then log in to the repository. For example:
docker login https://speechmatics-docker-example.jfrog.io
You will be prompted for a username and password. If the login is successful, you will see the response:
Login Succeeded
If unsuccessful, please verify your credentials and URL. If problems persist, please contact Speechmatics support.
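For scripted or automated deployments, Docker's standard --password-stdin option can be used to avoid the interactive prompt (a sketch; the environment variable names are placeholders):

echo "$SPEECHMATICS_PASSWORD" | docker login https://speechmatics-docker-example.jfrog.io --username "$SPEECHMATICS_USERNAME" --password-stdin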
To pull a Docker image to the local environment, follow the instructions below. Each supported language pack comes as a different Docker image, so the process will need to be repeated for each required language.
Example pulling Global English (en) with the 1.0.0 TAG:
docker pull speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0
Example pulling Spanish (es) with the 1.0.0 TAG:
docker pull speechmatics-docker-example.jfrog.io/rt-asr-transcriber-es:1.0.0
The image will start to download. This could take a while depending on your connection speed.
Note: Speechmatics require all customers to cache a copy of the Docker image(s) within their own environment. Please do not pull directly from the Speechmatics software repository for each deployment.
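For example, a pulled image could be retagged and pushed to an internal registry (a sketch; registry.example.com is a placeholder for your own registry):

docker tag speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0 registry.example.com/rt-asr-transcriber-en:1.0.0
docker push registry.example.com/rt-asr-transcriber-en:1.0.0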
Once the Docker image has been pulled into a local environment, it can be started using the docker run command. More details about operating and managing the container are available in the Docker API documentation.
Here are a couple of examples of how to start the container from the command line:
docker run -p 9000:9000 -p 8001:8001 speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0
docker run -p 9000:9000 -p 8001:8001 speechmatics-docker-example.jfrog.io/rt-asr-transcriber-es:1.0.0
The docker run options used are:
| Name | Description |
|---|---|
| --publish, -p | Expose ports on the container so that they are accessible from the host |
See the Docker docs for a full list of available options.
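As an illustration, the container can also be run detached and given a name using standard Docker options (a sketch; the container name rt-asr-en is arbitrary):

docker run -d --name rt-asr-en -p 9000:9000 -p 8001:8001 speechmatics-docker-example.jfrog.io/rt-asr-transcriber-en:1.0.0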
The supported method for passing audio to a container is a WebSocket. A session is set up with configuration parameters passed in a StartRecognition message; thereafter, audio is sent to the container in binary chunks, and transcripts are returned in AddTranscript messages.
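As an illustration, a StartRecognition message takes the following general shape (a sketch only; the exact fields and supported values are defined in the Speech API guide, and the audio format shown here is an assumption):

{
  "message": "StartRecognition",
  "audio_format": {
    "type": "raw",
    "encoding": "pcm_s16le",
    "sample_rate": 16000
  },
  "transcription_config": {
    "language": "en"
  }
}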
In the AddTranscript message, individual result segments are returned, corresponding to audio segments defined by pauses (and other latency measurements). The results list in the V2 output format is sorted by increasing start_time, with a supplementary rule to sort by decreasing end_time (so when two results share a start time, the longer one comes first). Confidence is reported to a precision of 6 decimal places. See below for an example:
{
"message": "AddTranscript",
"format": "2.5",
"metadata": {
"transcript": "full tell radar",
"start_time": 0.11,
"end_time": 1.07
},
"results": [
{
"type": "word",
"start_time": 0.11,
"end_time": 0.40,
"alternatives": [
{ "content": "full", "confidence": 0.7 }
]
},
{
"type": "word",
"start_time": 0.41,
"end_time": 0.62,
"alternatives": [
{ "content": "tell", "confidence": 0.6 }
]
},
{
"type": "word",
"start_time": 0.65,
"end_time": 1.07,
"alternatives": [
{ "content":"radar", "confidence": 1.0 }
]
}
]
}