I love services like Amazon Transcribe. They are the kind of just-futuristic-enough technology that excites my imagination the same way that magic does. It’s incredible that we have accurate, automatic speech recognition for a variety of languages and accents, in real-time. There are so many use-cases, and nearly all of them are intriguing. Until now, the Amazon Transcribe Streaming API available has been available using HTTP/2 streaming. Today, we’re adding WebSockets as another integration option for bringing real-time voice capabilities to the things you build.
If you are itching to see things in action, you can head directly to the demo, but I recommend taking a quick read through this post first.
What is Amazon Transcribe?
Amazon Transcribe applies machine learning models to convert speech in audio to text transcriptions. One of the most powerful features of Amazon Transcribe is the ability to perform real-time transcription of audio. Until now, this functionality has been available via HTTP/2 streams. Today, we’re announcing the ability to connect to Amazon Transcribe using WebSockets as well.
For real-time transcription, Amazon Transcribe currently supports British English (en-GB), US English (en-US), French (fr-FR), Canadian French (fr-CA), and US Spanish (es-US).
What are WebSockets?
WebSockets are a protocol built on top of TCP, like HTTP. While HTTP is great for short-lived requests, it hasn’t historically been good at handling situations that require persistent real-time communications. While an HTTP connection is normally closed at the end of the message, a WebSocket connection remains open. This means that messages can be sent bi-directionally with no bandwidth or latency added by handshaking and negotiating a connection. WebSocket connections are full-duplex, meaning that the server and client can both transmit data at the same time. They were also designed for cross-domain usage, so there’s no messing around with cross-origin resource sharing (CORS) as there is with HTTP.
HTTP/2 streams solve a lot of the issues that HTTP had with real-time communications, and the first Amazon Transcribe Streaming API available uses HTTP/2. WebSocket support opens Amazon Transcribe Streaming up to a wider audience, and makes integrations easier for developers that might have existing WebSocket-based integrations or knowledge.
How the Amazon Transcribe Streaming API Works
Transcribe uses AWS Signature Version 4 to authenticate requests. For WebSocket connections, use a pre-signed URL. With a pre-signed URL, all of the parameters are contained in the URL. This gives us an authenticated endpoint that we can use to establish our WebSocket.
All of the required parameters are included in our pre-signed URL as part of the query string. These are:
- language-code: The language code. One of en-US, en-GB, fr-FR, fr-CA, es-US.
- sample-rate: The sample rate of the audio, in Hz. Max of 16000 for en-US and es-US, and 8000 for the other languages.
- media-encoding: Currently only pcm is valid.
- vocabulary-name: Amazon Transcribe allows you to define custom vocabularies for uncommon or unique words that you expect to see in your data. To use a custom vocabulary, reference it here.
Audio Data Requirements
There are a few things that we need to know before we start sending data. First, Transcribe expects audio to be encoded as PCM data. The sample rate of a digital audio file relates to the quality of the captured audio. It is the number of times per second (Hz) that the analog signal is checked in order to generate the digital signal. For high-quality data, a sample rate of 16,000 Hz or higher is recommended. For lower-quality audio, such as a phone conversation, use a sample rate of 8,000 Hz. Currently, US English (en-US) and US Spanish (es-US) support sample rates up to 48,000 Hz. Other languages support rates up to 16,000 Hz.
In our demo, the file
lib/audioUtils.js contains a
downsampleBuffer() function for reducing the sample rate of the incoming audio bytes from the browser, and a
pcmEncode() function that takes the raw audio bytes and converts them to PCM.
Once we’ve got our audio encoding as PCM data with the right sample rate, we need to wrap it in an envelope before we send it across the WebSocket connection. Each message consists of three headers, followed by the PCM-encoded audio bytes in the message body. The entire message is then encoded as a binary event stream message and sent. If you’ve used the HTTP/2 API before, there’s one difference that I think makes using WebSockets a bit more straightforward, which is that you don’t need to cryptographically sign each chunk of audio data you send.
The messages we receive follow the same general format: they are binary-encoded event stream messages, with three headers and a body. But instead of audio bytes, the message body contains a
Transcript object. Partial responses are returned until a natural stopping point in the audio is determined. For more details on how this response is formatted, check out the docs and have a look at the
handleEventStreamMessage() function in
Let’s See the Demo!
Now that we’ve got some context, let’s try out a demo. I’ve deployed it using AWS Amplify Console – take a look, or push the button to deploy your own copy. Enter the Access ID and Secret Key for the IAM User you authorized earlier, hit the Start button, and speak into your microphone.
The complete project is available on GitHub. The most important file is
lib/main.js. This file defines all our required dependencies, wires up the buttons and form fields in
index.html, accesses the microphone stream, and pushes the data to Transcribe over the WebSocket. The code has been thoroughly commented and will hopefully be easy to understand, but if you have questions, feel free to open issues on the GitHub repo and I’ll be happy to help. I’d like to extend a special thanks to Karan Grover, Software Development Engineer on the Transcribe team, for providing the code that formed that basis of this demo.