Summary
In this post I'll discuss the audio capabilities of the AWS and Google bot frameworks. They currently take different approaches, though I think that's changing. AWS Lex is fully capable of processing voice/audio in a single API call. Google Dialogflow currently has a separation of concerns: it takes three API calls to process a voice input and provide a voice response. Interestingly enough, execution time on both platforms is roughly the same.
Voice Interaction Flow - AWS Lex
The diagram below shows what a voice interaction looks like on Lex. It's really simple: a single API call (PostContent) can take audio as input and return an audio bot response. Lex buries the speech-to-text and text-to-speech details so the developer doesn't have to deal with them. It's nice.
Code Snippet - AWS Lex
Below is a simple function for submitting audio in and receiving audio out. The PostContent API call can process either text or audio.
    send(userId, request) {
        let params = {
            botAlias: '$LATEST',
            botName: BOT_NAME,
            userId: userId,
            inputStream: request
        };

        // Set the request/response MIME types based on whether the input is text or audio.
        switch (typeof request) {
            case 'string':
                params.contentType = 'text/plain; charset=utf-8';
                params.accept = 'text/plain; charset=utf-8';
                break;
            case 'object':
                params.contentType = 'audio/x-l16; sample-rate=16000';
                params.accept = 'audio/mpeg';
                break;
        }

        return new Promise((resolve, reject) => {
            this.runtime.postContent(params, (err, data) => {
                if (err) {
                    reject(err);
                }
                else if (data) {
                    let response = {'text': data.message};
                    switch (typeof request) {
                        case 'string':
                            response.audio = '';
                            break;
                        case 'object':
                            // Audio responses arrive as a byte stream; base64-encode for transport.
                            response.audio = Buffer.from(data.audioStream).toString('base64');
                            break;
                    }
                    resolve(response);
                }
            });
        });
    }
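The content-type switching above can be factored into a small pure helper, which also makes it easy to unit test. This is just a sketch with names of my own choosing (buildLexParams is not part of the AWS SDK); the MIME types are the same ones used in the snippet above for raw 16 kHz PCM in and MP3 out.

```javascript
// Sketch: build PostContent params for either text or binary audio input.
// buildLexParams is a hypothetical helper, not part of the AWS SDK.
function buildLexParams(botName, userId, request) {
    const params = {
        botAlias: '$LATEST',
        botName: botName,
        userId: userId,
        inputStream: request
    };
    if (typeof request === 'string') {
        // Plain text in, plain text back.
        params.contentType = 'text/plain; charset=utf-8';
        params.accept = 'text/plain; charset=utf-8';
    } else {
        // Raw 16-bit linear PCM at 16 kHz in, MP3 audio back.
        params.contentType = 'audio/x-l16; sample-rate=16000';
        params.accept = 'audio/mpeg';
    }
    return params;
}
```

The resulting object can be passed straight to the SDK's postContent call, as in the snippet above.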
Voice Interaction Flow - Google Dialogflow
The diagram shows the current state of affairs with Dialogflow and voice processing. Each function (speech-to-text, bot, text-to-speech) requires a separate API call. At least that's the way it is in the V1 Dialogflow API. From what I can tell, V2 (beta) will allow for audio inputs.
Code Snippet - Google Dialogflow
Coding this up is more complicated than Lex, but nothing cosmic. I wrote some wrapper functions around JavaScript fetch calls and then cascaded them via Promises, as you can see below.
    send(request) {
        return new Promise((resolve, reject) => {
            switch (typeof request) {
                case 'string':
                    // Text input: a single Dialogflow call, no audio to return.
                    this._sendText(request)
                    .then(text => {
                        let response = {};
                        response.text = text;
                        response.audio = '';
                        resolve(response);
                    })
                    .catch(err => {
                        console.error(err.message);
                        reject(err);
                    });
                    break;
                case 'object': {
                    // Audio input: speech-to-text, then the bot, then text-to-speech.
                    let response = {};
                    this._stt(request)
                    .then((text) => {
                        return this._sendText(text);
                    })
                    .then((text) => {
                        response.text = text;
                        return this._tts(text);
                    })
                    .then((audio) => {
                        response.audio = audio;
                        resolve(response);
                    })
                    .catch(err => {
                        console.error(err.message);
                        reject(err);
                    });
                    break;
                }
            }
        });
    }
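For reference, here's roughly what the request bodies behind the _stt and _tts wrappers look like. The JSON shapes below follow the Google Cloud Speech-to-Text and Text-to-Speech v1 REST APIs, but the builder functions are hypothetical names of my own, and the voice and encoding choices are just example assumptions.

```javascript
// Sketch: request payloads for the Google Cloud Speech v1 REST endpoints.
// Builder names are hypothetical; the JSON shapes follow the v1 APIs.
function buildSttRequest(audioBase64) {
    return {
        config: {
            encoding: 'LINEAR16',        // raw 16-bit PCM (assumption)
            sampleRateHertz: 16000,
            languageCode: 'en-US'
        },
        audio: { content: audioBase64 }  // base64-encoded audio bytes
    };
}

function buildTtsRequest(text) {
    return {
        input: { text: text },
        voice: { languageCode: 'en-US', ssmlGender: 'FEMALE' },
        audioConfig: { audioEncoding: 'MP3' }
    };
}
```

A _stt wrapper would POST the first payload to https://speech.googleapis.com/v1/speech:recognize and a _tts wrapper the second to https://texttospeech.googleapis.com/v1/text:synthesize, each authenticated with an API key or OAuth token.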
Results
I didn't expect this, but the two platforms performed roughly equally even though Dialogflow requires multiple calls. For my simple bot example, I saw ~2-second execution times for audio in/out on both Lex and Dialogflow.
Copyright ©1993-2024 Joey E Whelan, All rights reserved.