Summary
In this post I'll discuss the audio capabilities of the bot frameworks from AWS and Google. The two currently take different approaches, though I think that's changing. AWS Lex can process voice/audio in a single API call. Google Dialogflow currently separates concerns: it takes three API calls to process a voice input and return a voice response. Interestingly enough, execution time on both platforms is roughly the same.
Voice Interaction Flow - AWS Lex
The diagram below shows what it takes to process a voice interaction on Lex. It's really simple. A single API call (PostContent) takes audio as input and returns an audio bot response. Lex hides the speech-to-text and text-to-speech details so the developer doesn't have to deal with them. It's nice.
Code Snippet - AWS Lex
Below is a simple function that submits audio and receives audio back. The PostContent API call can process text or audio.
// Class method. this.runtime is the Lex runtime client and BOT_NAME is
// defined elsewhere in the module.
send(userId, request) {
    let params = {
        botAlias: '$LATEST',
        botName: BOT_NAME,
        userId: userId,
        inputStream: request
    };
    // A string is sent as text; anything else (e.g. a Buffer) is sent as audio
    switch (typeof request) {
        case 'string':
            params.contentType = 'text/plain; charset=utf-8';
            params.accept = 'text/plain; charset=utf-8';
            break;
        case 'object':
            params.contentType = 'audio/x-l16; sample-rate=16000';
            params.accept = 'audio/mpeg';
            break;
    }
    return new Promise((resolve, reject) => {
        this.runtime.postContent(params, (err, data) => {
            if (err) {
                reject(err);
            }
            else if (data) {
                let response = {'text' : data.message};
                switch (typeof request) {
                    case 'string':
                        // Text in, text out: no audio to return
                        response.audio = '';
                        break;
                    case 'object':
                        // Return the audio reply as a base64 string
                        response.audio = Buffer.from(data.audioStream).toString('base64');
                        break;
                }
                resolve(response);
            }
        });
    });
}
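Since send() returns the audio reply as a base64 string, the caller has to decode it before saving or playing it. A minimal sketch of that step, assuming Node.js; decodeAudio is a hypothetical helper name of mine, not part of the Lex SDK, and the sample bytes are stand-ins rather than a real Lex response.

```javascript
// Hypothetical helper: turns the base64 `audio` field produced by send()
// back into raw bytes. An empty string (text-only exchange) yields null.
function decodeAudio(response) {
    if (!response.audio) {
        return null;
    }
    return Buffer.from(response.audio, 'base64');
}

// Example with stand-in bytes instead of a real Lex MP3 stream
const fake = {
    text: 'Hello',
    audio: Buffer.from([0x49, 0x44, 0x33]).toString('base64')
};
const bytes = decodeAudio(fake);
console.log(bytes.length); // 3
```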
Voice Interaction Flow - Google Dialogflow
The diagram below shows the current state of affairs with Dialogflow and voice processing. Each function (speech-to-text, bot, text-to-speech) requires a separate API call. At least that's the way it is in the V1 Dialogflow API. From what I can tell, V2 (beta) will accept audio input.
Code Snippet - Google Dialogflow
Coding this up is more complicated than Lex, but nothing cosmic. I wrote some wrapper functions around JavaScript Fetch calls and then chained them via Promises, as you see below.
send(request) {
    return new Promise((resolve, reject) => {
        switch (typeof request) {
            case 'string':
                // Text in: one call to the bot, no audio returned
                this._sendText(request)
                .then(text => {
                    let response = {};
                    response.text = text;
                    response.audio = '';
                    resolve(response);
                })
                .catch(err => {
                    console.error(err.message);
                    reject(err);
                });
                break;
            case 'object': {
                // Audio in: speech-to-text, then bot, then text-to-speech
                let response = {};
                this._stt(request)
                .then((text) => {
                    return this._sendText(text);
                })
                .then((text) => {
                    response.text = text;
                    return this._tts(text);
                })
                .then((audio) => {
                    response.audio = audio;
                    resolve(response);
                })
                .catch(err => {
                    console.error(err.message);
                    reject(err);
                });
                break;
            }
            default:
                // Without this, an unsupported input would leave the Promise pending
                reject(new Error('Unsupported request type: ' + typeof request));
        }
    });
}
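The same three-call cascade can also be written with async/await, which some readers find easier to follow. This is only a sketch: voiceRoundTrip is a name I made up, and the stt, bot and tts functions are hypothetical stand-ins for the _stt, _sendText and _tts wrappers, stubbed here so nothing touches the network.

```javascript
// Sketch: same cascade as send(), with the three steps passed in as
// functions. Each one must return a Promise.
async function voiceRoundTrip(audioIn, { stt, bot, tts }) {
    const heard = await stt(audioIn);   // call 1: speech-to-text
    const reply = await bot(heard);     // call 2: bot query
    const audio = await tts(reply);     // call 3: text-to-speech
    return { text: reply, audio };
}

// Stubbed usage, no real API calls involved
const stubs = {
    stt: async () => 'hello',
    bot: async (text) => `you said ${text}`,
    tts: async (text) => Buffer.from(text).toString('base64')
};
voiceRoundTrip(Buffer.alloc(4), stubs)
    .then((r) => console.log(r.text))
    .catch((err) => console.error(err.message));
```

Passing the three steps in as arguments keeps the cascade testable without credentials; a failure in any step rejects the whole round trip, just as the single catch in send() does.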
Results
I didn't expect this, but both platforms performed about equally, even though Dialogflow needs multiple calls. For my simple bot example, I saw roughly 2-second execution times for audio in/out on both Lex and Dialogflow.
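For anyone who wants to repeat the comparison, here is one way the round-trip time of a Promise-returning call could be measured in Node.js. It's a sketch, not necessarily how the numbers above were taken; timed and fakeSend are hypothetical names, and fakeSend merely simulates a bot call with a timer.

```javascript
// Hypothetical helper: resolves with the wrapped call's result and the
// elapsed wall-clock time in milliseconds.
async function timed(fn) {
    const start = process.hrtime.bigint();
    const result = await fn();
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    return { result, ms };
}

// Stand-in for a bot call: resolves after about 50 ms
const fakeSend = () => new Promise((resolve) => setTimeout(() => resolve('ok'), 50));

timed(fakeSend).then(({ result, ms }) => {
    console.log(result, ms.toFixed(0) + ' ms');
});
```

In real use, fn would be something like () => bot.send(audioBuffer), where bot is whichever client object holds the send() method.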
Copyright ©1993-2024 Joey E Whelan, All rights reserved.