Summary
In this post I'll discuss the audio capabilities of the bot frameworks from AWS and Google. They currently take different approaches, though I think that's changing. AWS Lex can process voice/audio in a single API call. Google Dialogflow currently separates the concerns: it takes three API calls to process a voice input and return a voice response. Interestingly enough, execution time on both platforms is roughly the same.
Voice Interaction Flow - AWS Lex
The diagram below shows what it takes to process a voice interaction on Lex. It's really simple: a single API call (PostContent) takes audio as input and returns an audio bot response. Lex hides the speech-to-text and text-to-speech details so the developer doesn't have to deal with them. It's nice.
Code Snippet - AWS Lex
Below is a simple function for submitting audio and receiving audio back. The PostContent API call can process text or audio.
send(userId, request) {
  let params = {
    botAlias: '$LATEST',
    botName: BOT_NAME,
    userId: userId,
    inputStream: request
  };

  // A string is sent as text; anything else (an audio buffer) as 16 kHz PCM.
  switch (typeof request) {
    case 'string':
      params.contentType = 'text/plain; charset=utf-8';
      params.accept = 'text/plain; charset=utf-8';
      break;
    case 'object':
      params.contentType = 'audio/x-l16; sample-rate=16000';
      params.accept = 'audio/mpeg';
      break;
  }

  return new Promise((resolve, reject) => {
    this.runtime.postContent(params, (err, data) => {
      if (err) {
        reject(err);
      } else if (data) {
        let response = {'text' : data.message};
        // Audio requests get the MP3 reply back as a base64 string.
        switch (typeof request) {
          case 'string':
            response.audio = '';
            break;
          case 'object':
            response.audio = Buffer.from(data.audioStream).toString('base64');
            break;
        }
        resolve(response);
      }
    });
  });
}
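Since the function above returns the bot's audio as a base64 string (requested as audio/mpeg), a caller will usually want to decode it before playing or saving it. Here is a minimal sketch of that step. The helper name saveAudio, the output file name, and the stand-in response object are my assumptions for illustration and are not part of the original code.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not from the original post): decodes the base64
// audio in a send() response and writes the raw bytes to a file.
// Returns the number of bytes written, or 0 for a text-only response.
function saveAudio(response, filePath) {
  if (!response.audio) {
    return 0; // text requests come back with an empty audio string
  }
  const bytes = Buffer.from(response.audio, 'base64');
  fs.writeFileSync(filePath, bytes);
  return bytes.length;
}

// Stand-in response for illustration; a real one would come from send().
const fakeResponse = {
  text: 'Hello',
  audio: Buffer.from([1, 2, 3]).toString('base64')
};
const outFile = path.join(os.tmpdir(), 'lex-reply.mp3');
console.log(saveAudio(fakeResponse, outFile)); // prints 3
```

With a real response the call would look something like bot.send(userId, audioBuffer).then(r => saveAudio(r, 'reply.mp3')), where bot is whatever object carries the send method.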
Voice Interaction Flow - Google Dialogflow
The diagram below shows the current state of affairs with Dialogflow and voice processing. Each function (speech-to-text, bot, text-to-speech) requires a separate API call. At least that's the way it is in the V1 Dialogflow API. From what I can tell, V2 (beta) will allow audio inputs.
Code Snippet - Google Dialogflow
Coding this up is more complicated than Lex, but nothing cosmic. I wrote some wrapper functions around JavaScript Fetch calls and then chained them via Promises, as you see below.
send(request) {
  return new Promise((resolve, reject) => {
    switch (typeof request) {
      case 'string':
        // Text in: a single call to the bot, no audio in the response.
        this._sendText(request)
        .then(text => {
          let response = {};
          response.text = text;
          response.audio = '';
          resolve(response);
        })
        .catch(err => {
          console.error(err.message);
          reject(err);
        });
        break;
      case 'object': {
        // Audio in: speech-to-text, then the bot, then text-to-speech.
        let response = {};
        this._stt(request)
        .then((text) => {
          return this._sendText(text);
        })
        .then((text) => {
          response.text = text;
          return this._tts(text);
        })
        .then((audio) => {
          response.audio = audio;
          resolve(response);
        })
        .catch(err => {
          console.error(err.message);
          reject(err);
        });
        break;
      }
    }
  });
}
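The same three-step cascade can also be written with async/await, which some readers may find easier to follow. This is only a sketch, not the code I ran. To keep it runnable without Google credentials, the three steps are passed in as plain functions (stt, sendText, tts) with fake implementations, and the TypeError for unsupported input is an addition of mine.

```javascript
// Hypothetical async/await variant of the cascade. The step functions are
// injected so the sketch runs standalone; in the original they are the
// _stt, _sendText and _tts wrapper methods.
async function send(request, { stt, sendText, tts }) {
  if (typeof request === 'string') {
    const text = await sendText(request);
    return { text, audio: '' };
  }
  if (request !== null && typeof request === 'object') {
    const heard = await stt(request);    // 1. speech-to-text
    const text = await sendText(heard);  // 2. bot query
    const audio = await tts(text);       // 3. text-to-speech
    return { text, audio };
  }
  // Assumption: reject anything that is neither text nor an audio object.
  throw new TypeError('request must be a string or an audio object');
}

// Stand-in steps for illustration only; they make no network calls.
const steps = {
  stt: async (audio) => 'hello',
  sendText: async (text) => `You said: ${text}`,
  tts: async (text) => Buffer.from(text).toString('base64')
};

send(Buffer.from([0, 1]), steps).then((r) => console.log(r.text));
// prints "You said: hello"
```

Errors from any of the three steps surface as a rejected promise from send, so a single catch on the caller's side covers the whole chain.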
Results
I didn't expect this, but both platforms performed about equally, even though Dialogflow needs multiple calls. For my simple bot example, I saw execution times of roughly 2 seconds for audio in/out on both Lex and Dialogflow.
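For anyone who wants to take a comparable measurement, here is one way to time a promise-returning call in Node. It is a sketch under my own assumptions: the helper name timed and the fake 50 ms call are illustrative, and this is not necessarily how the numbers above were collected.

```javascript
const { performance } = require('perf_hooks');

// Hypothetical helper: awaits fn() and reports the elapsed wall-clock
// time in milliseconds alongside the result.
async function timed(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Stand-in for a bot call; swap in something like () => bot.send(audio).
const fakeCall = () =>
  new Promise((resolve) => setTimeout(() => resolve('ok'), 50));

timed(fakeCall).then(({ result, ms }) => {
  console.log(`${result} in ${ms.toFixed(0)} ms`);
});
```

Averaging over several runs gives a steadier figure, since network latency varies from call to call.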
Copyright ©1993-2024 Joey E Whelan, All rights reserved.