Sunday, April 15, 2018

Voice Interactions on AWS Lex + Google Dialogflow


Summary

In this post I'll discuss the audio capabilities of the bot frameworks on AWS and Google.  They take different approaches currently, though I think that's changing.  AWS Lex can fully process voice/audio in a single API call.  Google Dialogflow currently has a separation of concerns: it takes three API calls to process a voice input and provide a voice response.  Interestingly enough, execution times on both platforms are roughly the same.

Voice Interaction Flow - AWS Lex

The diagram below shows what it looks like to process a voice interaction on Lex.  It's really simple: a single API call (PostContent) can take audio as input and provide an audio bot response.  Lex hides the speech-to-text and text-to-speech details so the developer doesn't have to deal with them.  It's nice.


Code Snippet - AWS Lex

Below is a simple function for submitting audio in and receiving audio out.  The PostContent API call can process either text or audio.

    send(userId, request) {
        let params = {
            botAlias: '$LATEST',
            botName: BOT_NAME,
            userId: userId,
            inputStream: request
        };

        // Set content types based on the request type: text or audio
        switch (typeof request) {
            case 'string':
                params.contentType = 'text/plain; charset=utf-8';
                params.accept = 'text/plain; charset=utf-8';
                break;
            case 'object':
                // Raw 16-bit linear PCM at 16 kHz in; MP3 audio back
                params.contentType = 'audio/x-l16; sample-rate=16000';
                params.accept = 'audio/mpeg';
                break;
        }

        return new Promise((resolve, reject) => {
            this.runtime.postContent(params, (err, data) => {
                if (err) {
                    reject(err);
                }
                else if (data) {
                    let response = {'text': data.message};
                    switch (typeof request) {
                        case 'string':
                            response.audio = '';
                            break;
                        case 'object':
                            // Base64-encode the returned audio stream
                            response.audio = Buffer.from(data.audioStream).toString('base64');
                            break;
                    }
                    resolve(response);
                }
            });
        });
    }
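The text-vs-audio dispatch above hinges entirely on typeof the request.  As an illustration only (the function name below is mine, not from the post — the real code does this inline in send()), here is that content negotiation factored out on its own:

```javascript
// Illustration: the contentType/accept selection used by send(),
// pulled into a standalone function. A string request means text in,
// text out; anything else (e.g. a Buffer of raw PCM) means audio.
function negotiateContent(request) {
    if (typeof request === 'string') {
        return {
            contentType: 'text/plain; charset=utf-8',
            accept: 'text/plain; charset=utf-8'
        };
    }
    // 16-bit linear PCM at 16 kHz in; Lex returns MP3
    return {
        contentType: 'audio/x-l16; sample-rate=16000',
        accept: 'audio/mpeg'
    };
}

// Usage:
negotiateContent('hello');            // text/plain both directions
negotiateContent(Buffer.alloc(3200)); // audio/x-l16 in, audio/mpeg out
```

Note that a Buffer is typeof 'object', which is what makes the switch in send() work for audio input.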

Voice Interaction Flow - Google Dialogflow

The diagram below shows the current state of affairs with Dialogflow and voice processing.  Each function (speech-to-text, bot, text-to-speech) requires a separate API call.  At least, that's the way it is in the V1 Dialogflow API.  From what I can tell, V2 (beta) will allow for audio inputs.


Code Snippet - Google Dialogflow

Coding this up is more complicated than Lex, but nothing cosmic.  I wrote some wrapper functions around JavaScript fetch calls and then cascaded them via Promises, as you see below.
    send(request) {
        return new Promise((resolve, reject) => {
            switch (typeof request) {
                case 'string':
                    // Text in: just call the bot, no audio round trip
                    this._sendText(request)
                        .then(text => {
                            resolve({text: text, audio: ''});
                        })
                        .catch(err => {
                            console.error(err.message);
                            reject(err);
                        });
                    break;
                case 'object': {
                    // Audio in: speech-to-text, then the bot, then text-to-speech
                    let response = {};
                    this._stt(request)
                        .then(text => this._sendText(text))
                        .then(text => {
                            response.text = text;
                            return this._tts(text);
                        })
                        .then(audio => {
                            response.audio = audio;
                            resolve(response);
                        })
                        .catch(err => {
                            console.error(err.message);
                            reject(err);
                        });
                    break;
                }
            }
        });
    }
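The _stt, _sendText, and _tts wrappers aren't shown above, so as a rough sketch only, here is what _sendText might look like against the Dialogflow V1 /query REST endpoint.  The function and parameter names, the client access token handling, and the use of the global fetch (available in modern Node) are my assumptions, not the post's actual code:

```javascript
// Build the Dialogflow V1 /query payload. Pure function, so the
// request shape is easy to verify in isolation.
function buildQuery(text, sessionId) {
    return {
        query: text,         // the user's utterance
        lang: 'en',          // V1 requires an explicit language
        sessionId: sessionId // ties turns into one conversation
    };
}

// Sketch of a text-in/text-out wrapper around the V1 REST API,
// authorized with a V1 client access token.
function sendText(text, sessionId, clientToken) {
    return fetch('https://api.dialogflow.com/v1/query?v=20170712', {
        method: 'POST',
        headers: {
            'Authorization': `Bearer ${clientToken}`,
            'Content-Type': 'application/json'
        },
        body: JSON.stringify(buildQuery(text, sessionId))
    })
        .then(res => res.json())
        .then(json => json.result.fulfillment.speech); // the bot's reply text
}
```

The _stt and _tts wrappers would follow the same pattern against the Cloud Speech-to-Text and Text-to-Speech REST endpoints, which is exactly why the whole flow costs three round trips on Dialogflow V1.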

Results

I didn't expect this, but both platforms performed roughly equally even though multiple calls are necessary on Dialogflow.  For my simple bot example, I saw ~2-second execution times for audio in/out on both Lex and Dialogflow.

Copyright ©1993-2024 Joey E Whelan, All rights reserved.