Sunday, April 15, 2018

Voice Interactions on AWS Lex + Google Dialogflow


Summary

In this post I'll discuss the audio capabilities of the bot frameworks in AWS and Google.  They take different approaches currently, though I think that's changing.  AWS Lex is fully capable of processing voice/audio in a single API call.  Google Dialogflow currently separates the concerns: it takes three API calls to process a voice input and return a voice response.  Interestingly enough, execution time on both platforms is roughly the same.

Voice Interaction Flow - AWS Lex

The diagram below shows what a voice interaction looks like on Lex.  It's really simple: a single API call (PostContent) can take audio as input and return an audio bot response.  Lex buries the speech-to-text and text-to-speech details so the developer doesn't have to deal with them.  It's nice.


Code Snippet - AWS Lex

Below is a simple function for submitting audio in and receiving audio out.  The PostContent API call can process either text or audio.

// Submit a text string or audio buffer to Lex; returns a Promise resolving to
// {text, audio}. Assumes this.runtime is an AWS.LexRuntime client and BOT_NAME
// is defined elsewhere.
send(userId, request) {
  let params = {
    botAlias: '$LATEST',
    botName: BOT_NAME,
    userId: userId,
    inputStream: request
  };

  // Set the request/response content types based on input type (text vs. audio).
  switch (typeof request) {
    case 'string':
      params.contentType = 'text/plain; charset=utf-8';
      params.accept = 'text/plain; charset=utf-8';
      break;
    case 'object':
      params.contentType = 'audio/x-l16; sample-rate=16000';
      params.accept = 'audio/mpeg';
      break;
  }

  return new Promise((resolve, reject) => {
    this.runtime.postContent(params, (err, data) => {
      if (err) {
        reject(err);
      }
      else if (data) {
        // Lex always returns the reply text; audio comes back only for audio requests.
        let response = {'text' : data.message};
        switch (typeof request) {
          case 'string':
            response.audio = '';
            break;
          case 'object':
            response.audio = Buffer.from(data.audioStream).toString('base64');
            break;
        }
        resolve(response);
      }
    });
  });
}
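
For context, here's a minimal sketch of how that method might be wired up and called.  The class name, region, and BOT_NAME value are placeholders of mine, not the post's actual code; it assumes the aws-sdk package with credentials already configured.

// Minimal sketch only: instantiating the Lex runtime client used by send().
// 'OrderFlowers' and 'us-east-1' are placeholder values.
const AWS = require('aws-sdk');
const BOT_NAME = 'OrderFlowers';

class LexVoiceClient {
  constructor() {
    // AWS.LexRuntime exposes the PostContent operation.
    this.runtime = new AWS.LexRuntime({region: 'us-east-1'});
  }

  send(userId, request) { /* as shown above */ }
}

// A text round trip resolves to {text: '...', audio: ''}; an audio Buffer
// would resolve with base64 MP3 in response.audio.
new LexVoiceClient().send('user-123', 'I would like to order flowers')
  .then(response => console.log(response.text))
  .catch(err => console.error(err));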

Voice Interaction Flow - Google Dialogflow

The diagram below shows the current state of affairs with Dialogflow and voice processing.  Each function (speech-to-text, bot, text-to-speech) requires a separate API call.  At least that's the way it is in the V1 Dialogflow API.  From what I can tell, V2 (beta) will allow for audio inputs.
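
As a point of reference, a V2 (beta) detectIntent request can carry the audio directly.  Here's a rough sketch of what that might look like over REST; the project ID, session ID, and token handling below are placeholders of mine, and the beta details could change.

// Rough sketch of a Dialogflow V2 (beta) detectIntent call with audio input.
// PROJECT_ID, SESSION_ID, and accessToken are placeholders.
const fetch = require('node-fetch');
const PROJECT_ID = 'my-project';
const SESSION_ID = 'session-123';

function detectIntentFromAudio(audioBuffer, accessToken) {
  const url = `https://dialogflow.googleapis.com/v2beta1/projects/${PROJECT_ID}` +
    `/agent/sessions/${SESSION_ID}:detectIntent`;
  return fetch(url, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${accessToken}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      queryInput: {
        audioConfig: {
          audioEncoding: 'AUDIO_ENCODING_LINEAR_16',
          sampleRateHertz: 16000,
          languageCode: 'en-US'
        }
      },
      inputAudio: audioBuffer.toString('base64')  // audio travels in the same request
    })
  })
  .then(res => res.json())
  .then(json => json.queryResult.fulfillmentText);
}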


Code Snippet - Google Dialogflow

Coding this up is more complicated than it is with Lex, but nothing cosmic.  I wrote some wrapper functions around JavaScript Fetch calls and then cascaded them via Promises, as you can see below.

// Submit a text string or audio buffer to Dialogflow; returns a Promise resolving
// to {text, audio}. _stt, _sendText, and _tts are wrappers around the three
// Google API calls.
send(request) {
  return new Promise((resolve, reject) => {
    switch (typeof request) {
      case 'string':
        // Text in, text out: a single call to the bot.
        this._sendText(request)
        .then(text => {
          let response = {};
          response.text = text;
          response.audio = '';
          resolve(response);
        })
        .catch(err => {
          console.error(err.message);
          reject(err);
        });
        break;
      case 'object': {
        // Audio in, audio out: cascade speech-to-text -> bot -> text-to-speech.
        let response = {};
        this._stt(request)
        .then((text) => {
          return this._sendText(text);
        })
        .then((text) => {
          response.text = text;
          return this._tts(text);
        })
        .then((audio) => {
          response.audio = audio;
          resolve(response);
        })
        .catch(err => {
          console.error(err.message);
          reject(err);
        });
        break;
      }
    }
  });
}
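
The wrapper functions themselves aren't shown above, so here's a rough sketch of what _stt, _sendText, and _tts could look like with node-fetch against Cloud Speech-to-Text, the Dialogflow V1 query endpoint, and Cloud Text-to-Speech.  The API key, client access token, session ID, and voice settings are my own placeholders, not the post's actual code.

// Rough sketch only -- not the post's actual wrappers. The apiKey,
// clientAccessToken, and sessionId values are supplied by the caller.
const fetch = require('node-fetch');

class DialogflowClient {
  constructor(apiKey, clientAccessToken, sessionId) {
    this.apiKey = apiKey;                        // Google Cloud API key (Speech/TTS)
    this.clientAccessToken = clientAccessToken;  // Dialogflow V1 client access token
    this.sessionId = sessionId;
  }

  // 1. Speech-to-text: Cloud Speech API, audio buffer in, transcript out.
  _stt(audioBuffer) {
    return fetch(`https://speech.googleapis.com/v1/speech:recognize?key=${this.apiKey}`, {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({
        config: {encoding: 'LINEAR16', sampleRateHertz: 16000, languageCode: 'en-US'},
        audio: {content: audioBuffer.toString('base64')}
      })
    })
    .then(res => res.json())
    .then(json => json.results[0].alternatives[0].transcript);
  }

  // 2. Bot: Dialogflow V1 query endpoint, text in, bot reply text out.
  _sendText(text) {
    return fetch('https://api.dialogflow.com/v1/query?v=20150910', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${this.clientAccessToken}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({query: text, lang: 'en', sessionId: this.sessionId})
    })
    .then(res => res.json())
    .then(json => json.result.fulfillment.speech);
  }

  // 3. Text-to-speech: Cloud Text-to-Speech API, text in, base64 audio out.
  _tts(text) {
    return fetch(`https://texttospeech.googleapis.com/v1beta1/text:synthesize?key=${this.apiKey}`, {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({
        input: {text: text},
        voice: {languageCode: 'en-US', ssmlGender: 'FEMALE'},
        audioConfig: {audioEncoding: 'MP3'}
      })
    })
    .then(res => res.json())
    .then(json => json.audioContent);
  }

  // send(request) { ... as shown above ... }
}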

Results

I didn't expect this, but the two platforms performed roughly equally even though Dialogflow requires multiple calls.  For my simple bot example, I saw execution times of roughly 2 seconds for audio in/out on both Lex and Dialogflow.

Copyright ©1993-2024 Joey E Whelan, All rights reserved.