Sunday, July 5, 2015

Speech Transcription with IBM Watson


Summary

This is a continuation of the article I posted on converting call recordings into email attachments.  As a review, I used some Node.js modules to implement a simple web server that would accept an HTTP POST from a VXML application that recorded a message from a caller.  That recorded message was then sent as an email attachment.

I'm going to expand that example further in this article with transcriptions of recorded messages using the IBM Watson Developer Cloud.  IBM has exposed a number of nifty (and complex) services to the developer community via REST APIs.  Speech to Text is just one of those services, and it was only recently released as GA.

Accessing Watson

IBM has published some pretty decent documentation on this cloud offering.  Getting signed up on their provisioning environment (Bluemix) is similarly painless.  The Speech to Text service can be accessed via a REST API documented here.  Additionally, there is a Node.js wrapper module for that API available here.  I'll be using the Node module exclusively in this article.

Below is the code I used for some basic testing:


'use strict';

var watson = require('watson-developer-cloud');
var fs = require('fs');


var speech_to_text = watson.speech_to_text({
  username: 'yourUserName',
  password: 'yourPassword',
  version: 'v1',
  url: 'https://stream.watsonplatform.net/speech-to-text/api'
});

var params = {
  audio: fs.createReadStream('./toEleven.flac'),
  content_type: 'audio/flac',
  continuous: 'true'
};

speech_to_text.recognize(params, function(err, recEvent) {
  if (err) {
    console.log('error was returned');
    console.log(err);  
  }
  else {
    console.log(JSON.stringify(recEvent, null, 2));   
  }
});

Line 3:  This is the Node wrapper module mentioned previously.
Lines 7-12:  Environment setup for the service.
Lines 14-18:  Setting up the parameters to be passed to the service.  The engine currently supports FLAC, L16, and, just recently, WAV.  More on that .wav support later.
Lines 20-30:  Passes the previously configured parameters to Watson and, via a callback, returns either an error or an object containing the transcription.

Testing Watson

I ran a couple of tests.  The first test was an audio clip from the movie This is Spinal Tap - the scientifically ground-breaking 'These go to eleven' scene (I converted the MP3 file to FLAC).

Below are the results.
{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.8842840194702148,
          "transcript": "what we do is if we need that extra push over the cliff you know we do put up to eleven exactly one now why don't you just make ten louder and make ten be the top "
        }
      ],
      "final": true
    },
    {
      "alternatives": [
        {
          "confidence": 0.8642632365226746,
          "transcript": "number and make that a little out "
        }
      ],
      "final": true
    },
    {
      "alternatives": [
        {
          "confidence": 0.7872022390365601,
          "transcript": "these go to eleven "
        }
      ],
      "final": true
    }
  ],
  "result_index": 0
}

Watson actually did a respectable job with this sound clip.  Keep in mind, there are two people talking here, one of whom has a fairly pronounced accent.

The results of the second test were less inspiring.  For this test, I used the opening 16 seconds of lyrics from the Mötley Crüe hit "Girls, Girls, Girls."


Results below.

{
  "results": [
    {
      "alternatives": [
        {
          "confidence": 1,
          "transcript": "nnnnn "
        }
      ],
      "final": true
    },
    {
      "alternatives": [
        {
          "confidence": 0.9949952960014343,
          "transcript": "nnnnn "
        }
      ],
      "final": true
    },
    {
      "alternatives": [
        {
          "confidence": 1,
          "transcript": "nnnnn "
        }
      ],
      "final": true
    },
    {
      "alternatives": [
        {
          "confidence": 0.36961859464645386,
          "transcript": "wned "
        }
      ],
      "final": true
    },
    {
      "alternatives": [
        {
          "confidence": 0.9999980330467224,
          "transcript": "nnnnn "
        }
      ],
      "final": true
    }
  ],
  "result_index": 0
}

Clearly, Watson had issues here.  Its acoustic models don't include support for music.  Or, one could say Watson can handle '80s hair band conversations, but not hair band music itself.  Armed with that crucial bit of information, I moved on to a pseudo-production implementation.

Implementation

The overall architecture of this application remains the same.  Figure 1 below depicts the addition of Watson to the application flow.

Figure 1

As mentioned previously, Watson's Speech to Text service was only recently released as GA.  More important for this particular exercise, support for an acoustic model matching voice recording formats was added just in the past few days.  Previously, Watson only supported 16 kHz audio.  Now, the REST API also supports 8 kHz audio, which is the standard sampling rate for telephone voice recordings.
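To illustrate what that looks like at the REST level, below is a minimal sketch of calling the recognize endpoint directly over HTTPS with the narrowband model selected via the 'model' query parameter.  This bypasses the Node wrapper entirely; the file name and credentials are placeholders, and details such as chunked uploads may vary by environment, so treat it as an assumption-laden example rather than production code.

'use strict';

var https = require('https');
var fs = require('fs');

// Direct call to the sessionless recognize endpoint, selecting the 8 kHz model
// via the 'model' query parameter.
var options = {
  host: 'stream.watsonplatform.net',
  path: '/speech-to-text/api/v1/recognize?model=en-US_NarrowbandModel&continuous=true',
  method: 'POST',
  auth: 'yourUserName:yourPassword',            // basic auth credentials from Bluemix
  headers: { 'Content-Type': 'audio/wav' }
};

var req = https.request(options, function(res) {
  var body = '';
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() { console.log(body); });  // JSON results, same shape as shown earlier
});

req.on('error', function(err) { console.log(err); });

// './message.wav' is a placeholder for an 8 kHz telephony recording.
fs.createReadStream('./message.wav').pipe(req);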

The Node module for Watson has not kept pace with these REST API changes.  That's understandable given how recently the changes were implemented, and I suspect it will be corrected very soon.  In the meantime, it's an easy fix to modify the existing Watson module source to allow support of voice recordings.

The v1.js file within the Watson module is where the source modification needs to occur.  Specifically, the 'recognize' method needs to be modified so that it passes the 'model' query parameter through to the Watson REST service.

SpeechToText.prototype.recognize = function(params, callback) {

  var missingParams = helper.getMissingParams(params, ['audio', 'content_type']);
  if (missingParams) {
    callback(new Error('Missing required parameters: ' + missingParams.join(', ')));
    return;
  }
  if (!isStream(params.audio)) {
    callback(new Error('audio is not a standard Node.js Stream'));
    return;
  }

  var queryParams = pick(params, ['continuous', 'max_alternatives', 'timestamps',
    'word_confidence','inactivity_timeout','model']);

Line 14:  This is all that is necessary.  Just add 'model' to this array.
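With that one-line change in place, the earlier test snippet (same setup and credentials, loading the patched module) can pass the narrowband model along with the other parameters.  The file name here is just a placeholder for an 8 kHz recording:

// Assumes the same 'fs' and 'speech_to_text' setup as the first example,
// with the patched watson-developer-cloud module on disk.
var params = {
  audio: fs.createReadStream('./voicemail.wav'),  // placeholder 8 kHz telephony recording
  content_type: 'audio/wav',
  model: 'en-US_NarrowbandModel',   // now forwarded as a query parameter by the patched module
  continuous: 'true'
};

speech_to_text.recognize(params, function(err, recEvent) {
  if (err) {
    console.log(err);
  }
  else {
    console.log(JSON.stringify(recEvent, null, 2));
  }
});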


Below is the main code body with the modifications necessary to support speech to text.  The modified section is described line-by-line after the listing.

     // Interface for POSTing a recording and generating an email notification
        appHttp.post('/upload', function(req, res) {
          try {
            logger.debug('Entering - File: main.js, Method: appHttp.post()');
                        
            var form = new multiparty.Form();
            var ani = null;
            var dnis = null;
            var fname = null;
            var msg = null;
            var size = 0;
                        
            form.on('error', function(err, statCode) {
              logger.error('File: main.js, Method: appHttp.post(), form(), Error: ' + err.message);
              res.status(statCode || 400).end();
            });
                        
            form.on('part', function(part) {
              var data=[];
                          
              part.on('error', function(err, statCode) {
                form.emit('error', err, statCode);
              });
                          
              part.on('data', function(chunk) {
                size += chunk.length;
                if (size > properties.maxUploadSize) {
                  //covers a degenerate case of too large of an upload.  Possible DOS attempt
                  part.emit('error', new Error('Upload exceeds maximum allowed size'), 413);
                }
                else {  
                  data.push(chunk);
                }
              });
                         
              part.on('end', function() {
                switch (part.name) {
                  case 'ANI':
                    ani = data.toString();
                    break;
                  case 'DNIS':
                    dnis = data.toString();
                    break;
                  case 'MSG':
                    if (part.filename) {
                      fname = part.filename;
                      msg = Buffer.concat(data);
                    }
                    else {
                      part.emit('error', new Error('Malformed file part in form'), 400);
                    }
                    break;
                  default:
                    part.emit('error', new Error('Unrecognized part in form'), 400);
                    break;
                }      
              });
            });
                        
            form.on('close', function() {
              if (ani && dnis && fname && msg) {
                res.status(200).sendFile(__dirname + '/vxml/response.vxml');
                var mailOptions = {
                  from : properties.emailFromUser,
                  to : properties.emailToUser,
                  subject : 'Recorded Message - ANI:' + ani + ', DNIS:' + dnis,
                  text : 'The attached recorded audio message was received.',
                  attachments : [{filename : fname, content : msg}]
                };
                
                var bufStream = new stream.PassThrough();
                bufStream.end(msg);
                var params = {
                  audio: bufStream,
                  content_type: 'audio/wav',
                  model: 'en-US_NarrowbandModel',
                  continuous: 'true'
                };
                
                speech_to_text.recognize(params, function(err, recEvent) {
                  if (err) {
                    logger.error('File: main.js, Method: appHttp.post(), Speech to Text error');     
                  }
                  else {
                    if (recEvent && recEvent.results) {
                      mailOptions.text += '\n\nTranscripted message below:\n\n'; 
                      for (var i=0; i < recEvent.results.length; i++) {
                        var result = recEvent.results[i];
                        if (result && result.final && result.alternatives && result.alternatives.length > 0) {
                          mailOptions.text += result.alternatives[0].transcript + '\n';
                        }   
                      }
                    }
                  }
                  transporter.sendMail(mailOptions, function(error, info) {
                    if (error) {
                      appHttp.emit('error', error);
                    }
                    logger.debug('Exiting - File: main.js, Method: appHttp.post()');
                  });   
                });    
              }
              else {
                form.emit('error', new Error('Form missing required fields'), 400);
              }             
            });
                        
            form.parse(req); 
          }

Lines 71-72:  The Watson API expects the audio to be passed as a Stream object, so convert the existing Buffer object containing the recorded audio into a stream.
Lines 73-78:  I set the audio type to .wav, so the VXML <record> tag also needs to be modified accordingly (a sketch of that change follows this list).  The critical item is the 'model' property.  This is where the acoustic model can be set to 8 kHz (en-US_NarrowbandModel); the default is 16 kHz (en-US_BroadbandModel).
Lines 80-84:  Invoke the Watson speech-to-text service as described earlier.  Iterate through the results array and append the transcript to the email body.
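For reference, here's a minimal sketch of the corresponding VXML <record> change.  The attributes are standard VoiceXML 2.0, but the exact recording MIME type (audio/x-wav vs. audio/wav) and attribute support depend on your VXML platform, so treat this as an assumption rather than a drop-in snippet.

<!-- Hypothetical VXML fragment: request a WAV recording named MSG from the platform. -->
<record name="MSG" type="audio/x-wav" beep="true" maxtime="60s" finalsilence="3s" dtmfterm="true">
  <prompt>Please leave a message after the beep.</prompt>
</record>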

Output

Below is the resulting email for this recorded message:  "Hello this is Fred Flintstone.  I'm calling regarding the box of rocks I ordered last week.  Please give me a call back at 123-456-7890.  Thank you."

From: yourFromAddress@gmail.com
To: yourToAddress@gmail.com
Subject: Recorded Message - ANI:1234567890, DNIS:9876543210
X-Mailer: nodemailer (1.3.4; +http://www.nodemailer.com;
 SMTP/1.0.3[client:1.2.0])
Date: Sun, 05 Jul 2015 19:31:50 +0000
Message-Id: <1436124711412-27b58ceb-6996ffc2-a6e8560f@gmail.com>
MIME-Version: 1.0

------sinikael-?=_1-14361247108010.760849520098418
Content-Type: text/plain; format=flowed
Content-Transfer-Encoding: 7bit

The attached recorded audio message was received.

Transcripted message below:

hello this is fred flintstone i'm calling regarding the box of rocks i 
ordered last week 
these give me a call back at one two three four five six seven eight nine 
zero thank you 

------sinikael-?=_1-14361247108010.760849520098418
Content-Type: audio/wav
Content-Disposition: attachment; filename=MSG-1436124698682.wav
Content-Transfer-Encoding: base64