ASR

From SoftIVR

ASR (Automatic Speech Recognition) lets your callers control your IVR by speaking commands rather than pressing buttons on their telephone. Implemented well, speech recognition enables applications that would otherwise be too cumbersome for users to navigate; implemented badly, it can make a straightforward application painful to use.

Grammars

Speech recognition always starts with a grammar, which defines what the caller is likely to say. Two formats for expressing grammars are in common use, ABNF and XML; we'll restrict ourselves to ABNF, as it represents grammars much more compactly than XML.

An ABNF grammar starts with a header, which marks the grammar as ABNF, gives the version and character encoding, and specifies the language to listen for. A header for a US English grammar is:

#ABNF 1.0 ISO-8859-1;
language en-US;

The free SoftIVR ASR supports en-US (US English), en-GB (British English), en-AU (Australian English), fr-CA (Canadian French), es-MX (Mexican Spanish) and es-CO (South American Spanish). Further languages are available to premium users.

After the header come the rules of the grammar. A very simple grammar, which takes 'yes' or 'no' as input, is:

#ABNF 1.0 ISO-8859-1;
language en-US;
root $yn;
$yn = yes | no;

The 'root' declaration names the rule at which matching starts; everything the caller can say must be reachable from that rule. In this case, the rule is defined as 'yes | no', where | means 'or'.
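As a rough illustration of how the header, the root declaration and the rule fit together, the following Python sketch assembles the same grammar from a list of words. It is an offline helper, not SoftIVR script, and the function name build_grammar is hypothetical.

```python
def build_grammar(rule, words, language="en-US"):
    """Assemble a minimal ABNF grammar whose root rule is a list of alternatives."""
    lines = [
        "#ABNF 1.0 ISO-8859-1;",            # header: format, version, encoding
        f"language {language};",            # language to listen for
        f"root ${rule};",                   # rule at which matching starts
        f"${rule} = {' | '.join(words)};",  # alternatives separated by |
    ]
    return "\n".join(lines) + "\n"


print(build_grammar("yn", ["yes", "no"]))
```

The printed text matches the 'yes or no' grammar shown above, so the output could be passed on as a grammar string.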

Callers don't always say exactly what they're supposed to, and often use informal variants. 'Yes' might come out as 'yep' or 'yeah', and 'no' as 'nope'. We can expand the grammar to include these cases like this:

#ABNF 1.0 ISO-8859-1;
language en-US;
root $yn;
$yn = yes | yeah | yep | no | nope;

This works, except that by default the matched string is what is returned to the application, so the application code would also need to know every alternative available to the caller.

To separate the options in the grammar from their meaning as understood by the application code, you can specify tags. In its simplest form, a tag specifies a value to be returned to the application. So, to have our grammar return 'yes' or 'no' to the application irrespective of which variant the caller spoke, we can extend it as follows:

#ABNF 1.0 ISO-8859-1;
language en-US;
root $yn;
$yn = yes {$.yn='yes';} | yeah {$.yn='yes';} | yep {$.yn='yes';} | no {$.yn='no';} | nope {$.yn='no';};

The application can now test the returned 'yn' value for 'yes' or 'no', and the grammar takes care of the alternatives. This also allows the same application code to be used for different languages:

#ABNF 1.0 ISO-8859-1;
language fr-CA;
root $yn;
$yn = oui {$.yn='yes';} | non {$.yn='no';};

Here the grammar is French, but the response returned to the application is the same as for the English one.
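Writing the tag out for every variant by hand gets repetitive, so one option is to generate the grammar text from a table of meanings. The Python sketch below is an offline helper under that assumption, not SoftIVR script; build_tagged_grammar and its parameters are hypothetical names.

```python
def build_tagged_grammar(rule, meanings, language):
    """Build an ABNF grammar in which each spoken variant returns a canonical value.

    meanings maps a canonical value (e.g. 'yes') to the words that mean it.
    """
    alternatives = []
    for value, words in meanings.items():
        for word in words:
            # Each alternative carries a tag that sets the returned field.
            alternatives.append(f"{word} {{$.{rule}='{value}';}}")
    return (
        "#ABNF 1.0 ISO-8859-1;\n"
        f"language {language};\n"
        f"root ${rule};\n"
        f"${rule} = {' | '.join(alternatives)};\n"
    )


english = build_tagged_grammar(
    "yn", {"yes": ["yes", "yeah", "yep"], "no": ["no", "nope"]}, "en-US"
)
french = build_tagged_grammar("yn", {"yes": ["oui"], "no": ["non"]}, "fr-CA")
print(english)
print(french)
```

Both generated grammars return the same 'yes' or 'no' values, so the application code that reads 'yn' does not change between languages.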

So, for a first speech-enabled application, enter the following code:

answer();
say("Please say yes or no after the tone.");
playTone(800, 0.3);
grammar = "";
grammar = grammar + "#ABNF 1.0 ISO-8859-1;\n";
grammar = grammar + "language en-US;\n";
grammar = grammar + "root $yn;\n";
grammar = grammar + "$yn = yes {$.yn = 'yes';} | no {$.yn = 'no';};\n";
result = ASR(grammar);
parseXML(result);
playTone(800, 0.3);
say("You said " + getXPath("interpretation/instance/yn") + " with confidence " + getXPath("interpretation/confidence"));

The ASR function returns an XML structure. In this case, an example might be:

<?xml version="1.0"?>
<result>
  <interpretation grammar="session:1119492@softivr" confidence="99">
    <instance>
      <yn>yes</yn>
    </instance>
    <input mode="speech">yes</input>
  </interpretation>
</result>

where the engine is 99% confident that the speaker said "yes". The <input>...</input> element shows what the engine thought was said, and the elements inside <instance> hold the values set by any tags matched during the recognition.
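To show how the pieces of this result relate to one another, here is a small Python sketch that reads the sample document offline with the standard library. It is not SoftIVR script: read_result is a hypothetical helper, and the 50% cut-off is an arbitrary assumption, not a SoftIVR default.

```python
import xml.etree.ElementTree as ET

# The sample result from above, copied verbatim.
SAMPLE_RESULT = """<?xml version="1.0"?>
<result>
  <interpretation grammar="session:1119492@softivr" confidence="99">
    <instance>
      <yn>yes</yn>
    </instance>
    <input mode="speech">yes</input>
  </interpretation>
</result>"""


def read_result(xml_text, min_confidence=50):
    """Return (value, confidence); value is None when confidence is too low."""
    root = ET.fromstring(xml_text)
    interpretation = root.find("interpretation")
    if interpretation is None:
        # Assumption: a result without an interpretation means nothing matched.
        return None, 0
    confidence = int(interpretation.get("confidence", "0"))
    value = interpretation.findtext("instance/yn")  # value set by the tag
    if confidence < min_confidence:
        return None, confidence
    return value, confidence


print(read_result(SAMPLE_RESULT))  # ('yes', 99)
```

Checking the confidence before acting on the value gives the application a natural point at which to ask the caller to repeat themselves.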