The Open VXI is intended to be useful on a wide variety of platforms and in a wide variety of environments. It does not require any particular speech recognition or speech synthesis system, nor is it specific to any telephony platform. It is also designed to be portable between operating systems and architectures.
Although it provides full interpretation of VoiceXML 1.0, the Open VXI is only one component of an operational VoiceXML platform. The Open VXI does not include the functionality of a speech recognition engine, prompt and Text-to-Speech (TTS) capabilities, or telephony functions. It also requires file-system and Internet services. In addition, moderate and large-scale systems would require administration and management support. Finally, as XML and ECMAScript are integral parts of the VoiceXML specification, an XML DOM parser and an ECMAScript interpreter are required.
As portability is one of its primary design goals, the Open VXI abstracts the requirements of VoiceXML into a set of interfaces that encapsulate the functionality required from speech recognition, prompt engines, telephony, etc. In turn, the VXI presents its own API for incorporation into other applications.
This document is meant as a guide to integrating the Open VXI, based on our experience incorporating it with the SpeechWorks 6.5 SDK and the Speechify TTS engine. It describes the intent and use of these interfaces. The complete API documentation is given in an appendix. The integration of the VXI into a complete speech browsing platform or service should not (in general) require any modification of the Open VXI source itself.
The integration of VXI into the SpeechWorks platform requires the implementation of the following modules.
All structured data is passed across interface boundaries as handles (castable to void* and testable against NULL/0). The actual implementation of any structured type is invisible across API boundaries.
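The handle convention can be sketched in C as follows. This is a minimal illustration and not Open VXI code: DemoSession, demo_session_create() and the other names are hypothetical. The caller only ever holds a pointer to an incomplete type, so it can cast the handle to void* and compare it with NULL, but it cannot reach into the structure.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names for illustration; not part of the Open VXI headers. */

/* Public view: an incomplete type, so callers cannot see any fields. */
typedef struct DemoSession DemoSession;
typedef DemoSession *DemoSessionHandle;

/* Private view: in a real module this definition would live only in
 * the implementation file, never in the public header. */
struct DemoSession {
    int calls_handled;
};

/* Returns a new zero-initialized handle, or NULL if allocation fails. */
DemoSessionHandle demo_session_create(void)
{
    return calloc(1, sizeof(DemoSession));
}

/* Accessor functions are the only way to change or read hidden state. */
void demo_session_note_call(DemoSessionHandle h)
{
    assert(h != NULL);
    h->calls_handled++;
}

int demo_session_calls(DemoSessionHandle h)
{
    assert(h != NULL);
    return h->calls_handled;
}

/* Releases the handle; passing NULL is harmless because free(NULL) is a no-op. */
void demo_session_destroy(DemoSessionHandle h)
{
    free(h);
}
```

Because the structure layout never crosses the API boundary, an implementation can change its internal representation without forcing callers to be recompiled against new field definitions.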
All of the C API functions, with the exception of the type-system implementation described below, are defined to return a result code of type VXIresult, defined in vxitypes.h. Possible result codes are also defined in vxitypes.h.
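As a rough sketch of this calling convention, the following C fragment returns only a status code from each function and passes data back through an output parameter. DemoResult and its codes are hypothetical stand-ins chosen for this example; the actual VXIresult type and its values are the ones defined in vxitypes.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real VXIresult type and its codes are
 * defined in vxitypes.h and are not reproduced here. */
typedef enum {
    DEMO_RESULT_SUCCESS = 0,
    DEMO_RESULT_INVALID_ARGUMENT,
    DEMO_RESULT_FAILURE
} DemoResult;

/* The return value carries only status; the length comes back via out_len. */
DemoResult demo_text_length(const char *text, size_t *out_len)
{
    size_t n = 0;
    if (text == NULL || out_len == NULL)
        return DEMO_RESULT_INVALID_ARGUMENT;
    while (text[n] != '\0')
        n++;
    *out_len = n;
    return DEMO_RESULT_SUCCESS;
}

/* Callers check every result and propagate the first failure unchanged.
 * On failure, *out_total is left untouched. */
DemoResult demo_total_length(const char *a, const char *b, size_t *out_total)
{
    size_t la = 0, lb = 0;
    DemoResult rc;
    if (out_total == NULL)
        return DEMO_RESULT_INVALID_ARGUMENT;
    if ((rc = demo_text_length(a, &la)) != DEMO_RESULT_SUCCESS)
        return rc;
    if ((rc = demo_text_length(b, &lb)) != DEMO_RESULT_SUCCESS)
        return rc;
    *out_total = la + lb;
    return DEMO_RESULT_SUCCESS;
}
```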
The Open VXI also provides a library of useful abstract data types, some of which are required by the APIs. Again, these are provided with a C API for maximum portability. They are defined in vxivalue.h and include both encapsulated basic types and containers. The basic types include:
The integration of VXI with ECMAScript (henceforth JavaScript) is defined by the JSI API. This encapsulates the functionality the VXI requires of JavaScript, allowing the use of any JavaScript engine.
Our default integration uses the "SpiderMonkey" JavaScript codebase from Mozilla (www.mozilla.org). This is an open source (Mozilla-license or GPL) C codebase available over the Internet. SpiderMonkey provides a C API to compile and evaluate scripts and manipulate objects. Our JSI library implements the JSI functions on top of this API, preferring evaluation of scripts to direct object manipulation whenever possible.
We have found that the SpiderMonkey interpreter, although otherwise excellent, is not completely thread safe. The problems arise when one or more threads attempt to garbage collect. In the short term we have addressed this by explicitly locking all calls to JavaScript in the JSI implementation. This is clearly not optimal because of potential contention, and we are investigating alternatives. Experiments on Intel/NT have demonstrated that the locking contention is not a significant performance factor and that the resulting integration is both thread safe and free of memory leaks.
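The locking approach can be illustrated with a small C sketch using POSIX threads. It is not the reference JSI implementation: demo_engine_eval() is a hypothetical stand-in for a call into a JavaScript engine, and the choice of a single process-wide pthread mutex is an assumption made for this example.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a call into a JavaScript engine that is not
 * safe to enter from several threads at once. It ignores the script and
 * returns the number of calls made so far. */
static int demo_engine_calls = 0;

static int demo_engine_eval(const char *script)
{
    (void)script;
    return ++demo_engine_calls;
}

/* One process-wide lock serializes every entry into the engine. */
static pthread_mutex_t demo_js_lock = PTHREAD_MUTEX_INITIALIZER;

/* Wrapper in the style of a JSI function: lock, call the engine, unlock.
 * Every path into the engine must go through a wrapper like this one,
 * otherwise the lock protects nothing. */
int demo_jsi_eval(const char *script)
{
    int result;
    assert(script != NULL);
    pthread_mutex_lock(&demo_js_lock);
    result = demo_engine_eval(script);
    pthread_mutex_unlock(&demo_js_lock);
    return result;
}
```

The trade-off is the one noted above: a single coarse lock is simple to reason about, but all channels contend for it whenever they evaluate scripts.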
A fourth platform API is defined in vxiobj.h and provides a generic interface for calling platform-specific functionality. It is used to implement the VoiceXML <object> tag.
The global ShutDown() routines are meant to perform any platform specific cleanup required for a graceful system shutdown. Again, they are not called by the VXI and can be ignored if shutdown is provided for by other means.
The JSI Interface also has Init() and Shutdown() routines. In our reference integration, the VXIjsiInit() routine must be called by the application before any calls to the VXI.
A Channel Object is created and initialized for each conceptual voice channel. For example, in a PSTN telephony system, one Channel Object might be created for every incoming line. A VXI instance is bound to a particular Channel Object at instance creation and uses the resources on the channel to perform recognition, prompting, etc. A new VXI instance can be created for each call, or one can be created for each channel and re-used. In general, VXI instances are lightweight objects and creation/destruction times are negligible.
All platform API functions called by VXI are channel-specific and pass in the Channel Object as their first argument.
Each CreateResource routine should construct any channel-specific objects required by the module implementation and place handles to these objects in the Channel Object (using VXIObjectSetProperty). There is no namespace convention enforced across platforms and each integration needs to develop its own conventions to avoid collisions. In addition, these routines should perform any channel-specific hardware or resource initialization required.
For example, an integration with the SWI 6.5 DialogModule API might have the VXIrecCreateResource() routine create the Hardware interface, Factory, and Session objects and store handles to them in the Channel Object. These channel-specific resources will then be retrieved from the Channel Object to implement platform functionality required by the API.
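The store-and-retrieve pattern described above might look roughly like the following C sketch. Everything here is hypothetical: DemoChannel is a simplified stand-in for the Channel Object, demo_channel_set() stands in for VXIObjectSetProperty (whose real signature is given in the API documentation), and the "rec." name prefix is just one possible namespace convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a Channel Object: a small table of named handles. */
#define DEMO_MAX_PROPS 16

typedef struct {
    const char *names[DEMO_MAX_PROPS]; /* not copied: pass string literals */
    void *handles[DEMO_MAX_PROPS];
    size_t count;
} DemoChannel;

/* Stores or replaces a handle; returns 0 on success, -1 if the table is full. */
int demo_channel_set(DemoChannel *ch, const char *name, void *handle)
{
    size_t i;
    assert(ch != NULL && name != NULL);
    for (i = 0; i < ch->count; i++) {
        if (strcmp(ch->names[i], name) == 0) {
            ch->handles[i] = handle;
            return 0;
        }
    }
    if (ch->count == DEMO_MAX_PROPS)
        return -1;
    ch->names[ch->count] = name;
    ch->handles[ch->count] = handle;
    ch->count++;
    return 0;
}

/* Returns the stored handle, or NULL if the name is unknown. */
void *demo_channel_get(const DemoChannel *ch, const char *name)
{
    size_t i;
    assert(ch != NULL && name != NULL);
    for (i = 0; i < ch->count; i++) {
        if (strcmp(ch->names[i], name) == 0)
            return ch->handles[i];
    }
    return NULL;
}

/* Hypothetical recognizer state created once per channel. */
typedef struct {
    int session_open;
} DemoRecSession;

/* CreateResource-style routine: initialize the channel-specific object and
 * publish its handle under a module-prefixed name to avoid collisions. */
int demo_rec_create_resource(DemoChannel *ch, DemoRecSession *session)
{
    assert(ch != NULL && session != NULL);
    session->session_open = 1;
    return demo_channel_set(ch, "rec.session", session);
}
```

Later API calls receive the channel as their first argument and can look up "rec.session" to recover the recognizer state without any global variables.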
The set of Prompt API functions called by the VXI is fairly small:
promptDestroy() notifies the platform that the VXI will not need this prompt handle again, so the platform is free to clean up any prompt-specific data structures it may have created.
promptAddSegment() adds a segment of prompt data to a prompt object. Audio segments can be either URIs of audio data to be played directly, or specifications of typed data (e.g., currency) to be rendered from a collection of recorded audio segments. URIs in audio segments are specified with a Fetch Object (described in the OS integration section). See the API documentation for the complete specification of the arguments to promptAddSegment(). TTS segments consist of text potentially containing JSML markup.
promptQueue() queues a previously constructed prompt and returns immediately. [Note: The VXI calls promptDestroy immediately after promptQueue if it has no further need for a prompt.]
There is one other prompt API function, promptWait(). This blocks until all queued prompts on this channel have been played. It is not called by VXI, but is specified in the API as a convenience. The controlling application should wait for all prompts to be played before hanging up on the caller.
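One consequence of the call order described above is that a prompt may be destroyed while its audio is still waiting to be played. The C sketch below shows one way a platform could cope, by copying segments into a per-channel queue at queue time. All names and fixed sizes are hypothetical, segments are reduced to plain strings, and "waiting" is simulated by draining the queue.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_SEGMENTS 8
#define DEMO_MAX_QUEUED 32
#define DEMO_MAX_TEXT 64

/* A prompt under construction: an ordered list of text segments. */
typedef struct {
    char segments[DEMO_MAX_SEGMENTS][DEMO_MAX_TEXT];
    int count;
} DemoPrompt;

/* Per-channel play queue; it owns its own copies of queued segments. */
typedef struct {
    char pending[DEMO_MAX_QUEUED][DEMO_MAX_TEXT];
    int count;
} DemoPlayQueue;

DemoPrompt *demo_prompt_create(void)
{
    return calloc(1, sizeof(DemoPrompt));
}

/* Appends one segment; returns -1 if the prompt is full or the text too long. */
int demo_prompt_add_segment(DemoPrompt *p, const char *text)
{
    assert(p != NULL && text != NULL);
    if (p->count == DEMO_MAX_SEGMENTS || strlen(text) >= DEMO_MAX_TEXT)
        return -1;
    strcpy(p->segments[p->count++], text);
    return 0;
}

/* Copies the segments into the channel queue and returns immediately.
 * After this call the prompt object may be destroyed safely. */
int demo_prompt_queue(DemoPlayQueue *q, const DemoPrompt *p)
{
    int i;
    assert(q != NULL && p != NULL);
    if (q->count + p->count > DEMO_MAX_QUEUED)
        return -1;
    for (i = 0; i < p->count; i++)
        strcpy(q->pending[q->count++], p->segments[i]);
    return 0;
}

void demo_prompt_destroy(DemoPrompt *p)
{
    free(p);
}

/* Stand-in for waiting on playback: drains the queue and reports how many
 * segments were "played". */
int demo_prompt_wait(DemoPlayQueue *q)
{
    int played;
    assert(q != NULL);
    played = q->count;
    q->count = 0;
    return played;
}
```

Reference counting the prompt data instead of copying it would serve the same purpose; copying is used here only because it keeps the sketch short.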
Each <choice> element creates a separate grammar, while the set of <option> elements in a <field> creates a single grammar.
The LoadGrammar routine merely instructs the Recognition Interface to be prepared to recognize with this grammar in the near future. It returns a Grammar handle to the VXI, which the VXI uses to determine which grammar was triggered by the user's utterance.
After the document's grammars are loaded, the VXI proceeds into the Form Interpretation Algorithm and deactivates some grammars and activates others by calling DeactivateGrammar() and ActivateGrammar(). The Recognition Interface must keep track of the grammars currently activated. Usually, the VXI will first deactivate all loaded grammars, then activate those it needs. Finally, the VXI issues a call to Recognize(), which starts the recognizer with the current set of activated grammars.
The Recognize() function produces a VXIObject containing, in specified properties:
The recFreeGrammar function is issued by the VXI to indicate that it will not require a given grammar until another LoadGrammar call is issued. The Recognition Interface can decide on its own how and when to free actual grammar-related resources. Given the typical scenario of a multi-line service connecting to a small number of VoiceXML servers, it is likely the same grammars will be used again and again and caching in the Recognition Interface should greatly improve performance.
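A caching policy along these lines could be sketched in C as follows. This is an assumption-laden illustration, not the reference implementation: grammars are keyed by a URI string, the cache has a small fixed number of slots, and "compiling" a grammar is represented only by a counter.

```c
#include <assert.h>
#include <string.h>

#define DEMO_CACHE_SLOTS 8
#define DEMO_URI_MAX 128

typedef struct {
    char uri[DEMO_URI_MAX];
    int in_use;    /* slot holds a compiled grammar */
    int ref_count; /* loads that have not yet been freed */
} DemoCacheEntry;

typedef struct {
    DemoCacheEntry slots[DEMO_CACHE_SLOTS];
    int compile_count; /* how many times a grammar was actually compiled */
} DemoGrammarCache;

/* Returns the slot index for the grammar, or -1 if it cannot be cached. */
int demo_cache_load(DemoGrammarCache *c, const char *uri)
{
    int i, slot = -1;
    assert(c != NULL && uri != NULL);
    if (strlen(uri) >= DEMO_URI_MAX)
        return -1;
    /* Cache hit: reuse the compiled grammar and count one more user. */
    for (i = 0; i < DEMO_CACHE_SLOTS; i++) {
        if (c->slots[i].in_use && strcmp(c->slots[i].uri, uri) == 0) {
            c->slots[i].ref_count++;
            return i;
        }
    }
    /* Miss: prefer an empty slot, otherwise evict an unreferenced entry. */
    for (i = 0; i < DEMO_CACHE_SLOTS && slot < 0; i++) {
        if (!c->slots[i].in_use)
            slot = i;
    }
    for (i = 0; i < DEMO_CACHE_SLOTS && slot < 0; i++) {
        if (c->slots[i].ref_count == 0)
            slot = i;
    }
    if (slot < 0)
        return -1;
    strcpy(c->slots[slot].uri, uri);
    c->slots[slot].in_use = 1;
    c->slots[slot].ref_count = 1;
    c->compile_count++;
    return slot;
}

/* Free only drops a reference; the compiled grammar stays cached until
 * its slot is needed for something else. */
void demo_cache_free(DemoGrammarCache *c, int slot)
{
    assert(c != NULL && slot >= 0 && slot < DEMO_CACHE_SLOTS);
    if (c->slots[slot].ref_count > 0)
        c->slots[slot].ref_count--;
}
```

With this policy, a grammar that is freed at the end of one call and loaded again at the start of the next is compiled only once.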
The recSetProperties function is straightforward and is used to set recognition properties, such as thresholds and timeouts.
The recRecord function records a segment of audio and returns it to the VXI. The VXI does not currently support the VoiceXML <record> element. It is implemented but disabled, because the ability to store and manipulate the resulting recording in ECMAScript, as required by VoiceXML, is not yet clearly specified. This requires an extension to ECMAScript which we have not yet implemented. As a consequence, the recRecord() function is currently never called in our reference implementation, and the <record> element throws an event.
In addition to the functions above, there are two others that are useful even though the VXI never calls them directly. recBeginSession and recEndSession can encapsulate any call-specific initialization or logging that is desired, and may be called by the surrounding application.
The VXI currently does not support recognition on a bridged call (i.e., eavesdropping).