Integrating the Open VXI

The Open VXI VoiceXML interpreter is a portable open source library that interprets the VoiceXML dialog markup language. It is designed to serve both as a reference for parties interested in understanding how VoiceXML might be interpreted and as a component of a VoiceXML-based debugger, browser, or other VoiceXML-based system. Although it is perfectly suitable for PC desktop applications, its design reflects VoiceXML's target of telephony platforms.

The Open VXI is intended to be useful in a wide variety of platforms and environments. It does not require any particular speech recognition or speech synthesis system, nor is it specific to any telephony platform. It is also designed to be portable between operating systems and architectures.

Although it provides full interpretation of VoiceXML 1.0, the Open VXI is only one component of an operational VoiceXML platform. The Open VXI does not include the functionality of a speech recognition engine, prompt and Text-to-Speech (TTS) capabilities, or telephony functions. It also requires file-system and Internet services. In addition, moderate and large-scale systems would require administration and management support. Finally, as XML and ECMAscript are integral parts of the VoiceXML specification, an XML/DOM parser and an ECMAscript interpreter are required.

As portability is one of its primary design goals, the Open VXI abstracts the requirements of VoiceXML into a set of Interfaces which encapsulate the functionality required from speech recognition, prompt engines, telephony, etc. In turn, the VXI presents its own API for incorporation in other applications.

This document is meant as a guide to integrating the Open VXI, based on our experience incorporating it with the SpeechWorks 6.5 SDK and the Speechify TTS engine. It describes the intent and use of these interfaces. The complete API documentation is given in an appendix. The integration of the VXI into a complete speech browsing platform or service should not (in general) require any modification of the Open VXI source itself.

The integration of VXI into the SpeechWorks platform requires the implementation of the following modules.

Reference implementations of these interfaces that operate with file- and/or text-based input and output are provided in the open source toolkit along with Open VXI.

Interface Design: General Principles

The APIs, with the exception of the XML parser, are specified as a set of C functions and should be implemented as a set of libraries (static or dynamic) to be linked and loaded with the VXI. Although there are other interface technologies suitable for this purpose (C++, Java Interfaces, COM, etc.), a simple C API was chosen for maximum portability across platforms and operating systems. Although the API implementations must be linked with the VXI, the platform functionality need not be so tightly coupled: the linked libraries can simply be stubs that call local services, components, or remote servers.

All structured data is passed across interface boundaries as handles (castable to void* and testable against NULL/0). The actual implementation of any structured type is invisible across API boundaries.

All of the C API functions, with the exception of the type-system implementation described below, are defined to return a result code of type VXIresult, defined in vxitypes.h. Possible result codes are also defined in vxitypes.h.
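
The two conventions above (opaque handles and result codes) can be illustrated with a minimal sketch. The names DemoResult, DemoCounter, and the demoCounter functions are hypothetical and exist only for this illustration; they are not part of the Open VXI headers, where the real result type is VXIresult in vxitypes.h.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical result codes, standing in for VXIresult. */
typedef enum {
    DEMO_RESULT_SUCCESS = 0,
    DEMO_RESULT_INVALID_ARGUMENT = 1,
    DEMO_RESULT_OUT_OF_MEMORY = 2
} DemoResult;

/* Callers only ever hold a pointer to this type (the handle). */
typedef struct DemoCounter DemoCounter;

/* In a real library this definition would live in the .c file only,
   so the layout stays invisible across the API boundary. */
struct DemoCounter {
    int value;
};

DemoResult demoCounterCreate(DemoCounter **out)
{
    if (out == NULL) return DEMO_RESULT_INVALID_ARGUMENT;
    *out = malloc(sizeof **out);
    if (*out == NULL) return DEMO_RESULT_OUT_OF_MEMORY;
    (*out)->value = 0;
    return DEMO_RESULT_SUCCESS;
}

DemoResult demoCounterIncrement(DemoCounter *c, int *newValue)
{
    if (c == NULL || newValue == NULL) return DEMO_RESULT_INVALID_ARGUMENT;
    assert(c->value < INT_MAX);   /* guard against signed overflow */
    *newValue = ++c->value;
    return DEMO_RESULT_SUCCESS;
}

DemoResult demoCounterDestroy(DemoCounter **c)
{
    if (c == NULL || *c == NULL) return DEMO_RESULT_INVALID_ARGUMENT;
    free(*c);
    *c = NULL;                    /* handle stays testable against NULL */
    return DEMO_RESULT_SUCCESS;
}
```

Because every function returns a status and passes data back through out-parameters, a caller can check each call uniformly without knowing anything about the handle's layout.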

Type System, Strings and Containers

Internally, and in its APIs, the Open VXI uses an abstract type system to adapt to different architectures and compilers, as well as to support internationalization of textual data. These types are defined in the header file vxitypes.h and, of course, must be redefined if the VXI is compiled for a new architecture.

The Open VXI also provides a library of useful abstract data types, some of which are required by the APIs. Again, these are provided with a C API for maximum portability. They are defined in vxivalue.h and include both encapsulated basic types and containers. The basic types include:

The container types include: All container types are polymorphic and the library supports runtime type checking.
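
To make the idea of polymorphic containers with runtime type checking concrete, here is a small sketch built on a tagged union. It is an assumption-laden illustration, not the vxivalue.h API: DemoValue, DemoType, and all demo functions are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { DEMO_TYPE_INTEGER, DEMO_TYPE_FLOAT, DEMO_TYPE_VECTOR } DemoType;

typedef struct DemoValue DemoValue;
struct DemoValue {
    DemoType type;                       /* runtime type tag */
    union {
        int integer;
        double real;
        struct { DemoValue **items; size_t length; } vector;
    } u;
};

static DemoValue *demoValueNew(DemoType type)
{
    DemoValue *v = calloc(1, sizeof *v);
    if (v != NULL) v->type = type;
    return v;
}

DemoValue *demoIntegerCreate(int n)
{
    DemoValue *v = demoValueNew(DEMO_TYPE_INTEGER);
    if (v != NULL) v->u.integer = n;
    return v;
}

DemoValue *demoVectorCreate(void)
{
    DemoValue *v = demoValueNew(DEMO_TYPE_VECTOR);
    if (v != NULL) { v->u.vector.items = NULL; v->u.vector.length = 0; }
    return v;
}

DemoType demoValueGetType(const DemoValue *v)
{
    assert(v != NULL);
    return v->type;
}

/* Typed accessor: fails with -1 instead of misreading another type. */
int demoIntegerGet(const DemoValue *v, int *out)
{
    if (v == NULL || out == NULL || v->type != DEMO_TYPE_INTEGER) return -1;
    *out = v->u.integer;
    return 0;
}

/* Appends any value type; the vector takes ownership of item. */
int demoVectorAdd(DemoValue *vec, DemoValue *item)
{
    DemoValue **grown;
    if (vec == NULL || item == NULL || vec->type != DEMO_TYPE_VECTOR) return -1;
    grown = realloc(vec->u.vector.items,
                    (vec->u.vector.length + 1) * sizeof *grown);
    if (grown == NULL) return -1;
    grown[vec->u.vector.length++] = item;
    vec->u.vector.items = grown;
    return 0;
}

DemoValue *demoVectorGet(const DemoValue *vec, size_t index)
{
    if (vec == NULL || vec->type != DEMO_TYPE_VECTOR) return NULL;
    if (index >= vec->u.vector.length) return NULL;
    return vec->u.vector.items[index];
}

/* Destroys a value and, for vectors, everything it owns. */
void demoValueDestroy(DemoValue *v)
{
    size_t k;
    if (v == NULL) return;
    if (v->type == DEMO_TYPE_VECTOR) {
        for (k = 0; k < v->u.vector.length; k++)
            demoValueDestroy(v->u.vector.items[k]);
        free(v->u.vector.items);
    }
    free(v);
}
```

A caller that pulls an element out of the vector checks its type tag (or simply calls the typed accessor and inspects the return code) before using it.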

XML Parser Integration

The one exception to the principle of C APIs is the integration with the XML parser. Since the Open VXI is perforce very tightly coupled to the XML parser and DOM interface, this interface is not as isolated as the defined C APIs. The VXI uses the "Xerces" C++ DOM (Document Object Model) interface published and implemented by Apache (www.apache.org). This implementation is open source and available over the Web (the VXI currently compiles against Xerces version 1.2). We recommend using this XML parser/DOM interface for ease of integration. Other integrations are possible, though they may require modification of the VXI source.

ECMAscript Integration

The semantics of VoiceXML are deeply coupled to an embedded ECMAscript interpreter. Indeed, VoiceXML allows the execution of arbitrary ECMAscript scripts. (ECMAscript is an attempt at a standard version of JavaScript; see www.ecma.org for details.)

The integration of VXI with ECMAscript (henceforth JavaScript) is defined by the JSI API. This encapsulates the functionality the VXI requires of JavaScript, allowing the use of any JavaScript engine.

Our default integration uses the "SpiderMonkey" JavaScript codebase from Mozilla (www.mozilla.org). This is an open source (Mozilla-license or GPL) C codebase available over the Internet. SpiderMonkey provides a C API to compile and evaluate scripts and manipulate objects. Our JSI library implements the JSI functions on top of this API, preferring evaluation of scripts to direct object manipulation whenever possible.

We have found the SpiderMonkey interpreter, although otherwise excellent, to be not completely thread safe. The problems arise when one or more threads attempt to garbage collect. We have addressed this problem in the short term by explicitly locking all calls to JavaScript in the JSI implementation. This is clearly not optimal because of contention, and we are investigating alternatives. Nonetheless, experiments on Intel/NT have demonstrated that the locking contention is not a significant performance factor and that the resulting integration is both thread safe and free of memory leaks.
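
The explicit locking described above can be sketched as a single mutex that serializes every call into the script engine. This is only an illustration under stated assumptions: it uses POSIX threads, and demoEngineEval is a hypothetical stand-in for a non-thread-safe engine call, not a SpiderMonkey or JSI function.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical stand-in for a script engine entry point that is not
   safe to call from two threads at once. */
static int demo_engine_calls = 0;

static int demoEngineEval(const char *script)
{
    demo_engine_calls++;            /* unsynchronized shared state */
    return (int)strlen(script);     /* pretend evaluation result */
}

/* One process-wide lock guarding every call into the engine. */
static pthread_mutex_t demo_js_lock = PTHREAD_MUTEX_INITIALIZER;

/* Wrapper used by all callers: takes the lock, calls the engine,
   releases the lock. Returns 0 on success, -1 on bad arguments. */
int demoJsiEval(const char *script, int *result)
{
    int rc;
    if (script == NULL || result == NULL) return -1;

    rc = pthread_mutex_lock(&demo_js_lock);
    assert(rc == 0);
    *result = demoEngineEval(script);
    rc = pthread_mutex_unlock(&demo_js_lock);
    assert(rc == 0);
    (void)rc;                       /* silence unused warning under NDEBUG */
    return 0;
}
```

The trade-off is the one noted in the text: a single global lock is simple and safe, but all channels contend for it whenever they evaluate scripts.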

Platform Integration

In order to get the Open VXI, with XML parser and JavaScript interpreter, to actually interact with users, we integrated it with a SpeechWorks speech and telephony platform. This platform provides speech recognition, audio recording and output, and TTS synthesis, as well as basic call control and DTMF for telephony. The platform capabilities required by the VXI are encapsulated in the Tel, Rec, and Prompt APIs, defined in vxitel.h, vxirec.h, and vxiprompt.h, respectively. In addition, these APIs define routines that are not directly called by the VXI but are meant to encapsulate initialization and cleanup functionality at various levels.

A fourth platform API is defined in vxiobj.h and provides a generic interface for calling platform specific functionality. It is used to implement the VoiceXML <object> tag.

Global Initializations

Each platform interface has a global Init() and Shutdown() call (for example, VXIrecInit() for the Rec API). These are not called from the VXI and are intended to be called by the surrounding application. If platform initialization is accomplished by other means, these routines can be ignored. Each Init() routine is passed a VXIObject that can hold any system-specific parameters. Usage of this arguments object is completely platform dependent. For example, the text-based I/O reference browser included with Open VXI gathers the command-line arguments into the arguments object for the Init() routines.

The global Shutdown() routines are meant to perform any platform-specific cleanup required for a graceful system shutdown. Again, they are not called by the VXI and can be ignored if shutdown is provided for by other means.

The JSI Interface also has Init() and Shutdown() routines. In our reference integration, the VXIjsiInit() routine must be called by the application before any calls to the VXI.
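
One way a surrounding application might organize these global calls is a table of modules that is initialized in order and shut down in reverse. The sketch below is hypothetical throughout: DemoModule, the stub routines, and the particular ordering (other than initializing the script interface before anything uses the interpreter) are assumptions of this example, not requirements of the Open VXI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*DemoInitFn)(const void *args);
typedef int (*DemoShutdownFn)(void);

typedef struct {
    const char *name;
    DemoInitFn init;
    DemoShutdownFn shutdown;
} DemoModule;

/* Trace buffer so the call order can be inspected. */
static char demo_trace[32];
static size_t demo_trace_len = 0;

static void demoTrace(char c)
{
    if (demo_trace_len + 1 < sizeof demo_trace) {
        demo_trace[demo_trace_len++] = c;
        demo_trace[demo_trace_len] = '\0';
    }
}

/* Stub modules; a real integration would call its platform here. */
static int demoJsiInit(const void *a)    { (void)a; demoTrace('J'); return 0; }
static int demoJsiShutdown(void)         { demoTrace('j'); return 0; }
static int demoRecInit(const void *a)    { (void)a; demoTrace('R'); return 0; }
static int demoRecShutdown(void)         { demoTrace('r'); return 0; }
static int demoPromptInit(const void *a) { (void)a; demoTrace('P'); return 0; }
static int demoPromptShutdown(void)      { demoTrace('p'); return 0; }

static const DemoModule demo_modules[] = {
    { "jsi",    demoJsiInit,    demoJsiShutdown },
    { "rec",    demoRecInit,    demoRecShutdown },
    { "prompt", demoPromptInit, demoPromptShutdown }
};

/* Initializes modules in order. If one fails, the modules already
   started are shut down in reverse order and -1 is returned. */
int demoPlatformInit(const DemoModule *modules, size_t count, const void *args)
{
    size_t i;
    assert(modules != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (modules[i].init(args) != 0) {
            while (i > 0) modules[--i].shutdown();
            return -1;
        }
    }
    return 0;
}

/* Shuts all modules down in reverse order of initialization. */
void demoPlatformShutdown(const DemoModule *modules, size_t count)
{
    while (count > 0) modules[--count].shutdown();
}
```

Keeping the sequence in one table makes it easy to add or drop a module when platform initialization is handled by other means.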

The Channel Object

The Channel Object is key to the integration of VXI with a speech and telephony platform, as well as OA&M and OS/Internet integration. The Channel Object provides a platform independent mechanism to store, pass around, and retrieve platform-specific data. The Channel Object is implemented as a VXIObject data type and supports the VXIObject access and modification methods.

A Channel Object is created and initialized for each conceptual voice channel. For example, in a PSTN telephony system, one Channel Object might be created for every incoming line. A VXI instance is bound to a particular Channel Object at instance creation and uses the resources on the channel to perform recognition, prompting, etc. A new VXI instance can be created for each call, or one can be created for each channel and re-used. In general, VXI instances are lightweight objects and creation/destruction times are negligible.

All platform API functions called by VXI are channel-specific and pass in the Channel Object as their first argument.

Channel Initializations

Each platform Interface has a channel-specific initialization and destruction function, named CreateResource() and DestroyResource() respectively. Like the global Init() and Shutdown() routines, these are not called directly by the VXI, but are included in the API as a convenient mechanism to set up the Channel Object, which is required by the VXI. These functions are passed a Channel Object and an arguments object. This Channel Object has been previously created and should contain some platform-specific identifier of the channel to be initialized.

Each CreateResource routine should construct any channel-specific objects required by the module implementation and place handles to these objects in the Channel Object (using VXIObjectSetProperty). There is no namespace convention enforced across platforms and each integration needs to develop its own conventions to avoid collisions. In addition, these routines should perform any channel-specific hardware or resource initialization required.

For example, an integration with the SWI 6.5 DialogModule API might have the VXIrecCreateResource() routine create the Hardware interface, Factory, and Session objects and store handles to them in the Channel Object. These channel-specific resources will then be retrieved from the Channel Object to implement platform functionality required by the API.
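
The pattern of storing channel-specific handles in the Channel Object can be sketched with a simple string-keyed property bag. The source names VXIObjectSetProperty but does not show its signature, so this example uses its own hypothetical DemoChannel type and demoChannel functions; the "rec." and "platform." key prefixes are likewise just one possible naming convention for avoiding collisions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_PROPS 16

/* Hypothetical stand-in for the Channel Object: a small property bag. */
typedef struct {
    const char *keys[DEMO_MAX_PROPS];   /* keys must outlive the channel */
    void *values[DEMO_MAX_PROPS];
    size_t count;
} DemoChannel;

int demoChannelSetProperty(DemoChannel *ch, const char *key, void *value)
{
    size_t i;
    if (ch == NULL || key == NULL) return -1;
    for (i = 0; i < ch->count; i++) {
        if (strcmp(ch->keys[i], key) == 0) { ch->values[i] = value; return 0; }
    }
    if (ch->count == DEMO_MAX_PROPS) return -1;
    ch->keys[ch->count] = key;
    ch->values[ch->count] = value;
    ch->count++;
    return 0;
}

void *demoChannelGetProperty(const DemoChannel *ch, const char *key)
{
    size_t i;
    if (ch == NULL || key == NULL) return NULL;
    assert(ch->count <= DEMO_MAX_PROPS);
    for (i = 0; i < ch->count; i++) {
        if (strcmp(ch->keys[i], key) == 0) return ch->values[i];
    }
    return NULL;
}

/* Hypothetical per-channel recognizer state. */
typedef struct { int lineNumber; int active; } DemoRecSession;

/* Reads the platform identifier already placed in the channel, sets up
   the recognizer state, and stores its handle under a prefixed key. */
int demoRecCreateResource(DemoChannel *ch, DemoRecSession *storage)
{
    int *line = demoChannelGetProperty(ch, "platform.line");
    if (line == NULL || storage == NULL) return -1;
    storage->lineNumber = *line;
    storage->active = 1;
    return demoChannelSetProperty(ch, "rec.session", storage);
}

int demoRecDestroyResource(DemoChannel *ch)
{
    DemoRecSession *s = demoChannelGetProperty(ch, "rec.session");
    if (s == NULL) return -1;
    s->active = 0;
    return demoChannelSetProperty(ch, "rec.session", NULL);
}
```

Later API calls on the same channel would look up "rec.session" to find their state, which is why each module needs keys that no other module will reuse.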

Prompt Interface

VoiceXML allows both pre-recorded and TTS synthesized prompt segments. A prompt in VoiceXML can consist of a mixed sequence of these segments. Pre-recorded audio segments are referenced as URIs, while TTS segments consist of text with optional markup. The distinction between prompts and segments is that several VoiceXML properties (counts, barge-in, etc.) apply at the prompt level, not the individual segment level.

The set of Prompt API functions called by the VXI is fairly small:

promptCreate() creates and returns a handle on a prompt-specific data structure. The VXI does nothing with this handle except pass it to subsequent calls to the Prompt API, so its meaning and implementation are entirely platform specific. Prompt-specific properties, such as barge-in and recognition timeout, are passed in and bound at this time. These properties are defined to be prompt-specific by VoiceXML.

promptDestroy() notifies the platform that the VXI will not be needing this prompt handle again and so the platform is free to clean up any prompt specific data structures it may have created.

promptAddSegment() adds a segment of prompt data to a prompt object. Audio segments can be either URIs of audio data to be played directly, or specifications of typed data (e.g., currency) to be rendered from a collection of recorded audio segments. URIs in audio segments are specified with a Fetch Object (described in the OS integration section). See the API documentation for the complete specification of arguments to promptAddSegment(). TTS segments consist of text potentially containing JSML markup.

promptQueue() queues a previously constructed prompt and returns immediately. [Note: The VXI calls promptDestroy() immediately after promptQueue() if it has no further need for a prompt.]

There is one other prompt API function, promptWait(). This blocks until all queued prompts on this channel have been played. It is not called by VXI, but is specified in the API as a convenience. The controlling application should wait for all prompts to be played before hanging up on the caller.
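
The call sequence described in this section (create, add segments, queue, destroy, and finally wait) can be sketched with a toy implementation. Everything here is hypothetical: the DemoPrompt types, the demoPrompt function signatures, and the bracketed text used to represent segments are assumptions of the example, not the vxiprompt.h API. The point it illustrates is that queueing must copy or otherwise retain the prompt data, because the prompt handle may be destroyed right after it is queued.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_TEXT 256
#define DEMO_MAX_PENDING 4

typedef enum { DEMO_SEGMENT_AUDIO_URI, DEMO_SEGMENT_TTS } DemoSegmentType;

typedef struct {
    int bargeIn;                  /* prompt-level property, bound at creation */
    char text[DEMO_MAX_TEXT];     /* flattened description of all segments */
} DemoPrompt;

typedef struct {
    char pending[DEMO_MAX_PENDING][DEMO_MAX_TEXT];
    int pendingCount;
    int played;
} DemoPromptChannel;

DemoPrompt *demoPromptCreate(int bargeIn)
{
    DemoPrompt *p = calloc(1, sizeof *p);
    if (p != NULL) p->bargeIn = bargeIn;
    return p;
}

int demoPromptAddSegment(DemoPrompt *p, DemoSegmentType type, const char *data)
{
    size_t used, avail;
    int n;
    if (p == NULL || data == NULL) return -1;
    used = strlen(p->text);
    avail = sizeof p->text - used;
    n = snprintf(p->text + used, avail, "[%s:%s]",
                 type == DEMO_SEGMENT_TTS ? "tts" : "audio", data);
    if (n < 0 || (size_t)n >= avail) {
        p->text[used] = '\0';     /* undo a truncated append */
        return -1;
    }
    return 0;
}

/* Copies the prompt into the channel queue and returns immediately,
   so the caller may destroy the prompt handle right afterwards. */
int demoPromptQueue(DemoPromptChannel *ch, const DemoPrompt *p)
{
    if (ch == NULL || p == NULL || ch->pendingCount == DEMO_MAX_PENDING) return -1;
    assert(strlen(p->text) < DEMO_MAX_TEXT);
    strcpy(ch->pending[ch->pendingCount++], p->text);
    return 0;
}

void demoPromptDestroy(DemoPrompt **p)
{
    if (p != NULL && *p != NULL) { free(*p); *p = NULL; }
}

/* Pretends to block until every queued prompt has been played. */
int demoPromptWait(DemoPromptChannel *ch)
{
    if (ch == NULL) return -1;
    ch->played += ch->pendingCount;
    ch->pendingCount = 0;
    return 0;
}
```

In this sketch the surrounding application, not the interpreter, would call demoPromptWait() before hanging up, mirroring the division of responsibility described above.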

Recognition Interface

The Recognition API is the most difficult to implement since it must perform some fairly complex grammar management if it fully implements the VoiceXML specification. The main routines called by the VXI are:

Grammar Loading and Activation

When a VoiceXML document is loaded, the VXI issues LoadGrammar() calls for all required grammars (unless the VoiceXML document forbids this). Grammars can come from several sources: DTMF and speech grammars are treated identically and the recognition interface is responsible for mapping DTMF tones into specified return values. An implementation of the interface may access the Telephony Interface through the Channel Object to accomplish this.

Each <choice> element creates a separate grammar, while the set of <option> elements in a <field> creates a single grammar.

The LoadGrammar routine merely instructs the Recognition interface to be prepared to possibly recognize using this grammar in the near future. It returns a Grammar handle to the VXI, which is used in determining which grammar was triggered by the user's utterance.

After the document's grammars are loaded, the VXI proceeds into the Form Interpretation Algorithm and deactivates some grammars and activates others by calling DeactivateGrammar() and ActivateGrammar(). The Recognition Interface must keep track of those grammars currently activated. Usually, the VXI will first deactivate all loaded grammars, then activate those it needs. Finally, the VXI issues a call to Recognize(), which starts the recognizer with the current set of activated grammars.
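
The bookkeeping this implies (track which loaded grammars are active, and report which grammar matched) can be sketched as follows. The example is a toy under stated assumptions: each "grammar" accepts a single phrase, the utterance arrives as text, and the DemoRecognizer and demoRec names and signatures are hypothetical rather than taken from vxirec.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_GRAMMARS 8

typedef struct {
    const char *phrase;   /* the single phrase this toy grammar accepts */
    int loaded;
    int active;
} DemoGrammar;

typedef struct {
    DemoGrammar grammars[DEMO_MAX_GRAMMARS];
} DemoRecognizer;

/* Prepares a grammar and returns its handle; it starts out inactive. */
DemoGrammar *demoRecLoadGrammar(DemoRecognizer *rec, const char *phrase)
{
    size_t i;
    if (rec == NULL || phrase == NULL) return NULL;
    for (i = 0; i < DEMO_MAX_GRAMMARS; i++) {
        if (!rec->grammars[i].loaded) {
            rec->grammars[i].phrase = phrase;
            rec->grammars[i].loaded = 1;
            rec->grammars[i].active = 0;
            return &rec->grammars[i];
        }
    }
    return NULL;          /* no free slot */
}

int demoRecActivateGrammar(DemoGrammar *g)
{
    if (g == NULL || !g->loaded) return -1;
    g->active = 1;
    return 0;
}

int demoRecDeactivateGrammar(DemoGrammar *g)
{
    if (g == NULL || !g->loaded) return -1;
    g->active = 0;
    return 0;
}

/* Matches the utterance against active grammars only. Returns the
   handle of the grammar that matched, or NULL when nothing matched. */
DemoGrammar *demoRecRecognize(DemoRecognizer *rec, const char *utterance)
{
    size_t i;
    if (rec == NULL || utterance == NULL) return NULL;
    for (i = 0; i < DEMO_MAX_GRAMMARS; i++) {
        DemoGrammar *g = &rec->grammars[i];
        if (!g->loaded || !g->active) continue;
        assert(g->phrase != NULL);
        if (strcmp(g->phrase, utterance) == 0) return g;
    }
    return NULL;
}
```

Returning the matching grammar's handle, rather than just the recognized text, is what lets the caller tell which of several simultaneously active grammars was triggered.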

The Recognize() function produces a VXIObject containing, in specified properties:

The fact that the Grammar handle is returned in the result object is key to the VoiceXML Form Interpretation Algorithm which supports multiple enabled grammars and grammar-specific response processing.

The recFreeGrammar function is issued by the VXI to indicate that it will not require a given grammar until another LoadGrammar call is issued. The Recognition Interface can decide on its own how and when to free actual grammar-related resources. Given the typical scenario of a multi-line service connecting to a small number of VoiceXML servers, it is likely the same grammars will be used again and again and caching in the Recognition Interface should greatly improve performance.
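
A minimal sketch of such caching keeps the compiled grammar after it is freed and only reuses its slot when space is needed. The DemoGrammarCache type, the demoCache functions, the fixed slot count, and the eviction rule are all assumptions of this illustration, not part of the Recognition API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_CACHE_SLOTS 4

typedef struct {
    char uri[128];
    int refCount;     /* loads that have not yet been freed */
    int compiled;     /* 1 once the (expensive) compile step has run */
} DemoCachedGrammar;

typedef struct {
    DemoCachedGrammar slots[DEMO_CACHE_SLOTS];
    int compileCount; /* how many times a grammar was actually compiled */
} DemoGrammarCache;

/* Returns a cached grammar when the URI was seen before; otherwise
   "compiles" it into an empty slot or, failing that, an unreferenced one. */
DemoCachedGrammar *demoCacheLoad(DemoGrammarCache *c, const char *uri)
{
    DemoCachedGrammar *victim = NULL;
    size_t i;
    if (c == NULL || uri == NULL) return NULL;
    if (strlen(uri) >= sizeof c->slots[0].uri) return NULL;

    for (i = 0; i < DEMO_CACHE_SLOTS; i++) {
        DemoCachedGrammar *s = &c->slots[i];
        assert(s->refCount >= 0);
        if (s->compiled && strcmp(s->uri, uri) == 0) {
            s->refCount++;            /* cache hit: no recompilation */
            return s;
        }
        if (s->refCount == 0 &&
            (victim == NULL || (victim->compiled && !s->compiled)))
            victim = s;               /* prefer empty slots over eviction */
    }
    if (victim == NULL) return NULL;  /* every slot is still in use */

    strcpy(victim->uri, uri);
    victim->compiled = 1;
    victim->refCount = 1;
    c->compileCount++;
    return victim;
}

/* Releases one reference but keeps the compiled grammar for reuse. */
int demoCacheFree(DemoCachedGrammar *g)
{
    if (g == NULL || g->refCount <= 0) return -1;
    g->refCount--;
    return 0;
}
```

With this arrangement, a grammar that is freed at the end of one call and loaded again at the start of the next is compiled only once.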

The recSetProperties function is straightforward and is used to set recognition properties, such as thresholds and timeouts.

The recRecord function records a segment of audio and returns it to the VXI. The VXI currently does not support the VoiceXML <record> element; it is implemented but disabled, because the ability to store and manipulate the resulting recording in ECMAscript, as required by VoiceXML, is not yet clearly specified. This requires an extension to ECMAscript which we have not yet implemented. As a consequence, the recRecord() function is currently never called in our reference implementation and the <record> element throws an event.

In addition to the functions above, there are two others not called by VXI, but nonetheless useful. recBeginSession and recEndSession can encapsulate any call-specific initialization or logging that is desired. Again, these functions may be called by the surrounding application but are not directly called from the VXI.

Telephony Interface

Most of the work in the Telephony interface is in the various initialization routines. The VXI calls only three routines: The functionality of the first is straightforward. The two transfer functions differ in whether they block (bridge) or return immediately (blind). Depending on the platform, a transfer may occupy telephony resources, such as lines, even after the VXI has returned. This must be dealt with by the surrounding application and is outside the scope of the VXI. The two transfer functions return a status code (see the API documentation for details) and some duration information.

The VXI currently does not support recognition on a bridged call (i.e., eavesdropping).

Object Interface

The Object Interface is extremely basic and, outside of initializations, consists of only one call, objectExecute(). The VXI merely collects all parameters and all arguments from attributes into respective VXIObjects and calls objectExecute(). This routine needs to locate and invoke the required platform objects. Upon completion, it returns a VXIObject containing its results, which the VXI imports into the JavaScript environment.
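
The "locate and invoke" step can be sketched as a lookup in a registry of platform objects. This is a hypothetical illustration: the DemoParam and DemoObjectResult types, the registry, the "builtin:add" identifier, and the demoObjectExecute signature are all invented for the example and do not reflect vxiobj.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const char *name; const char *value; } DemoParam;
typedef struct { long number; } DemoObjectResult;

typedef int (*DemoObjectFn)(const DemoParam *params, size_t count,
                            DemoObjectResult *out);

typedef struct { const char *classid; DemoObjectFn fn; } DemoObjectEntry;

static const char *demoFindParam(const DemoParam *params, size_t count,
                                 const char *name)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (strcmp(params[i].name, name) == 0) return params[i].value;
    }
    return NULL;
}

/* Example platform object: adds its "a" and "b" parameters. */
static int demoAddObject(const DemoParam *params, size_t count,
                         DemoObjectResult *out)
{
    const char *a = demoFindParam(params, count, "a");
    const char *b = demoFindParam(params, count, "b");
    if (a == NULL || b == NULL) return -1;
    out->number = strtol(a, NULL, 10) + strtol(b, NULL, 10);
    return 0;
}

/* Registry mapping object identifiers to their implementations. */
static const DemoObjectEntry demo_registry[] = {
    { "builtin:add", demoAddObject }
};

/* Locates the requested object and invokes it with the collected
   parameters. Returns -1 for unknown objects or failed invocations. */
int demoObjectExecute(const char *classid, const DemoParam *params,
                      size_t count, DemoObjectResult *out)
{
    size_t i;
    if (classid == NULL || out == NULL || (params == NULL && count > 0)) return -1;
    for (i = 0; i < sizeof demo_registry / sizeof demo_registry[0]; i++) {
        if (strcmp(demo_registry[i].classid, classid) == 0) {
            assert(demo_registry[i].fn != NULL);
            return demo_registry[i].fn(params, count, out);
        }
    }
    return -1;
}
```

A table-driven dispatch like this keeps the single entry point small while letting each platform add its own objects.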



VoiceXML is a trademark of the VoiceXML Forum.

Copyright 2000, 2001. SpeechWorks International, Inc. All rights reserved. Distributed under SpeechWorks Open Document License, v1.0