CMU Communicator References and Manuals
The Communicator Travel Planning system is made up of 10 separate modules. The source code for these modules is available as part of the CMU Spoken Dialog System Toolkit. This document gives an overview of each module: its purpose in the overall system and the principles on which it operates. It is not meant as detailed documentation for the individual modules; in some cases greater detail is provided in technical articles produced by the Carnegie Mellon Communicator group, as cited below.
Location:
cmu/servers/src/gentner/
Summary:
The Communicator system is session-oriented. That is, interactions
with users take place over a specific channel, over a defined interval
of time. This module is used to initiate and terminate a
session. Initiation and termination are typically under the control of
the user, who can either call the system on the telephone or click the
Start button in the Emulator application's window; the user ends the session
by hanging up or by clicking the Stop button. Occasionally the
system may decide to hang up on a particular user (if it judges that
the session has gone irreparably astray). In practical terms, starting a
session causes the decoder to start listening and a
new session log to be created. The session log includes an event-level
trace of the call and recordings of the speech produced by the
user.
If you are using the system in desktop mode (which is the default in the distribution), calls are controlled through a panel on the screen.
If you are using the system in telephony mode, you need to install additional hardware. In either mode, the audio interface is the same: the standard sound board in your computer. The additional hardware for telephony consists of an echo-cancellation device and a serial interface box that allows the computer to communicate with the echo canceller. We have been happy with the Gentner DH20 device.
Sphinx/Listener
Location:
cmu/servers/src/sphinx/
Summary:
The Listener segments the input stream into utterances and determines
whether a candidate utterance should be attended to (potentially
triggering a barge-in).
Sphinx decodes the input utterance to produce
a top-1 hypothesis and adds a confidence marker to each word in the
hypothesis. Sphinx interacts with other modules. Specifically, it
will, on barge-in, send a signal to the synthesis module to stop
speech output; it will accept language-model switching messages from
the dialog manager (the system uses state-specific language models to
improve recognition accuracy) and it will send the final decoding to
the Phoenix parser module.
Sphinx needs to be configured with a domain-appropriate set of acoustic models, lexical models and language models (these are included in the distribution). The Communicator language model is class-based, so in fact it comes in two components, a trigram language model and a set of class membership definitions. Lexical items in classes have intra-class probabilities specified. In addition there is a set of lexical models (the dictionary) together with a configuration file that specifies (e.g.) search parameters for the decoder.
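The class-based factorization described above can be illustrated with a minimal sketch. The class name, words and probabilities below are invented for illustration and are not taken from the Communicator models; the trigram table is a toy stand-in for a real language model, and no backoff is implemented.

```python
# Toy class-based trigram: P(w | h) = P(class(w) | h) * P(w | class(w)).
# All class names and probabilities are made up for illustration.

# Class membership definitions with intra-class probabilities.
CLASS_MEMBERS = {
    "[CITY]": {"boston": 0.6, "pittsburgh": 0.4},
}

# Trigram probabilities over a vocabulary in which class tokens replace members.
TRIGRAMS = {
    ("fly", "to", "[CITY]"): 0.5,
    ("fly", "to", "the"): 0.1,
}


def token_for(word):
    """Return (token, intra-class probability) for a word."""
    for cls, members in CLASS_MEMBERS.items():
        if word in members:
            return cls, members[word]
    return word, 1.0  # ordinary words stand for themselves


def word_probability(w1, w2, w3):
    """P(w3 | w1, w2) under the class-based model (no backoff in this sketch)."""
    h1, _ = token_for(w1)
    h2, _ = token_for(w2)
    tok, intra = token_for(w3)
    return TRIGRAMS.get((h1, h2, tok), 0.0) * intra


print(word_probability("fly", "to", "boston"))
```

One consequence of this design is that adding a new class member (say, a new city) only requires editing the class definitions; the trigram component does not need to be retrained.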
Phoenix
Location:
cmu/servers/src/phoenix/
Summary:
Phoenix parses the decoding it receives from Sphinx, using a semantic
grammar. Phoenix can potentially produce multiple parses. Parses
consist of hierarchical slot arrays, with the slots corresponding to
semantic entities in the domain ontology. The ontology is simple and
is best thought of as a type hierarchy. The structure of the grammar
directly mirrors the ontology and additionally includes
(domain-independent) discourse concepts, for example "yes"
and "no". Together these define the expected user language
for the domain.
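To make the idea of a parse as a hierarchical slot array concrete, here is a minimal sketch. It is not Phoenix's grammar format or output format: the slot names, the patterns and the use of regular expressions are all assumptions made for illustration.

```python
import re

# Hypothetical mini-grammar: each top-level slot has a pattern, and named
# groups become child slots. Real Phoenix grammars are far richer.
GRAMMAR = {
    "Flight": re.compile(r"\bfrom (?P<Origin>\w+) to (?P<Destination>\w+)"),
    "Yes": re.compile(r"\b(?:yes|yeah|sure)\b"),
    "No": re.compile(r"\b(?:no|nope)\b"),
}


def parse(utterance):
    """Return a list of (slot, children) pairs for every slot that matches."""
    text = utterance.lower()
    result = []
    for slot, pattern in GRAMMAR.items():
        match = pattern.search(text)
        if match:
            result.append((slot, match.groupdict()))
    return result


print(parse("Yes I want to fly from Boston to Pittsburgh"))
```

The domain slot (`Flight`) and the discourse slots (`Yes`, `No`) sit side by side in the same grammar, which mirrors the description above: together they define the language the system expects.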
Note that not all concepts used in the system need to have a corresponding expression in (spoken) language: some can be used to semantically code other events of interest to the system. These may include events in other modalities (such as pointing gestures) or asynchronous events generated in the domain (for example alarms). The Communicator system does not make use of such events but other CMU systems (e.g., LARRI) do.
A note on language development: It has been our observation that human language in well-defined domains, while at first seemingly highly variable, is for practical purposes limited: there are only so many ways to express in-domain information. Observed language will be a function of the domain, the user population and the skill with which the system provides guidance on use and manages exploratory behavior. In practical terms, a high-coverage grammar can be constructed from a combination of transcribed in-domain speech and reuse of common sub-languages such as those for dates, times and numbers. The Communicator Travel Planning grammar is based on an ATIS grammar augmented through the analysis of a Communicator corpus of approximately 20,000 utterances.
Helios
Location:
cmu/servers/src/helios
Summary:
Helios is the "post-parser". At present its role is to
assess the level of confidence for an incoming parse using information
from the decoder, parse and dialog levels of the system. The
accordingly annotated parse is then sent to the dialog manager. While
Helios makes use of learned parameters to assign confidence, there are
no domain-specific components in it.
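As an illustration of combining evidence from the decoder, parse and dialog levels into a single confidence value, the sketch below uses a logistic model. The choice of a logistic model, the feature names and the weights are all assumptions; the text above says only that Helios uses learned parameters.

```python
import math

# Hypothetical learned parameters; a real system would estimate these from data.
WEIGHTS = {
    "decoder_word_confidence": 2.0,  # decoder level
    "parse_coverage": 1.5,           # parse level: fraction of words covered
    "matches_expected_slot": 1.0,    # dialog level: 1.0 if input fits the state
}
BIAS = -2.5


def parse_confidence(features):
    """Combine features from the three levels into a score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


good = {"decoder_word_confidence": 1.0, "parse_coverage": 1.0, "matches_expected_slot": 1.0}
print(parse_confidence(good) > parse_confidence({}))
```

None of the features refers to travel-specific concepts, which is consistent with the module having no domain-specific components.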
Helios is also the locus for multi-modal integration. The Communicator system does not use multi-modal input; the LARRI system at Carnegie Mellon, based on the same architecture, incorporates a multi-modal Helios module that integrates speech and manual inputs.
Dialog Manager
Location:
cmu/servers/src/dm_server
Summary:
The Communicator Dialog Manager implements the Carnegie Mellon AGENDA
dialog manager. The module contains an execution engine, and a
handler library. The library is domain-specific and
contains individual handlers and handler (sub-)trees, both of which
are assembled into a dynamic product tree over the course of a
session. The engine additionally manages the dialog agenda,
which controls the interpretation of user inputs.
Handlers are implemented as C++ objects and incorporate logic for interpreting particular inputs, interacting with domain agents or managing child nodes in the product tree. The product is built up dynamically over the course of a session; as a consequence, the system does not follow a dialog "script" in the conventional sense. Rather, the sequence of interactions is determined by (legal) extensions to the product and by user topic-focusing behavior. The Dialog Manager focuses on the task and discourse aspects of the dialog and performs minimal domain-specific reasoning, which is primarily located in the ABE module. You can read a fuller description of the AGENDA dialog manager.
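The following sketch shows one plausible reading of how an agenda can control the interpretation of input: handlers are kept in an ordered list, and each incoming parse is offered to them in that order. It is written in Python for brevity (the actual handlers are C++ objects), and the class, method and slot names are hypothetical rather than taken from the Communicator source.

```python
class Handler:
    """A node in the product tree that can consume certain parse slots."""

    def __init__(self, name, accepts):
        self.name = name
        self.accepts = set(accepts)
        self.values = {}

    def handle(self, slots):
        """Consume the slots this handler understands; return the rest."""
        remaining = {}
        for slot, value in slots.items():
            if slot in self.accepts:
                self.values[slot] = value
            else:
                remaining[slot] = value
        return remaining


def dispatch(agenda, slots):
    """Offer the input to each handler in agenda order; return unclaimed slots."""
    for handler in agenda:
        if not slots:
            break
        slots = handler.handle(slots)
    return slots


agenda = [
    Handler("depart_date", ["Date"]),
    Handler("flight_leg", ["Origin", "Destination"]),
]
leftover = dispatch(agenda, {"Destination": "boston", "Date": "tomorrow", "Hotel": "any"})
print(leftover)
```

Because every handler on the agenda gets a chance at the input, the user can address a topic other than the one at the head of the list, which is the kind of topic-focusing behavior that a fixed script would not allow.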
DateTime
Location:
cmu/servers/src/datetime3
Summary:
DateTime interprets temporal expressions in user input and resolves
them to absolute dates and times. DateTime has knowledge of holidays
commonly observed in the United States. It does not, however, provide
full coverage for religious holidays (particularly those that require
computations based on external events). The module operates on (date
and time) fragments of the input parse and makes use of context
information (such as the current time and date) to resolve relative
expressions (e.g., "tomorrow"). The module also maps
numeric expressions.
While the DateTime module was developed specifically to cover expressions encountered in the travel domain, the date and time (and number) sub-languages appear to be largely domain-independent, so this module can be easily used as is in new domains.
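Resolving a relative expression against context can be sketched as follows. This is a minimal illustration, not the DateTime module's logic: the function name is hypothetical, only a handful of expressions are covered, and the rule that a bare weekday name means its next occurrence strictly after today is an assumption.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_date(expression, today):
    """Resolve a small set of relative date expressions against `today`."""
    text = expression.lower().strip()
    if text == "today":
        return today
    if text == "tomorrow":
        return today + timedelta(days=1)
    if text in WEEKDAYS:
        # Next occurrence of the named weekday, strictly after today.
        delta = (WEEKDAYS.index(text) - today.weekday()) % 7
        return today + timedelta(days=delta or 7)
    return None  # expression not covered by this sketch


print(resolve_date("tomorrow", date(2001, 12, 31)))
```

Passing `today` in as an argument, rather than reading the clock inside the function, keeps the context explicit and makes the behavior easy to check.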
ABE (Airline Back End) and database
Location:
cmu/systems/abe
Summary:
ABE performs a variety of domain-specific functions and is in some
sense the "application" that the dialog system interfaces to.
The functions include access to information in the system database,
retrieval of information on the web and domain-specific reasoning.
ABE interfaces to web-based resources to obtain information about flights and hotels. For flights, the information includes schedules and prices; for hotels, it includes locations, prices and availability. The information is live, although some of it is also cached for varying durations. ABE also incorporates domain-specific reasoning to deal with, for example, the resolution of ambiguous references ("Is that Portland in Maine or Portland in Oregon?") and managing solution sets (for example, ranking flights on "desirability").
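The two kinds of domain reasoning mentioned above can be sketched briefly. The city table, the function names and the definition of "desirability" (fewer stops first, then lower price) are assumptions made for illustration; the actual criteria used by ABE are not described here.

```python
# Hypothetical destination table; the real database holds about 500 destinations.
CITIES = {
    "portland": ["Portland, Maine", "Portland, Oregon"],
    "boston": ["Boston, Massachusetts"],
}


def resolve_city(name):
    """Return ('ok', city), ('ambiguous', candidates) or ('unknown', [])."""
    candidates = CITIES.get(name.lower(), [])
    if len(candidates) == 1:
        return "ok", candidates[0]
    if candidates:
        return "ambiguous", candidates
    return "unknown", []


def rank_flights(flights):
    """Order flights by an assumed desirability: fewer stops, then lower price."""
    return sorted(flights, key=lambda f: (f["stops"], f["price"]))


print(resolve_city("Portland"))
```

An `ambiguous` result is what would prompt a clarification question of the "Maine or Oregon?" kind.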
ABE interacts with the database, which contains geographical information (about 500 world-wide destinations) and information about airlines. The database also contains information about how users might refer to various entities in the domain (for example airport names) and information about how the system should in turn refer to entities when speaking to the user.
Profile
Location:
cmu/systems/profile
Summary:
The Profile module manages information about individuals known to the Travel Planning system. The user profile notes various preferences (for airlines or hotels, where to email confirmation of an itinerary, calling frequency, etc.). The information is kept in a database. The profile feature is used in the Carnegie Mellon system to manage personalization. It is disabled in the current distribution of the system, but it can be activated if desired.
Rosetta
Location:
cmu/systems/nlg
Summary:
Rosetta is the language generation module. It receives semantically
coded requests from the dialog manager and computes a corresponding
word string that can be spoken. Rosetta incorporates two generation
strategies: template-based and stochastic. Rosetta also makes use of
information in the database to obtain expressions for particular
domain entities (for example airport names). The stochastic generation
component makes use of language models built from a corpus of
transcribed travel-agent speech to generate natural language for
output expressions common to human travel agents and the dialog
system. Other output (such as greetings or error notifications) is
handled by template.
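The template strategy, together with the database lookup for entity names, can be sketched as below. The template text, the act names and the lookup table are hypothetical and only illustrate the mechanism; the stochastic strategy is not shown.

```python
# Hypothetical templates keyed by the act requested by the dialog manager.
TEMPLATES = {
    "greeting": "Welcome to the travel planning system.",
    "confirm_destination": "You want to fly to {city}, is that right?",
}

# Stand-in for the database table of spoken forms for domain entities.
SPOKEN_FORMS = {"PIT": "Pittsburgh", "BOS": "Boston"}


def generate(act, **slots):
    """Fill the template for `act`, replacing entity codes with spoken forms."""
    spoken = {key: SPOKEN_FORMS.get(value, value) for key, value in slots.items()}
    return TEMPLATES[act].format(**spoken)


print(generate("confirm_destination", city="BOS"))
```

Values with no entry in the lookup table are passed through unchanged, so free-form slot values still produce output.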
Festival
Location:
festival
Summary:
Speech synthesis is done using the Festival system in a limited-domain
mode. Festival is a concatenative synthesizer. That is, a database of
recorded human speech is used to create requested outputs by selecting
appropriate units from the database and combining these by
splicing. Limited-domain means that the database was recorded
expressly for this application and contains complete forms of
frequently encountered items (for example city names). This minimizes
the need for intra-word splicing and consequently results in
higher-quality output speech.
Process Monitor
Location:
cmu/servers/src/pmonitor-10-18-01
Summary:
The system can be (should be) started up and brought down using the Process Monitor application. The Process Monitor references a list to bring up (in sequence) the different modules in the system. It additionally monitors system processes and can restart any that have died. It will optionally email someone if it doesn't succeed in restoring the system to operational status.
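The start-in-sequence and restart-on-failure behavior can be sketched as follows. This is not the Process Monitor's code: the class name, the shape of the command list and the injectable `launch` argument are assumptions, and email notification is omitted.

```python
import subprocess


class Monitor:
    """Start modules in list order and restart any that have exited."""

    def __init__(self, commands, launch=subprocess.Popen):
        self.commands = commands  # ordered list of (name, argv) pairs
        self.launch = launch      # injectable so the logic can be tested
        self.processes = {}

    def start_all(self):
        """Bring up every module in the order given by the list."""
        for name, argv in self.commands:
            self.processes[name] = self.launch(argv)

    def check(self):
        """Restart dead modules; return the names that were restarted."""
        restarted = []
        for name, argv in self.commands:
            # poll() returns None while the process is still running.
            if self.processes[name].poll() is not None:
                self.processes[name] = self.launch(argv)
                restarted.append(name)
        return restarted
```

In use, `start_all()` would be called once and `check()` periodically; a caller could send a notification when the same module keeps reappearing in the list returned by `check()`.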
Implementation Notes
The Travel Planning system uses the Galaxy architecture for inter-module communication and for logging. The core modules of the system are implemented in C/C++ and use native Galaxy messages to communicate information. The nature of the messages and their routing is specified in the "hub" program. Some of the modules, in particular ABE, are implemented in Perl. We found it convenient to isolate these modules, so the actual implementation consists of the module itself plus a Galaxy-based proxy that manages communication between the module and the rest of the (Galaxy-based) system.
Please email all comments and feedback to Yitao Sun.