Characteristics of a user-aware speech interface

Alexander Rudnicky
Carnegie Mellon University

Speech interfaces need to support interaction with the user to a degree that allows the user to become comfortable with the system, that is, to develop a reliable model of what the system is capable of. Such a model frees attention that the user would otherwise have to dedicate to monitoring the quality of the human-machine channel. In part this means having the system behave in a manner that resembles that of a human interlocutor: the user already possesses a reasonably comprehensive model of human behavior and can therefore "predict" the machine's behavior.

The computer needs to display some degree of intelligent behavior, enough that the user can "trust" it to do the right thing under various circumstances.

Part of this capability involves being able to communicate with the user on several levels, not only in terms of the task at hand (dictation, query, command) but also on meta levels where the user has the opportunity to control the behavior of the device and to adjust relative expectations about the interaction. Humans have this capability with one another, and the resulting dynamic modulation allows them to adjust channel quality according to the state of the interaction.

This note presents a taxonomy of utterances that users might want to speak to their computer. A usable speech interface needs to display some capability in each of these categories. At the very least it needs to identify which of these states the conversation is in at a given point in time.
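As a rough illustration of the minimum requirement above (identifying which category an utterance falls into), the following sketch maps utterances to the six categories of the taxonomy. The cue phrases are taken from the examples in this note; the names `UtteranceCategory`, `CUES`, and `classify`, and the use of simple substring matching, are assumptions made for illustration and not a description of any actual system.

```python
from enum import Enum


class UtteranceCategory(Enum):
    CONTENT = "content"
    POSSIBILITY = "possibility"
    ORIENTATION = "orientation"
    NAVIGATION = "navigation"
    CONTROL = "control"
    CUSTOMIZATION = "customization"


# Hypothetical cue phrases, drawn from the example utterances in this note.
# A real system would rely on its recognizer and parser instead.
CUES = {
    UtteranceCategory.POSSIBILITY: ("what can i say", "repeat the question", "repeat the options"),
    UtteranceCategory.ORIENTATION: ("where am i", "where are we"),
    UtteranceCategory.NAVIGATION: ("back", "previous", "next section", "main menu"),
    UtteranceCategory.CONTROL: ("verbose", "terse", "start listening"),
    UtteranceCategory.CUSTOMIZATION: ("define the following word",),
}


def classify(utterance: str) -> UtteranceCategory:
    """Return the first category whose cue phrase occurs in the utterance.

    Anything that matches no meta-level cue is treated as task content.
    """
    text = utterance.lower()
    for category, cues in CUES.items():
        if any(cue in text for cue in cues):
            return category
    return UtteranceCategory.CONTENT
```

Substring matching is deliberately naive (a cue such as "back" would also match inside a longer word); the point is only that meta-level utterances are separated from task content before the task itself interprets them.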

I. Content

Speak to the task.

The user is fully aware of the task context and what the system might expect of them and they are able to formulate an utterance that falls within the current expectation of the system.

This is the "ideal" level of interaction and one that most system designs assume is the norm in speech interaction. In practice, this is only true of well-practiced tasks and for times at which the system is performing accurately. (That is, times when the user's model of the interaction correctly matches reality.)

II. Possibility

"Computer, what can I say at this point in the task?"

"Computer, repeat the question, the options"

In complex (or unfamiliar) tasks users do not always know what they can do or, more to the point, say. Reasons for this are varied: the user may be genuinely confused about what is possible in a particular context, or they may not be confident that the system will respond (appropriately and without negative cost) to their particular input (typically because the system is perceived as brittle).

Thus, the user should be able to pose a meta-question to the system, in essence asking about the possibilities of discourse at that point.

A sophisticated system can be proactive: if it detects that the user is having problems choosing what to say, it can intervene and offer some possibilities.

The user may be aware in general terms of what the possibilities are in a particular context (that is, they understand the task state they are in) but may have forgotten exactly what they are dealing with (for example, in a data entry task, the range of categories that they are being asked to assign to an observed event). The system should be sufficiently aware of task context (and possibly user state) to be able to offer constructive guidance to the user.
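The two behaviors described in this section (answering a meta-question about what can be said, and intervening when the user appears stuck) might be sketched as follows. The function names, the state name, the category labels, and the thresholds for silence and failed attempts are all hypothetical; how a real system detects that the user is having problems is left open here.

```python
def offer_options(state: str, options_by_state: dict[str, list[str]]) -> str:
    """Answer "what can I say?" for the current task state."""
    options = options_by_state.get(state, [])
    if not options:
        return "I have no suggestions for this point in the task."
    return "You can say: " + ", ".join(options) + "."


def should_intervene(seconds_silent: float, failed_attempts: int,
                     max_silence: float = 5.0, max_failures: int = 2) -> bool:
    """Assumed heuristic: offer help after a long pause or repeated failures."""
    return seconds_silent >= max_silence or failed_attempts >= max_failures


# Hypothetical data entry task: categories assignable to an observed event.
options_by_state = {"assign category": ["feeding", "resting", "flying"]}

if should_intervene(seconds_silent=6.0, failed_attempts=0):
    print(offer_options("assign category", options_by_state))
```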

III. Orientation

"Computer, where am I?", "where are we?"

The user may lose track of their location within the hyperspace of the task. They should be able to query the system as to their location. Location is both absolute (in terms of whatever hierarchy is imposed on the task) and situated within the history of the current session with the system. If the task is well-structured (e.g., form-filling), such information should be easy to come by. If the user is engaged in problem-solving of some sort, this may be more difficult to do.
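For a well-structured task, an answer to "where am I?" can combine both kinds of location. The sketch below assumes the system keeps a path through the task hierarchy and a list of states visited in the current session; the function name and the form-filling states are illustrative only.

```python
def describe_location(path: list[str], history: list[str]) -> str:
    """Report absolute position (hierarchy path) and the previous session state."""
    absolute = " > ".join(path)
    if len(history) > 1:
        # history is ordered oldest first; the last entry is the current state
        return f"You are at {absolute}. You came here from {history[-2]}."
    return f"You are at {absolute}."


print(describe_location(
    path=["main menu", "travel form", "departure date"],
    history=["main menu", "travel form", "destination", "departure date"],
))
```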

IV. Navigation

Move through the hyperspace: back, previous, next section, main menu, etc.

If the task space is organized in terms of a hypertext, the user needs to have some way of navigating the space; that is, transitioning to another state. The next state can either be relative to the current one (previous, next) or can be described in terms of its meaning (main menu, exit) in the context of the task. The exact definition of a proper response to this type of query depends on the degree of structure in the task or perhaps on the availability of such facilities as undo within the task.
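The distinction between relative and named transitions can be illustrated with a small sketch. It assumes the simplest possible task structure, a linear sequence of states, which is an assumption made here for brevity; a real hypertext would need a graph and a visit history. The class name `Navigator` and the state names are hypothetical.

```python
class Navigator:
    """Tracks the current state in an assumed linear sequence of task states."""

    def __init__(self, states: list[str], start: int = 0) -> None:
        self.states = list(states)
        self.index = start

    def go(self, command: str) -> str:
        if command == "next":
            # relative move, clamped at the last state
            self.index = min(self.index + 1, len(self.states) - 1)
        elif command == "previous":
            # relative move, clamped at the first state
            self.index = max(self.index - 1, 0)
        elif command in self.states:
            # named move: the target is identified by its meaning in the task
            self.index = self.states.index(command)
        else:
            raise ValueError(f"unknown navigation command: {command!r}")
        return self.states[self.index]


nav = Navigator(["main menu", "section 1", "section 2", "exit"])
nav.go("next")       # relative: moves to "section 1"
nav.go("main menu")  # named: returns to "main menu"
```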

V. Control

"Computer: verbose, terse"

"Computer, start listening"

The user needs to be able to control various aspects of system behavior, for example the nature and quality of the feedback that the system provides, or whether the system should be attending to the user in a particular instance.
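A minimal sketch of such control commands follows, using the example utterances above. The settings structure, its field names, and the "stop listening" command (assumed here as the natural counterpart of "start listening") are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class InterfaceSettings:
    verbose: bool = True     # how much feedback the system provides
    listening: bool = False  # whether the system is attending to the user


def apply_control(settings: InterfaceSettings, command: str) -> bool:
    """Apply a control command; return False if it is not a control command."""
    command = command.lower().strip()
    if command == "verbose":
        settings.verbose = True
    elif command == "terse":
        settings.verbose = False
    elif command == "start listening":
        settings.listening = True
    elif command == "stop listening":  # assumed counterpart, not from the examples
        settings.listening = False
    else:
        return False
    return True


settings = InterfaceSettings()
apply_control(settings, "terse")
apply_control(settings, "start listening")
```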

VI. Customization

"Computer, define the following word"

In a flexible system, the user should be able to modify the behavior of the system on the fly, for example by establishing a new word in dictation, or by entering a verbal shortcut for a sequence of operations or a personal synonym for a non-memorable operation (i.e., a macro). Such a capability may not be desirable in all circumstances, say when a particular system can be used by many different users (unless the user group willingly enlists in the process of augmenting the domain knowledge base). It does make sense for systems with which individuals develop long-term experience, such as a personal dictation system.

Customization is different from adaptation, which constitutes (or should constitute) a passive modification of system capabilities on the basis of the system's experience with a particular user or users.

Customization implies a modal interaction with the user, the purpose of which is to obtain certain well-structured information from the user and to accurately integrate it into the system's knowledge base.
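One way to picture the verbal-shortcut case is a small table of user-defined macros that is checked before anything is added to it, so that only well-structured definitions enter the system's knowledge. The class name `Shortcuts`, its methods, and the operation names are hypothetical; the sketch leaves out the dialog needed to elicit the definition from the user.

```python
class Shortcuts:
    """User-defined spoken shortcuts that expand to sequences of known operations."""

    def __init__(self, known_operations: set[str]) -> None:
        self.known = set(known_operations)
        self.macros: dict[str, list[str]] = {}

    def define(self, name: str, operations: list[str]) -> None:
        """Add a shortcut, rejecting empty names, empty bodies, and unknown operations."""
        key = name.strip().lower()
        unknown = [op for op in operations if op not in self.known]
        if not key or not operations or unknown:
            raise ValueError(f"cannot define {name!r}; unknown operations: {unknown}")
        self.macros[key] = list(operations)

    def expand(self, spoken: str) -> list[str]:
        """Return the operations for a shortcut, or the utterance itself if none is defined."""
        return self.macros.get(spoken.strip().lower(), [spoken])


shortcuts = Shortcuts({"open file", "save file", "close file"})
shortcuts.define("wrap up", ["save file", "close file"])
shortcuts.expand("wrap up")  # the two operations, in order
```

Keeping the table per user reflects the distinction drawn above: the change is made explicitly by one individual and does not alter the system for others.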