To provide speech understanding services, we developed a backward-compatible extension to HTML that facilitates the incorporation of language-specific information into hypertext documents. This approach is somewhat different from the one commonly chosen by others [1, 6, 10, 5], which is to store such information in data structures that parallel the browser's internal representation of the information on a hypertext page. Parallel structures are workable in cases where speech is meant primarily to support navigation (i.e., "following links"). However, we were also interested in using native HTML data entry conventions, in particular the FORM construct, to capture inspection data in a manner that could take advantage of existing browser mechanisms.
Our extensions to the mark-up language allow direct association of grammar fragments with HTML clauses, specifically anchors and actions inside FORMs. The grammar information is extracted by the speech-aware version of Mosaic (TESSERA) and is merged into a generic browsing language that allows for voice input of display manipulation commands (such as scrolling or traversing the history list). No attempt was made to allow voice control of every aspect of the interface, as most were not relevant to the task at hand.
When the browser receives a speech-enabled page, it parses the page in its normal fashion. The Grammar Builder component then traverses the parse tree and extracts information from any GRAMMAR fields. These fields are used to dynamically create a grammar fragment that encompasses all speakable items on the page. This partial grammar is then merged with the statically defined browser grammar to produce the active grammar for that page. The active grammar is made available to the PHOENIX parser and is also used to derive a bigram language model for the benefit of the decoder. Since the domain language is known beforehand, pronunciations for words can be compiled off-line for efficiency, though they could instead be obtained as needed from a server (an alternative solution that we have also implemented).
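The extract-and-merge pipeline described above can be illustrated with a minimal Python sketch. It is not the TESSERA implementation: it assumes a hypothetical GRAMMAR attribute whose value lists alternative phrases separated by `|`, treats the grammar as a flat list of phrases rather than a PHOENIX grammar, and uses invented names (`GrammarExtractor`, `BROWSER_PHRASES`, `active_phrases`, `bigram_counts`) and invented example phrases.

```python
from collections import Counter
from html.parser import HTMLParser


class GrammarExtractor(HTMLParser):
    """Collect phrases from hypothetical GRAMMAR attributes on any tag."""

    def __init__(self):
        super().__init__()
        self.phrases = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases attribute names, so GRAMMAR arrives as "grammar".
        for name, value in attrs:
            if name == "grammar" and value:
                # Assumed convention: alternative locutions are separated by "|".
                for phrase in value.split("|"):
                    phrase = phrase.strip().lower()
                    if phrase:
                        self.phrases.append(phrase)


# Stand-in for the statically defined browser grammar (illustrative phrases only).
BROWSER_PHRASES = ["scroll down", "scroll up", "go back"]


def active_phrases(page_html):
    """Merge the page-specific phrases with the static browser phrases."""
    extractor = GrammarExtractor()
    extractor.feed(page_html)
    extractor.close()
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(BROWSER_PHRASES + extractor.phrases))


def bigram_counts(phrases):
    """Count word bigrams over the phrases, with sentence boundary markers."""
    counts = Counter()
    for phrase in phrases:
        words = ["<s>"] + phrase.split() + ["</s>"]
        counts.update(zip(words, words[1:]))
    return counts


page = '<a href="brakes.html" GRAMMAR="brakes | brake system">Brakes</a>'
phrases = active_phrases(page)
counts = bigram_counts(phrases)
```

In this sketch the raw bigram counts stand in for the language model; a real decoder would need smoothed probabilities estimated from them.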
Initial GRAMMAR clauses were generated by automatic conditioning of the task hypertext. Where advisable, alternative locutions were generated, as in the example in Figure 2. For the most part, the language was generated automatically from the actual text of the inspection form. Only in the case of free-form inputs was a prespecified grammar used (see Figure 3, which also shows the use of non-terminals built into the language component).
Figure 2: Augmented link HTML used in SpeechWear.
Figure 3: Augmented FORM HTML used in SpeechWear.
While automatic processing is used to initially populate a document with language information, manual additions can also be made to a GRAMMAR clause to reflect arbitrary usage encountered in the field. By this means, the hypertext document can be updated to better approximate the language of the user population.
The above solution is not completely satisfactory, as it requires modification of Web pages to make them "speakable". This is somewhat mitigated by the fact that pages can be automatically preprocessed to include the necessary information. In principle, such processing could be done at the time of page retrieval, so that a document could be modified without having to be preprocessed again to include speech information. Such an organization would also allow for unrestricted navigation of documents available over the World Wide Web. In the environment we are considering, this would be of benefit, as it would allow the user in the field to consult a variety of sources, such as centrally maintained documentation or even specifications published by manufacturers, not all of which would (or should) be expected to have been preprocessed for the benefit of the speech-based user.
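As a rough illustration of processing at retrieval time, the following Python sketch derives a speakable phrase for each link directly from its anchor text, so that an unannotated page could still be navigated by voice. The class and function names (`LinkTextCollector`, `speakable`, `link_phrases`) are hypothetical, and the normalization is deliberately crude; in particular, digits are left as they are rather than spelled out.

```python
import re
from html.parser import HTMLParser


def speakable(text):
    """Lower-case the text and keep only letters, digits and apostrophes."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


class LinkTextCollector(HTMLParser):
    """Collect (href, phrase) pairs from the anchors of an unannotated page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            phrase = speakable("".join(self._text))
            if phrase:
                self.links.append((self._href, phrase))
            self._href = None


def link_phrases(page_html):
    """Return the list of (href, speakable phrase) pairs found on the page."""
    collector = LinkTextCollector()
    collector.feed(page_html)
    collector.close()
    return collector.links
```

A browser following this scheme would add each phrase to the active grammar and follow the associated link when the phrase is recognized; alternative locutions of the kind shown in Figure 2 would still have to be supplied by other means.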
To allow complete automation, three operations need to be available: conditioning text into speakable form (e.g., transforming 35 into thirty five), establishing pronunciations for the resulting words, and creating a suitable language model for the decoder. Such a protocol would be sufficient to support most forms of navigation, but might not be adequate for specifying language for certain FORM elements which (for efficiency) might benefit from manual specification, as in the example above (a large vocabulary language could always be attached implicitly to an input field). Presumably workable solutions could be developed for specific applications once the details are known.
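The first of these operations, text conditioning, can be sketched in a few lines of Python. This is only an illustration of the 35 to thirty five example: the function names (`number_to_words`, `condition`) are hypothetical, only whole numbers from 0 to 999 are handled, and decimals, ordinals, units and abbreviations are ignored.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def condition(text):
    """Replace each standalone run of one to three digits with its spoken form."""
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())),
                  text)
```

For instance, `condition("check 35 bolts")` yields `check thirty five bolts`. Longer digit strings are left unchanged, since how to read them (as a quantity, a part number, or a year) depends on the application.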