School of Computer Science
Carnegie Mellon University
Pittsburgh PA 15213
This paper describes a new speech system evaluation process centered on the decoding of broadcast news. Hub 4 represents a departure from traditional ARPA performance evaluation. Most notably, it moves beyond an evaluation protocol in which both the composition of the test materials and the materials that participants are allowed to process in preparation for the evaluation are closely controlled. The resulting paradigm allows participants to more freely explore the space of possible solutions, while focusing attention on problems more representative of potential applications.
Traditionally, ARPA has been interested in detailed evaluation of speech recognition technology that allows it, as well as the wider speech community, to understand how new techniques improve our ability to decode speech. Since the focus was on understanding the evolution of techniques, effort was expended on ensuring that systems were compared on an equal footing. In addition to fixing the obvious aspects of evaluation (such as the test set and the procedures for generating and submitting evaluation results), the evaluation process fixed the inputs into systems, specifically the acoustic training data and (for a time) the language models. Evaluation design decisions reflected community consensus on what aspects of system implementation were of greatest interest in developing core recognition technology [Kubala et al., 1994].
More recently, interest has shifted to other aspects of performance, those that underlie the adaptability of the technology to different domains. The desire was to push performance along the dimensions of building systems that can be rapidly deployed to new domains, that do not rely crucially on control of the circumstances of signal acquisition, and that are able to process speech under time-bounded circumstances. One outcome of this shift was the development of the Hub 4 protocol.
The purpose of this paper is to document the process by which Hub 4 was defined, to explain how various design decisions were made, and to show how these decisions supported the goals of this evaluation.
2. Designing a new evaluation
Hub 4 was designed with the help of a working group, whose members are listed in Table 1. Since this was to be a new evaluation featuring a new paradigm that represented a departure from previous practice, an effort was made to minimize the degree to which the design reflected preconceptions about the structure of the task. An effort was also made to have the evaluation task modify as little as possible the properties of the evaluation domain.
The specifications for the evaluation were developed fairly rapidly (see Table 2) and were frozen within two months of the initiation of the process. The process was expedited by preliminary work that had been done by others to identify a suitable source of broadcast material (Marketplace), obtain the necessary permissions for use of the material, and collect a sample of data (10 programs), which then became the training and development sets for the evaluation.
Table 1: Hub 4 working group
Alexander Rudnicky (Carnegie Mellon), chair
Nina Yuan, Francis Kubala (BBN)
Stephen Wegmann (Dragon Systems)
Ponani Gopalakrishnan (IBM)
Mike Hochberg, Gary Cook (Univ. of Cambridge)
Ralph Grishman (New York University)
Esther Levin (AT&T Bell Labs)
George Doddington (U.S. Govt), consultant
Dave Pallett (NIST), consultant
Table 2: Hub 4 evaluation timeline
May 15: Hub 4 announced; working group formed.
May 30: Working group holds first conference call.
July 19: Final specification agreed to.
November 1: Evaluation set distributed.
November 13: End of submission period.
2.1. Specification for Hub 4
To better understand why this evaluation was structured as it was, we will examine the specification in detail and comment on the reasons behind the various design choices. The text of the specification is presented in italics, followed by commentary.
The purpose of Hub 4 is to encourage research into those aspects of speech recognition technology that make for nimbleness, the ability of systems to adapt to varying conditions of input, whether in acoustic characteristics or content. Equally, Hub 4 will focus on the problems of processing "found speech", that is, speech materials which have not been created specifically for the purpose of speech system development or evaluation. In other words, Hub 4 is meant to promote research on ADAPTATION to changing conditions and on ROBUSTNESS with respect to degradation.
Hub 4 introduces two ideas, "nimbleness" and "found speech", that distinguish it from previously evaluated domains. Nimbleness is encouraged by not providing participants with extensive samples of data from the target domain (in this case Marketplace broadcasts). The intent was to encourage the development of systems that perform well despite a lack of specific information about a domain. This was somewhat eased by allowing participants access to similar material ("business news"), but restricting access to Marketplace itself. A secondary consideration was the desire to ensure continuity with the domain used in previous evaluation protocols (i.e., Hubs 1 and 2), which focused on Wall Street Journal and newswire sources, both text. In practice, participants made extensive use of the development data provided (10 shows) and attempted to tune their systems to the characteristics of this sample. In this sense, nimbleness was not achieved by the systems submitted to the 1995 evaluation. Interestingly, most sites adopted the (common-sense) strategy of partitioning the data according to major acoustic characteristics (e.g., studio speech, reduced-bandwidth channel, speech superimposed on music, etc.), then making use of different decoding strategies for each.
The second key attribute of this evaluation is the use of found speech. Traditionally, speech systems have been evaluated using speech materials created specifically (and exclusively) for purposes of evaluation. Material was sampled from a known pool of utterances, then speakers (themselves sampled from a specified population) recorded these utterances with the requirement that the rendition be fluent and accurate. Insofar as this permitted accurate evaluation of recognition technology without interference from extraneous factors, it was a success. At the same time this process shielded systems from a variety of phenomena that occur in the spontaneous production of speech by speakers sampled from a general population2. Ultimately, applications of this technology were more likely to involve the latter type of speaker.
In terms of traditional categories of approach, Hub 4 focused on the ability of systems to adapt to a domain on different levels (i.e., acoustic and language) and on their robustness, that is, the ability to minimize variance in performance as a function of input attributes.
2.1.1. Test Set Composition
The test set should consist of approximately 90 minutes of Marketplace material sampled from the period 1-31 August 1995. The materials will be obtained by NIST directly from the producers (KUSC) and should be comparable in acoustic quality to the archival materials made available for development purposes. The test set will consist of two parts: 1) a complete show; 2) a selection of segments (two "heads" and two "tails"), chosen according to the following criteria:
1. No segment should come from the show selected as the "complete broadcast" portion of the test set.
2. Each segment will come from a different show and will be sampled randomly from the available data.
3. A "head" is a contiguous segment taken from just after a show's introduction (i.e., starting with the first story) and continuing to the end of the last story before the market report (i.e., that story introduced as "the numbers"). Story is defined as in NIST's transcription document for this domain.
4. A "tail" is a contiguous segment taken from the start of the first story immediately following "the numbers" and continuing to the end of the last story in the show (thereby dropping the closing segment of a show).
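The selection criteria above amount to a small sampling procedure. The sketch below illustrates it, under the simplifying (and purely illustrative) assumption that a show is represented as an ordered list of story labels with a "numbers" marker standing in for the market report; none of the data structures or names here come from the actual evaluation tooling.

```python
import random

def head(show):
    """Stories from the first story up to (not including) 'the numbers'."""
    idx = show.index("numbers")
    return show[:idx]

def tail(show):
    """Stories from just after 'the numbers' to the last story; in this
    model the closing segment is not a story, so it is already dropped."""
    idx = show.index("numbers")
    return show[idx + 1:]

def sample_segments(shows, complete_show, rng=None):
    """Pick two heads and two tails, each from a different show, never
    from the show used as the complete-broadcast portion of the test set."""
    rng = rng or random.Random(0)
    pool = [s for s in shows if s is not complete_show]  # criterion 1
    chosen = rng.sample(pool, 4)                         # criterion 2: distinct shows
    return [head(s) for s in chosen[:2]] + [tail(s) for s in chosen[2:]]
```

The identity check against `complete_show` enforces criterion 1, and drawing four distinct shows from the remaining pool enforces criterion 2.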
There is always a temptation to redefine problems in terms that will provide information that is of maximal use to system developers. For example, for a given amount of evaluation data, maximally useful information will be provided if the data are split equally between all conditions of interest (e.g., noisy speech, clean speech, etc.). This is of course the correct strategy; unfortunately, it suffers from the problem that it requires us to define a priori what the interesting dimensions of performance might be, and in the process to miss other, possibly more important, problems. The Hub 4 test set composition sought to bypass this problem by considering entire broadcasts to be the unit of evaluation (early suggestions proposed that the set be composed of defined samples of different types of speech, to ensure that meaningful samples of each type were available). Despite this, a compromise was made in the interest of increasing the sample over different broadcasts, to minimize any impact of unusual show characteristics. This was achieved by splitting shows in half and eliminating certain Marketplace boilerplate material. Ideally, the test set would have been composed of (say) five different complete broadcasts, to better sample Marketplace (which indeed displays periodic variation on a weekly cycle). In the end, considerations of limiting the total amount of material that sites needed to process for the evaluation were acknowledged and the current composition was chosen.
2.1.2. Development data
NIST will make available to all prospective participants the following materials for development purposes:
1. Ten complete Marketplace shows, together with a transcript. (The "training" data.) The transcript will follow the specifications developed and distributed by NIST (transpec.doc). The data will be sampled from the period predating May 15, 1994.
2. Two development test sets composed according to the criteria described for the evaluation test set, but sampled from the development test epoch, 15 May--15 June 1994.
The intent was to limit the amount of domain-specific data, so as to discourage approaches that relied on training or tuning models to the data. In reality most sites performed some type of training using the data provided. In retrospect this was a reasonable strategy to pursue; it is difficult to conceive of more effective alternatives. Nevertheless, the development of such a capability would seem crucial to the long-term utility of speech recognition technology.
It is understood that sites will not independently acquire or use domain material (i.e., Marketplace broadcasts). Beyond this restriction, sites are free to make use of whatever materials they deem best suited for improving performance on this task. As part of such material, NIST will arrange to have LDC make available, through means deemed acceptable to NIST and this committee, ongoing access to a newswire feed, with no more than a two-week lag. Archived material covering the period from spring 1994 to the present will also be made available, to ensure temporal continuity between previously published materials (i.e., CSRNAB) and the present evaluation period.
In past evaluations, the training materials provided to evaluating sites were tightly controlled by restricting the sanctioned material to a common corpus that was made available to all participants and by further requiring sites to make available to all any additional materials that they might choose to use. For Hub 4 we attempted to minimize this control by allowing sites to make use of whatever material they deemed appropriate. The only restriction was that sites could not make use of material that could not, in principle, be obtained by others. Nevertheless, a basic source of text data was provided to establish a baseline language training corpus. In practice, most sites relied primarily on this source, though they might have explored alternate sources such as broadcast transcriptions commercially available through such vendors as Journal Graphics, Inc.
Since Hub 4 was an entirely new form of evaluation, allowing sites to freely explore all possible sources of useful information was the best way to understand what would work best for this class of domain. Certainly it seemed counterproductive to limit the scope of this material a priori.
2.1.3. Conditions of evaluation
There is one primary condition ("P"), with any approach allowed, including running a decoder in transcription mode. There is no side information available, other than show boundary (for part 1 data) and segment boundary (for part 2 data). This implies that techniques such as supervised adaptation will not be allowed.
Sites are encouraged to report results for contrasts that they deem will best quantify the contribution of the approach incorporated into their primary system. However, no official contrasts are specified nor are any required by the evaluation protocol.
Previous evaluation protocols included a variety of conditions (in addition to the primary "do the best you can"). The purpose of the additional so-called "contrast" conditions was to allow participants to assess the contribution to performance of various system components. In essence, evaluations were conducted as formal experiments, each with a set of jointly agreed-upon conditions. Such an approach is highly productive in an environment where the nature of the problem is well understood and participants agree on which dimensions of performance are most diagnostic of the efficacy of different design choices. In the present case, given that the problem was new and that no existing understanding of the problem had been articulated by the community, the best course was to allow sites to decide on their own which contrasts would be most informative. In practice, most sites reported on contrasts that reflected their particular interests or areas of research.
Sites will generate decodings that include word time alignments, so that the LVCSR scoring algorithm can be used for this evaluation. Word error will be the primary metric. Upon request, NIST will make scoring software available in a timely manner to participants.
An existing scoring approach was adapted to the current evaluation, as the necessary tools and conventions had already been worked out. Not mentioned in the specification was the use of another mechanism, adjudication, a process by which sites were allowed to challenge NIST's scoring on particular utterances. Adjudication catches transcription errors as well as ambiguities in the transcription conventions. In practice, it serves to "improve" the scores that systems can report. In contrast to previous evaluations, little adjudication took place for Hub 4.
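Word error, the primary metric, is the standard edit-distance measure over word tokens: substitutions, deletions, and insertions divided by the number of reference words. The following is a minimal sketch of that computation, not NIST's actual LVCSR scoring software (which also handles time alignments and transcription conventions):

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via Levenshtein distance over word tokens:
    (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that because errors are divided by reference length, insertions can push the rate above 100%, which is why word error is reported as an error rate rather than an accuracy.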
Separate scores will be computed for each part of the evaluation set and will always be clearly separated in all accounts of the evaluation. NIST will tabulate for each part both a global score and a breakdown according to categories deemed of interest to the community. Categories of interest will be specified as we acquire experience with the domain. NIST will choose an initial list of such categories, based on its own experience with the domain and in consultation with the H4 committee.
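The tabulation described above reduces to pooling error counts globally and within each category. A minimal sketch follows; the category labels and the `(category, errors, reference_words)` record shape are illustrative assumptions, not NIST's actual reporting format:

```python
from collections import defaultdict

def tabulate(scored_segments):
    """scored_segments: iterable of (category, errors, reference_words).
    Returns (global word-error rate, {category: word-error rate})."""
    totals = defaultdict(lambda: [0, 0])  # category -> [errors, words]
    for category, errors, words in scored_segments:
        totals[category][0] += errors
        totals[category][1] += words
    by_category = {c: e / w for c, (e, w) in totals.items()}
    all_errors = sum(e for e, _ in totals.values())
    all_words = sum(w for _, w in totals.values())
    return all_errors / all_words, by_category
```

Pooling counts before dividing (rather than averaging per-segment rates) weights each category's score by the amount of speech it contains, which is the usual convention for word-error reporting.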
Distribution of the results of the evaluation will be governed by the following understanding:
None of the test results for this evaluation may be published, quoted, or otherwise disseminated without explicit permission of the concerned site(s).
As Hub 4 was a new form of evaluation, no clear sense of how data should be reported had developed at the time of specification. The final approach was to partition the data along reasonable dimensions, such as channel characteristics, speaker characteristics, and individual speakers [Pallett et al, 1996, this volume]. At this point, it appears that some of the partitioning dimensions used in reporting Hub 4 data will become the basis for allowing sites to selectively participate in the evaluations (say, by choosing to evaluate on only one category of speech).
An informal attempt was also made by sites to report on the computational resources consumed by the evaluation, though the heterogeneity of devices and procedures made it difficult to evaluate these data.
Hub 4 represents a break with previous evaluation methodology. In a sense it represents the same kind of movement that took place when the field moved from the 1000-word Resource Management task to the moderate, then large, vocabulary versions of the Wall Street Journal task. Each step represented a progressive loosening of the constraints on the evaluation task and brought into focus a new set of research issues. Broadcast News promises to do the same.
Kubala, F., Bellegarda, J., Cohen, J., Pallett, D., Paul, D., Phillips, M., Rajasekaran, R., Richardson, F., Riley, M., Rosenfeld, R., Roth, B., Weintraub, M. The hub and spoke paradigm for CSR evaluation. Proceedings of the 1994 Spoken Language Technology Workshop. San Francisco: Morgan Kaufmann, 1994, pp. 9-14.
Pallett, D., Fiscus, J., Garofolo, J., Przybocki, M. 1995 Hub-4 "dry run" broadcast materials benchmark tests. Proceedings of the 1996 Speech Recognition Workshop, Arden House, 1996.
2. A separate evaluation activity, centered on spoken language systems, did attempt to address this issue by designing a protocol in which speakers were allowed to generate their own utterances, constrained only by an initial task goal ("scenario").