A ToBI transcription for an utterance consists minimally of a recording of the speech, an associated record of the fundamental frequency contour, and (the transcription proper) symbolic labels for events on the following four parallel tiers:
an orthographic tier
a break index tier
a tonal tier
a miscellaneous tier
Conventions are specified both for simple text-based transcription using this system and for waves(tm) label files and formats that accompany a speech file and the associated time-aligned analysis records for the utterance. We first summarize the conventions assuming a computer-based labelling system that uses waves(tm) label files and formats. A final section (Section 9), provided by Jacques Terken and Mari Ostendorf, summarizes the added guidelines for adapting the conventions to simple text-based transcription.
The Orthographic Tier
The orthographic tier will be used only for the transcription of orthographic words. In the waves(tm) label file, each word's orthographic form should be marked at the end of the final segment in the word, as determined by the labeller from the waveform or spectrogram record. That is, each orthographic word will be marked at its right `edge'. Individual transcribers will also determine whether and how to transcribe phenomena such as filled pauses (e.g., ``um'', ``uh'') and whether or not to use contractions (e.g., ``gotta''). There are several existing orthographic conventions for transcribing such phenomena, which labellers may want to consult. For example, the ATIS corpus conventions specify ``er'', ``mm'', ``uh'', and ``um'' as the allowable transcriptions for filled pauses.
The Break Index Tier
Break indices represent a rating for the degree of juncture perceived between each pair of words and between the final word and the silence at the end of the utterance. They are to be marked after all words that have been transcribed in the orthographic tier. All junctures -- including those after fragments and filled pauses -- must be assigned an explicit break index value; there is no default juncture type.
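Values for the break index are chosen from the following set:
0 -- clear phonetic marks of close juncture between words; e.g. palatalization or flapping across the word boundary.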
1 -- most phrase-medial word boundaries.
2 -- a strong disjuncture marked by a pause or virtual pause, but with no tonal marks; i.e. a well-formed tune continues across the juncture -- OR -- a disjuncture that is weaker than expected at what is tonally a clear intermediate or full intonation phrase boundary.
3 -- intermediate intonation phrase boundary; i.e. marked by a single phrase tone affecting the region from the last pitch accent to the boundary.
4 -- full intonation phrase boundary; i.e. marked by a final boundary tone after the last phrase tone.
For example, a typical fluent utterance of the following sentence:
Did you want an example?
might have a `0' between `Did' and `you' indicating palatalization of the /d j/ sequence across the boundary between these words. Similarly, the break index value between `want' and `an' might again be `0' indicating deletion of /t/ and subsequent flapping of /n/. The remaining break index values would probably be `1' between `you' and `want' and between `an' and `example', indicating the presence of a mere word boundary, and `4' at the end of the utterance, indicating the end of a well-formed intonation phrase.
In the waves(tm) break index label file, the number should be associated with a point in time at the end of each word, as indicated in the orthographic tier (Section 2). It should be located exactly at, or slightly to the right of, this word marker, so that break indices can be unambiguously associated with other tiers.
Transcriber uncertainty about break-index strength is to be indicated with a minus (`-') affixed directly to the right of the break index (e.g. `1-' to indicate uncertainty between `0' and `1'; `2-' to indicate uncertainty between `2' and `1'; and so on).
The full ToBI transcription must include both break index values and tone values. However, to accommodate backward compatibility with previously labelled databases or to allow intermediate stages in the labelling process, a partial ToBI transcription may have only break index values or only tone values assigned. Underspecification of break index values may be indicated by a value of `X' at the word boundary in the break index tier.
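The label set described in this section can be checked mechanically. The following is a minimal Python sketch, not part of the ToBI conventions themselves; the function names are hypothetical, and it assumes that the uncertainty mark `-' is only meaningful on values `1' through `4', since the text defines it as uncertainty relative to the next lower value.

```python
import re

# Labels named in the text: 0-4, an optional trailing '-' for uncertainty,
# and 'X' for an underspecified value.
# Assumption: '0-' is rejected because there is no lower value to be unsure about.
BREAK_INDEX_RE = re.compile(r"0|[1-4]-?|X")


def parse_break_index(label):
    """Return (value, uncertain); value is None for the underspecified 'X'."""
    text = label.strip()
    if BREAK_INDEX_RE.fullmatch(text) is None:
        raise ValueError(f"not a break index label: {label!r}")
    if text == "X":
        return (None, False)
    return (int(text[0]), text.endswith("-"))


def check_break_tier(words, break_labels):
    """Pair each word with its break index; there is no default juncture type."""
    if len(words) != len(break_labels):
        raise ValueError("every transcribed word needs an explicit break index")
    return [(w,) + parse_break_index(b) for w, b in zip(words, break_labels)]


# The fluent rendition discussed above: 0 after 'Did', 1 after 'you',
# 0 after 'want', 1 after 'an', and 4 at the end of the utterance.
tier = check_break_tier(
    ["Did", "you", "want", "an", "example"],
    ["0", "1", "0", "1", "4"],
)
```

Requiring equal-length lists is one simple way to enforce the rule that every juncture, including those after fragments and filled pauses, receives an explicit value.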
The Tone Tier
Phrasal tones will be assigned at every intermediate or intonation phrase:
L- or H- phrase accent, which occurs at an intermediate phrase boundary (level 3 and above); note that this represents a return to the notation in Pierrehumbert (1980)
L% or H% (final) boundary tone, which occurs at every full intonation phrase boundary (level 4)
%H high initial boundary tone; marks a phrase that begins relatively high in the speaker's pitch range; the default initial boundary is in the middle of the range or lower, and will be left unmarked in the transcription. Transcribers should use %H only when a high pitch at the beginning of an utterance cannot be attributed to a H accent (H* or H+!H*) on the first or second syllable in the utterance (i.e., when the first word itself does not appear to be accented, or when its accented syllable occurs too far into the word to account for the initial H), and where the utterance contrasts with a possible rendition with a lower-pitched onset.
Note that, since intonation phrases are composed of one or more intermediate phrases plus a boundary tone, full intonation phrase boundaries will have two final tones, e.g.:
L- L% for a full intonation phrase with a L phrase accent ending its final intermediate phrase and a L% boundary tone falling to a point low in the speaker's range, as in the standard `declarative' contour of American English.
L- H% for a full intonation phrase with a L phrase accent closing the last intermediate phrase, followed by a H boundary tone, as in `continuation rise'.
H- H% for an intonation phrase with a final intermediate phrase ending in a H phrase accent and a subsequent H boundary tone, as in the canonical `yes-no question' contour. Note that the H- phrase accent causes `upstep' on the following boundary tone, so that the H% after a H- rises to a very high value.
H- L% for an intonation phrase in which the H phrase accent of the final intermediate phrase upsteps the L% to a value in the middle of the speaker's range, producing a final level `plateau'.
For convenience, labellers may prefer to mark the tones at a break index with value `4' in a single step, with H-H%, L-L%, H-L%, or L-H%. We recommend that ToBI label menus include these symbols in addition to the separate symbols for phrasal and boundary tones described above; two additional symbols for downstepped phrase accent/boundary tone combinations will be described below in Section 4.3.
In the waves(tm) label file, the phrase accent and/or boundary tone associated with a phrase should be marked at a point at or just before the end of the last segment in the word ending the intermediate or full intonation phrase, and always before the related break-index mark; high initial boundary tones should be marked at the beginning of the phrase, where the H tone is observed and should always be located after the break-index marker for any preceding phrase.
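The correspondence between break index levels and phrasal tones can be expressed as a small consistency check. This is a hedged Python sketch under stated assumptions: the function name is hypothetical, only levels 3 and 4 are checked (level 2 can legitimately carry tones in the mismatch case described in the break index section), and the downstepped `!H-' phrase accent is accepted in anticipation of the downstep diacritic.

```python
import re

# A phrase accent (L-, H-, or downstepped !H-), optionally followed by a
# boundary tone (L% or H%), written with or without a space between them.
PHRASE_FINAL_RE = re.compile(r"(!?H|L)-\s*([HL]%)?")


def phrase_final_tones_ok(break_index, tone_label):
    """Check the tones marked at a level 3 or level 4 break.

    Level 3 (intermediate phrase): a phrase accent alone.
    Level 4 (full intonation phrase): a phrase accent plus a boundary tone.
    """
    if break_index not in (3, 4):
        raise ValueError("this sketch only covers break index levels 3 and 4")
    m = PHRASE_FINAL_RE.fullmatch(tone_label.strip())
    if m is None:
        return False
    has_boundary_tone = m.group(2) is not None
    return has_boundary_tone if break_index == 4 else not has_boundary_tone
```

For example, `phrase_final_tones_ok(4, "L- L%")` accepts the `declarative' contour, while `phrase_final_tones_ok(4, "L-")` flags a level 4 break that lacks its boundary tone.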
Pitch accent tones will be marked at every accented syllable. Lack of pitch accent assignment for a syllable will be interpreted as meaning that the syllable is NOT accented. The ToBI transcription allows for the following five types of pitch accents. (Transcribers labelling utterances in dialects other than standard American English, standard Australian English, or RP British English may need to add additional types. These should be described in a general introduction to the transcribed database.)
H* `peak accent' -- an apparent tone target on the accented syllable which is in the upper part of the speaker's pitch range for the phrase. This includes tones in the middle of the pitch range, but precludes very low F0 targets. [Corresponds to H* and H*+L in Pierrehumbert's six-accent inventory.]
L* `low accent' -- an apparent tone target on the accented syllable which is in the lowest part of the speaker's pitch range.
L*+H `scooped accent' -- a low tone target on the accented syllable which is immediately followed by a relatively sharp rise to a peak in the upper part of the speaker's pitch range.
L+H* `rising peak accent' -- a high peak target on the accented syllable which is immediately preceded by a relatively sharp rise from a valley in the lowest part of the speaker's pitch range.
H+!H* a clear step down onto the accented syllable from a high pitch which itself cannot be accounted for by a H phrasal tone ending the preceding phrase or by a preceding H pitch accent in the same phrase; should only be used when the preceding material is clearly high-pitched and unaccented. (Otherwise the accent is a simple !H*.)
In a waves(tm) label file, the pitch accent tone label should be placed within the nucleus of the accented syllable (i.e. the syllable that is phonologically associated to the starred tone of the accent), and always before the orthographic label and the break index mark at the end of the word.
If the F0 peak or valley for the starred H or L tone does not occur within the accented syllable, labellers who so wish may mark the early (or late) F0 event with `>' (or `<') pointing to the following (or preceding) pitch accent label. Thus, for example, if the F0 maximum for a L+H* occurs after the end of the accented syllable, a labeller may mark the time of the F0 peak with a `<' pointing back to the L+H* label.
Implicit in our discussion of the five pitch accents is the notion that H* is the `default' accent type. So, if there is any uncertainty about how low the F0 is before the peak, as in some cases of possible L+H* near the beginning of an utterance, the transcriber should mark `H*' rather than `L+H*'.
Downstep Diacritic for Pitch Accents and Phrase Accents
Downstep is marked with the diacritic `!' immediately preceding the downstepped pitch accent peak or downstepped H phrase accent. Transcribers familiar with Pierrehumbert's full system should note that this eliminates the H*+L accent as a necessary downstep trigger within the system, since now the contrast between H* and H*+L will be marked by the absence versus presence of `!' on the following H tone.
Note that, since it is the H tone in each case that is affected by the downstep, the `!' diacritic should immediately precede the affected H tone in a pitch accent or phrase accent. Note also that this diacritic is NEVER applied to the first H tone in a phrase.
Some example uses of the downstep diacritic are:
H* !H- L% for the downstepped high phrase tone in the ``calling contour'' that in Pierrehumbert's original system was analyzed as H*+L H- L%
H* !H* L- L% for the ``staircase'' pattern that in Pierrehumbert's original system was analyzed as H*+L H* L- L%
L*+H L*+!H L*+!H for the succession of downstepped peaks that would occur in a succession of scooped accents
In light of our recommendations above that the possible tones of a level 4 break should be included as separate menu items, the possibility of the downstepped H phrase accent at full intonation phrase boundaries means that `!H-L%' and `!H-H%' should also be included as menu items.
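The rule that `!' is never applied to the first H tone in a phrase lends itself to an automatic check. The following Python sketch is illustrative only; the function name is hypothetical, and it assumes the tone labels of one phrase are supplied in temporal order as plain strings.

```python
import re

# An H tone, with or without the downstep diacritic directly before it.
H_TONE_RE = re.compile(r"(!?)H")


def first_h_is_downstepped(phrase_labels):
    """Return True if the first H tone in the phrase carries '!' (a violation)."""
    for label in phrase_labels:
        m = H_TONE_RE.search(label)
        if m is not None:
            return m.group(1) == "!"
    return False  # no H tone in the phrase, so nothing can be downstepped


# Patterns from the examples above, plus one ill-formed phrase.
staircase = ["H*", "!H*", "L-L%"]
scooped = ["L*+H", "L*+!H", "L*+!H"]
ill_formed = ["!H*", "L-L%"]
```

Only the first H tone needs inspecting, because the diacritic is allowed on any later H tone in the same phrase.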
Underspecification and Uncertainty
On the tonal tier, two kinds of uncertainty may be indicated: uncertainty over whether an event of a particular type has occurred, and uncertainty over the tonal value of an event that clearly has occurred. Thus, for example, the labeller may be unsure whether a particular syllable is accented, or, knowing that it is accented, may be uncertain of the accent type. Uncertainty of the first sort (whether the event has occurred) is indicated by `*?', `-?', and `%?' for pitch accents, phrase accents, and boundary tones, respectively. Uncertainty of the second sort (over the tonal value of a clearly occurring event) is indicated by `X*?', `X-?', and `X%?'. Thus, for example:
* means `This syllable is accented but the database does not yet have accent type transcribed.'
*? means `I'm not sure whether this syllable is accented or not.'
X*? means `I believe this syllable is accented but I am uncertain what accent type to assign.'
A typical case where `*?' might be used is for a very strong syllable in a part of an utterance between a prenuclear H* and a nuclear H*, where the F0 contour is flat and high because of the preceding and following tones, making it difficult to detect intervening H* accents. A typical case where `X*?' might be used is a part of an utterance where the labeller cannot tell whether an accent is a L* accent or a H* accent in a compressed pitch range.
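The two kinds of uncertainty can be distinguished by a simple classifier. This is a minimal Python sketch with hypothetical names. The text explicitly defines only the bare `*' label for an untranscribed type, so treating bare `-' and `%' in the same way is an assumption made here by analogy.

```python
import re

# Optional 'X', then the event marker, then an optional '?'.
UNCERTAINTY_RE = re.compile(r"(X?)([*%-])(\??)")

EVENT_KIND = {"*": "pitch accent", "-": "phrase accent", "%": "boundary tone"}


def classify_uncertainty(label):
    """Return (event kind, status) for an underspecified or uncertain tone label."""
    m = UNCERTAINTY_RE.fullmatch(label.strip())
    if m is None:
        raise ValueError(f"not an uncertainty label: {label!r}")
    has_x, marker, has_q = m.groups()
    if has_x and has_q:
        status = "event occurred, type uncertain"
    elif has_q:
        status = "uncertain whether event occurred"
    elif not has_x:
        # Bare '*' is defined in the text; bare '-' and '%' are assumed analogous.
        status = "event occurred, type not yet transcribed"
    else:
        raise ValueError(f"'X' without '?' is not described in the text: {label!r}")
    return (EVENT_KIND[marker], status)
```

For example, `classify_uncertainty("X*?")` reports a pitch accent whose type is uncertain, and `classify_uncertainty("%?")` reports doubt about whether a boundary tone is present at all.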
The Miscellaneous Tier
Labels on the miscellaneous tier should be placed in the text transcription or in the waves(tm) label file to correspond as closely as possible to the temporal beginnings and endings of the phenomena being described. So, a period of laughter plus speech might be indicated by marking the beginning and end of the laughter with:
Single comments in the misc tier such as `bad pitch track' or `disfl' (for `disfluency') are also allowable. However, whenever a misc comment refers to a region and not just a particular point in time, mark the beginning and end of the region. For example, if the pitch tracking algorithm has made an identifiable error in a particular region, such as pitch doubling or pitch halving, the transcriber should consider giving this more specific information in the usual paired event label format.
In general, it is the assumption of the participants in the common transcription group that silences should be automatically detectable, at least to a first approximation, and that transcriber time should not be spent marking these by hand. Disfluencies, by contrast, are not automatically detectable, and the absence of markings for them makes it difficult to parse the tone and break index tiers. For these reasons, transcribers are urged to mark disfluencies on the miscellaneous tier using `disfl>' and `disfl<' (or `disfl' if the disfluency is extremely localized), and to provide these marks in the miscellaneous tier menu when using waves(tm). Since demarcating a disfluent region is considerably more difficult than merely recognizing its presence, the marks `disfl<' and `disfl>' (or simple `disfl') should be interpreted as rough pointers to the disfluent region and transcribers should not agonize over placing them precisely. Suggested conventions for further specification of particular types of disfluencies and their labels are provided in the ``Guidelines for ToBI Labelling''.
Redundancy Among Tiers
Files Associated with the Transcriptions
Since utterances will be recorded and transcribed at different sites, and for different immediate research purposes, it seems unlikely that we can arrive at any simple guidelines for such matters as sampling rate. We recommend adoption of formats compatible with other corpora insofar as possible.
Transcription Label Files
Each tier of ToBI should be representable in a simple text-based transcription, and as a separate label file in the waves(tm) label format. So, there will be separate label files for the orthographic tier, the break index tier, the tonal tier, and the miscellaneous tier. Such modularity allows partial transcription to be done and allows sites to add additional tiers as additional label files. All label files are of course aligned temporally via the waveform they label. This approach should also allow variation in display and access to different types of information. It is easy to provide software that supports labelling in such a format and that will generate summaries of prosodic information from such label files in a variety of formats.
Conventions for Non-waves(tm) Format
Each line contains a number of fields. Fields are separated by markers to facilitate extraction of information. The format is as follows.
w(*)ord ^tonal_marker $break_index @(time_stamp) time_stamp ;comment
An example of the fields of a non-waves(tm) transcription is shown below. (The neat organization in columns is purely for reading convenience and is not a requirement.) The waveform, F0 contour, and associated labels in a waves(tm) transcription are given in Appendix
than ^ $1 @500 ;this is so much comment about
;all kinds of things that it
*eight ^h* $1 @600 ;continues on the next lines
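Because each field is introduced by its own marker, lines in this format can be split with a single regular expression. The following Python sketch is one possible reading of the format, not a reference implementation. The names are hypothetical, and it assumes that a line beginning with `;' continues the comment of the preceding line, that any leading `*' is kept as part of the word field, and that only the single `@' time stamp is present (the second time stamp in the format line is not handled).

```python
import re

# Field markers: '^' tone, '$' break index, '@' time stamp, ';' comment.
LINE_RE = re.compile(
    r"(?P<word>[^\s^$@;]+)"
    r"(?:\s*\^(?P<tone>[^\s$@;]*))?"
    r"(?:\s*\$(?P<break_index>[^\s@;]*))?"
    r"(?:\s*@(?P<time>[^\s;]*))?"
    r"(?:\s*;(?P<comment>.*))?"
)


def parse_transcription(text):
    """Parse text-format lines into a list of dicts, one per word."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(";"):
            # Assumed continuation of the previous word's comment.
            if records:
                prev = records[-1]
                joined = (prev["comment"] or "") + " " + line[1:].strip()
                prev["comment"] = joined.strip()
            continue
        m = LINE_RE.fullmatch(line)
        if m is None:
            raise ValueError(f"cannot parse line: {raw!r}")
        records.append(m.groupdict())
    return records


sample = (
    "than ^ $1 @500 ;this is so much comment about\n"
    "               ;all kinds of things that it\n"
    "*eight ^h* $1 @600 ;continues on the next lines\n"
)
records = parse_transcription(sample)
```

Fields that are absent from a line come back as `None`, and a marker with nothing after it (such as the bare `^' on the first sample line) comes back as an empty string, which keeps "no tone marked" distinct from "tone field omitted".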
Note: If you have waves(tm), you can get the example utterance and its F0 and label files to look at and play. To do this:
ftp portia.ling.ohio-state.edu
login as anonymous, using your local user-id as your password.
cd pub
get ToBI.example.tar
tar -xvf ToBI.example.tar