Guidelines for ToBI Labelling


    1. Overview and some basics

    1. The basic parts of ToBI

      A ToBI transcription for an utterance consists minimally of a recording of the speech, an associated electronic or paper record of the fundamental frequency contour, and (the transcription proper) symbolic labels for events arranged in four parallel tiers. (Other tiers can be added for the needs of particular sites -- see Section 4.) The four tiers of labels, arranged in the order that they appear in the default labels window for the examples and exercises programs, are:

      1. a tone tier
      2. an orthographic tier
      3. a break-index tier
      4. a miscellaneous tier

      The tone and break-index tiers represent the core prosodic analysis. The tone tier is the part of the transcription that corresponds most closely to a phonological analysis of the utterance's intonation pattern. It consists of labels for distinctive pitch events, transcribed as a sequence of high (H) and low (L) tones marked with diacritics indicating their intonational function as parts of pitch accents or as phrase tones marking the edges of two types of intonationally marked prosodic units. The inventory of pitch events and their definitions are based on autosegmental analyses, in particular the analysis of Pierrehumbert and her colleagues (see Pierrehumbert & Hirschberg, 1990, and the references cited in it) with some modifications toward such alternative analyses as that of Ladd (1983). In example utterance <<jam1>>, there is a production of the question "Will you have marmalade, or jam?" with two pitch accents (the L* tones), two phrase accents (the H- tones), and a H% boundary tone.

      EXAMPLE <<jam1>>: Will you have marmalade, or jam?
        tones: L* H- L* H-H%
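      The reading of the <<jam1>> tone string given above can be sketched in code. This is only an illustrative sketch, not part of the ToBI tools: the function name `classify_tones` and the regular expression are assumptions, and the rule applied is simply that a label containing `*` is a pitch accent, one ending in `-` is a phrase accent, and one carrying `%` is a boundary tone.

```python
import re
from collections import Counter

# Edge tones: phrase accents end in "-", boundary tones carry "%".
EDGE_TONE = re.compile(r"%H|[LH]-|[LH]%")

def classify_tones(tone_string):
    """Return (label, kind) pairs for a whitespace-separated tone string."""
    result = []
    for token in tone_string.split():
        if "*" in token:
            # The star marks a pitch accent (H*, L*, L+H*, !H*, ...).
            result.append((token, "pitch accent"))
            continue
        # A token such as "H-H%" is a phrase accent followed by a boundary tone.
        for tone in EDGE_TONE.findall(token):
            kind = "phrase accent" if tone.endswith("-") else "boundary tone"
            result.append((tone, kind))
    return result

jam1 = classify_tones("L* H- L* H-H%")
counts = Counter(kind for _, kind in jam1)
print(counts)
```

      For <<jam1>> the counts agree with the description in the text: two pitch accents, two phrase accents, and one boundary tone.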

      The break-index tier marks the prosodic grouping of the words in an utterance by labelling the end of each word for the subjective strength of its association with the next word, on a scale from 0 (for the strongest perceived conjoining) to 4 (for the most disjoint). These categories of association strength, or `break indices', are based on work by Mari Ostendorf, Patti Price, Stefanie Shattuck-Hufnagel, and their associates (see, e.g., Price et al., 1991). We equate the two highest break indices with prosodic groupings that are marked intonationally. For example, break index 3 after the word "marmalade" in utterance <<jam1>> corresponds to the end of the intermediate phrase indicated by the H- phrase accent.

      EXAMPLE <<jam1>>: Will you have marmalade, or jam?
        breaks: 1 1 1 3 1 4
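      As a rough illustration of how the two highest break indices define prosodic groupings, the sketch below splits the words of <<jam1>> wherever the break index reaches a given level. The function name `group_phrases` is hypothetical, and the sketch assumes plain integer break indices (it does not handle labels such as `1p`).

```python
def group_phrases(words, breaks, level):
    """Split words into groups that end wherever the break index is >= level."""
    groups, current = [], []
    for word, index in zip(words, breaks):
        current.append(word)
        if index >= level:
            groups.append(current)
            current = []
    if current:  # trailing words with no closing boundary
        groups.append(current)
    return groups

words = ["Will", "you", "have", "marmalade", "or", "jam"]
breaks = [1, 1, 1, 3, 1, 4]

intermediate = group_phrases(words, breaks, level=3)  # intermediate phrases
intonational = group_phrases(words, breaks, level=4)  # intonation phrases
print(intermediate)
print(intonational)
```

      With level 3 the utterance falls into two intermediate phrases ("Will you have marmalade" and "or jam"); with level 4 it is a single intonation phrase.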

      The orthographic tier is arguably not part of any core prosodic analysis, except inasmuch as the labels on this tier can be used to interface the transcription to dictionary entries which do indicate such things as which syllable is likely to be most stressed in each word, prosodic information which is not otherwise included in the ToBI system. The orthographic tier is a straightforward transcription of all of the words in the utterance, in ordinary English orthography. When using waves(tm) and a transcriber script, or any similar computer labelling system, the convention is to align each orthographic label to the end of the word.

      The miscellaneous tier, like the orthographic tier, can include many events that are arguably not part of prosody per se. However, many events that are typically marked on this tier are important for interpreting the analyses on the tone tier and break-index tier, because they disrupt the smooth rhythm of the utterance or interrupt the intonation contour. This tier is essentially a `comment' tier that can be used to mark events such as the cough in example utterance <<cough>>. With very few exceptions (most notably, the label `disfl', which often stands alone to flag the occurrence of a perceived disfluency of some type), labels on this tier come in pairs, to mark the beginning and end of each event interval. If it were not for the disruption of the cough labelled on the miscellaneous tier here, the tone transcription would have to be parsed as either unfinished or ill-formed.

      EXAMPLE <<cough>>: Will you have marmalade ...
        tones:  L* L*
        breaks: 1 1 1 1p
        misc:   cough< cough>
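      The pairing convention for the miscellaneous tier lends itself to a simple consistency check. The sketch below is an assumption-laden illustration rather than an existing tool: it takes the `name<` / `name>` notation from the <<cough>> example as the general pattern, treats `disfl` as the only standalone label, and uses the hypothetical function name `unpaired_misc_labels`.

```python
STANDALONE = {"disfl"}  # labels the text says may occur unpaired

def unpaired_misc_labels(labels):
    """Return misc-tier labels whose '<' / '>' partner is missing."""
    open_events = []  # events opened with 'name<' and not yet closed
    problems = []
    for label in labels:
        if label in STANDALONE:
            continue
        if label.endswith("<"):
            open_events.append(label[:-1])
        elif label.endswith(">"):
            name = label[:-1]
            if name in open_events:
                open_events.remove(name)
            else:
                problems.append(label)  # end mark with no start mark
        else:
            problems.append(label)      # neither paired nor standalone
    # Anything still open never received its end mark.
    problems.extend(name + "<" for name in open_events)
    return problems

print(unpaired_misc_labels(["cough<", "cough>"]))
print(unpaired_misc_labels(["cough<", "disfl"]))
```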

    2. Guiding principles

      As should be obvious from the preceding examples, ToBI does not try to transcribe all aspects of prosody, or even all aspects that are amenable to symbolic transcription. In deciding what to include and what to leave out, we were guided by three principles. First, we wanted to be able to distinguish in our transcription all of the categorically distinct intonation patterns and prosodic units of the language (or rather of the three intonationally similar dialects that we claim to cover -- see Section 0.4 above). Second, we felt we should not transcribe aspects of prosody which are more amenable to quantitative measures than to the categorical divisions of a symbolic transcription. Finally, we did not want to squander the user's energies in transcribing even categorical aspects of prosody which are predictable from other parts of the transcription or from auxiliary tools such as dictionaries.

      The categorical aspects of prosody which we try to capture completely (by the first principle) are of two types. The first is the prosodic structure -- the rhythm of more and less stressed words alternating with each other, and the grouping of words into prosodic constituents of various sizes -- and the second is the intonation pattern -- the sequence of contrastive pitch events that we call pitch accents, phrase accents, and boundary tones.

      An example of the noncategorical aspects of prosody which we leave out (in accordance with the second principle) is the local tempo of each word in the utterance, which we feel could be more accurately and directly captured by some quantitative measure such as normalized segment duration (e.g., Campbell, 1992) than by any symbolic transcription such as an arbitrary division into, say, categories `1', `2', and `3' (for `slow', `medium', and `fast' tempi). An exception to this principle is the marking for each phrase of the point of highest fundamental frequency associated with an accent (HiF0), which we use as a measure of pitch range in order to facilitate research on the relationship between pitch range and discourse structure (see, e.g., Grosz & Hirschberg, 1992, and references therein). We anticipate being able to do away with this marking when we have developed automatic tools for detecting accent-related peaks directly from the fundamental frequency contour in conjunction with the tone tier transcription.

      A categorical aspect of prosody which we leave out (in accordance with the third principle) because it should be fairly predictable is the marking of the stressed and unstressed syllables within each word. By this level of stress we mean the word-internal alternation between more and less stressed syllables where the relative prominence of any pair of syllables is fairly fixed and can be thought of as inherent to the word's dictionary entry. For example, if the first and third syllables in the word "marmalade" are not pronounced with more prominence than the second, native speakers will judge the vowels in these two syllables to be mispronounced. (That is, the first and third syllables should not have reduced vowels, whereas the second one should.) Since such word-internal rhythms are thus a fixed part of the word's pronunciation, we leave this specification out. That is, for example, in the transcription of utterances <<jam1>> and <<cough>>, we have not marked the first and third syllables as relatively more stressed than the second syllable, since this aspect of the prosodic structure would be marked in any dictionary entry for the word, so that users of ToBI-transcribed databases could interface the orthographic tier with an online dictionary to fill in this information.
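      The dictionary interface mentioned above might look something like the following sketch. Everything here is hypothetical: the one-entry `TOY_DICTIONARY`, its encoding (one flag per syllable, True for a full vowel and False for a reduced one), and the function name `lexical_stress` are illustrative stand-ins for whatever online dictionary a site actually uses.

```python
# Toy pronunciation dictionary (hypothetical): one flag per syllable,
# True for a full (more stressed) vowel, False for a reduced one.
TOY_DICTIONARY = {
    "marmalade": (True, False, True),
}

def lexical_stress(orthographic_tier, dictionary=TOY_DICTIONARY):
    """Pair each word on the orthographic tier with its dictionary pattern.

    Words missing from the dictionary are paired with None rather than a guess.
    """
    pairs = []
    for word in orthographic_tier:
        key = word.lower().strip(".,?!")  # ignore case and trailing punctuation
        pairs.append((word, dictionary.get(key)))
    return pairs

looked_up = lexical_stress(["Will", "you", "have", "marmalade,"])
print(looked_up)
```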

    3. The marking of stress -- Pitch accents and prominence

      If the stress patterns within words are largely predictable from the dictionary entries for the word, what about other levels of stress? It has been recognized for some time now (e.g. Bolinger, 1972) that other aspects of the stress pattern cannot be predicted from the grammar with anything like the confidence with which we can predict the more stressed syllables within a word. Indeed the factors predicting the prominence of a word relative to other words in the same sentence are a matter of much current debate (see e.g. Hirschberg, 1993), and this is one of the issues which we hope ToBI-transcribed databases will be most useful in helping to resolve.

      Example utterance <<made1>> illustrates the unpredictability of prominences above the word, with three different productions of the same sentence -- "Marianna made the marmalade" -- each of which has a different stress pattern. In the first production, there are two syllables that are relatively more prominent than any other, the accented syllables in the words "Marianna" and "marmalade". In the second production of the sentence, on the other hand, there is only the one relatively more prominent syllable in "Marianna", and "marmalade" has been `deaccented'. This level of stress is marked in the ToBI system by directly transcribing the pitch accent on the tone tier. Thus, in the transcription of the first production in the example, there are H* accents marked for both "Marianna" and "marmalade", whereas in the second production there is only the L+H* accent marked on "Marianna". (The third production, like the first, also has accents on "Marianna" and "marmalade", but it has a different stress pattern because both of these accents are nuclear stresses, whereas in the first production only "marmalade" has a nuclear accent. We will describe this higher level of stress in more detail in the next subsection.)

      EXAMPLE <<made1>>: Marianna made the marmalade. (in three productions)
        1) tones: H* H* L-L%
        2) tones: L+H* L-L%
        3) tones: L+H* L-H% L* H* L-L%

      Note that there is another difference between the first production and the last two: the second and third productions begin at a much lower fundamental frequency than the first. This is due to the distinction, marked on the tone tier, between a single-tone H* pitch accent and a bitonal L+H* pitch accent. This contrast is independent of the difference in stress pattern, which depends on the pattern of pitch accent PLACEMENT and not on the type of pitch accent. To see this, compare the first two productions of the sentence in <<made2>> with the second two productions. (These first two sentences are the same as productions (1) and (2) in <<made1>>.)

      EXAMPLE <<made2>>: Marianna made the marmalade. (in four productions)
        1) tones: H* H* L-L%
        2) tones: L+H* L-L%
        3) tones: L+H* !H* L-L%
        4) tones: H* L-L%

      The stress patterns are the same, but the choice of H* versus L+H* pitch accent type is the opposite. (For the relationship between the second pitch accent and the first in production (3) and the diacritic `!' that marks this relationship, see Section 2.8 below. The somewhat less low beginning in the third production is also discussed in Section 2.2.) The same stress patterns are illustrated again in the third and fourth productions in <<made3>> with yet another pitch accent type, this time a L* pitch accent (with a following rise into H- phrase accent and H% boundary tone).

      EXAMPLE <<made3>>: Marianna made the marmalade. (in four productions)
        1) tones: L+H* !H* L-L%
        2) tones: H* L-L%
        3) tones: L* L* H-H%
        4) tones: L* H-H%

      In transcriptions using waves(tm) label files (or any similar computer labelling system), the stress that comes from associated pitch accents can be parsed from reading the tone tier, since the waveform is used to place the mark for a pitch accent somewhere in the syllable that is phonologically associated to the accent. In the non-waves(tm) transcription conventions, the stress is marked even more explicitly in the symbolic string, by putting an asterisk in the orthographic transcription just before the vowel of each accented syllable.

    4. The marking of stress -- Intonational phrasing and prominence

      Above the level of contrast between pitch-accented versus unaccented words, native speakers of English can distinguish another level of stress contrast, that between the last accented word of a phrase and any preceding accent. In the first production in utterance <<made1>>, for example, the word "marmalade" feels more prominent than "Marianna". In the last production of the sentence, on the other hand, "marmalade" does not feel necessarily more prominent than "Marianna". The sentence has been divided into two intonational phrases, so that each of these words is the last accented word in its own phrase. (This level of prominence is often called the `nuclear stress' or `nuclear accent' of the phrase.) Note that the level of prominence need not be marked explicitly, since the word with nuclear stress is defined positionally; it is the last accented word, or the accented word (if there is only one in the phrase). Thus the prominence contrast between a nuclear accent and a mere (prenuclear) accent can be read from the transcription of the accents on the tone tier relative to the boundaries marked between the phrases.

      EXAMPLE <<made1>>: Marianna made the marmalade. (in three productions)
        1) tones: H* H* L-L%               breaks: 1 1 1 4
        2) tones: L+H* L-L%                breaks: 1 1 1 4
        3) tones: L+H* L-H% L* H* L-L%     breaks: 4 1 1 4

      There are two separate markings indicating the boundaries of an intonation phrase; one is the sequence of phrase accent and boundary tone on the tone tier, and the other is the 4 on the break-index tier. The break indices are numbered from 0 (for least disjuncture) to 4 (for most pronounced disjuncture). The numbering captures the hierarchical nature of these prosodic groupings. At the highest level of the break index hierarchy and at the next lower level, the sense of disjuncture between adjacent words is connected closely to the intonation pattern. The boundary after "Marianna" in the third production of the sentence in <<made1>> is one at the highest level in the break index hierarchy transcribed in ToBI. This level is marked tonally by a boundary tone (H% or L%) at its end (and sometimes at its beginning, too, in which case it is %H). The next lower level (break index 3) is marked by a phrase accent (H- or L-) at its end.

      An intonation phrase contains one or more intermediate phrases, and the end of an intonation phrase is by definition also the end of an intermediate phrase (break index 3). This fact is reflected on the tone tier in the requirement that there be a sequence of phrase accent (for the last intermediate phrase) followed by a boundary tone at the end of every intonation phrase. The last production of the sentence in <<made1>> illustrates this nicely with clear reflexes of the tone string in the fundamental frequency contour. Note first the fall from the peak for the L+H* nuclear pitch accent to the L- phrase accent for the first intermediate phrase, followed by the small rise in fundamental frequency to the H% boundary tone at the intonation phrase boundary.
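      The requirement just described (a phrase accent at every break index 3 or 4, plus a boundary tone at every break index 4) can be expressed as a small check. This is a sketch under stated assumptions, not the checker program distributed with ToBI: the function name `edge_tone_errors` and the per-word list representation are hypothetical, and only these two requirements are tested.

```python
import re

def edge_tone_errors(breaks, edge_tones):
    """Check edge tones against break indices, one entry per word.

    breaks:     break index after each word (0-4)
    edge_tones: tone label written at the end of each word, "" if none
    """
    errors = []
    for position, (index, tones) in enumerate(zip(breaks, edge_tones)):
        has_phrase_accent = re.search(r"[LH]-", tones) is not None
        has_boundary_tone = re.search(r"[LH]%", tones) is not None
        # Break indices 3 and 4 both close an intermediate phrase.
        if index >= 3 and not has_phrase_accent:
            errors.append((position, "missing phrase accent"))
        # Break index 4 additionally closes an intonation phrase.
        if index == 4 and not has_boundary_tone:
            errors.append((position, "missing boundary tone"))
    return errors

# Third production of <<made1>>: break indices 4 1 1 4.
ok = edge_tone_errors([4, 1, 1, 4], ["L-H%", "", "", "L-L%"])
bad = edge_tone_errors([4, 1, 1, 4], ["L-", "", "", ""])
print(ok)
print(bad)
```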

      Utterance <<insert>> illustrates the next lower level of disjuncture, that between two intermediate phrases that are grouped into one intonation phrase. In the second production of the sentence "`I' means insert", there is a fall from a H* nuclear accent into a L- phrase accent, but there is no subsequent boundary tone, since this is not an intonation phrase boundary.

      EXAMPLE <<insert>>: `I' means insert. (in two productions)
        1) tones: H* H* L-L%        breaks: 1 1 4
        2) tones: H* L- H* L-L%     breaks: 3 1 4

      Note that the first production of the sentence in <<insert>> contrasts with this second production in its stress pattern in the same way as the first and third productions of <<made1>>. The notion of nuclear accent is defined relative to the intermediate phrase. The contrasting productions in <<made4>> illustrate the same contrast in one versus two nuclear accents with L* pitch accents and a H- phrase accent at the boundary between the two intermediate phrases in the production with two nuclear accents. (The *? on the "made" in the first production illustrates a very common type of ambiguity about accent placement that is discussed below in Section 2.9.)

      EXAMPLE <<made4>>: Marianna made the marmalade. (in two productions)
        1) tones: L* *? L* H-H%       breaks: 1 1 1 4
        2) tones: L* H- L* H-H%       breaks: 3 1 1 4
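      Because nuclear stress is defined positionally, it can be read off the tone and break-index tiers mechanically. The sketch below illustrates this with a hypothetical function `nuclear_accents`. The per-word accent lists are assumptions about how the tone labels align with the words: the alignment for the first production of <<made1>> follows the description in Section 1.3, while the alignment for the second production of <<made4>> is inferred from the break indices.

```python
def nuclear_accents(words, accents, breaks):
    """Return the nuclear-accented word of each intermediate phrase.

    accents: pitch accent label per word, "" for unaccented words
    breaks:  break index after each word; 3 or 4 closes an intermediate phrase
    """
    nuclear, last_accented = [], None
    for word, accent, index in zip(words, accents, breaks):
        if accent:
            last_accented = word  # latest accent seen in the current phrase
        if index >= 3:
            if last_accented is not None:
                nuclear.append(last_accented)
            last_accented = None  # start a new intermediate phrase
    return nuclear

words = ["Marianna", "made", "the", "marmalade"]

# <<made1>> production 1: one phrase, accents on "Marianna" and "marmalade".
one_phrase = nuclear_accents(words, ["H*", "", "", "H*"], [1, 1, 1, 4])

# <<made4>> production 2: break index 3 after "Marianna" (assumed alignment).
two_phrases = nuclear_accents(words, ["L*", "", "", "L*"], [3, 1, 1, 4])

print(one_phrase)
print(two_phrases)
```

      In the first case only "marmalade" is nuclear and "Marianna" is prenuclear; in the second, each word is the last accented word of its own intermediate phrase.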

    5. What lines up with what?

      The conventions for placing labels when using the waves(tm) labelling system are prescribed in the ToBI Annotation Conventions so that labellers can use tools such as John Pitrelli's checker program to check for inadvertent omissions and grammatical errors. To quickly summarize, the break index label is placed at or just after the word label. Phrase accent and boundary tone labels are placed on or just before the corresponding 3 or 4 break index label. Pitch accents are placed somewhere within the accented syllable, preferably within the interval that can be identified with the syllable's vowel.

      In the non-waves(tm) transcription conventions, the orthographic, tone, and break index labels are ordered within each line so that such a transcription could be generated fairly quickly by merging and sorting a set of waves(tm)-format label files.
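      The merge-and-sort step mentioned above can be sketched as follows. This is not a parser for waves(tm) label files: the in-memory (time, label) lists stand in for already-parsed files, the times are invented for illustration, and the function name `merge_tiers` is hypothetical. Ties in time are broken by tier name, which is an arbitrary choice made here.

```python
import heapq

def merge_tiers(**tiers):
    """Merge per-tier (time, label) lists into one time-ordered list of
    (time, tier_name, label) tuples. Each input list must already be sorted."""
    tagged = [
        [(time, name, label) for time, label in labels]
        for name, labels in tiers.items()
    ]
    return list(heapq.merge(*tagged))

# Second production of <<insert>>, with made-up times in seconds.
merged = merge_tiers(
    words=[(0.40, "I"), (0.80, "means"), (1.40, "insert")],
    tones=[(0.20, "H*"), (0.39, "L-"), (1.10, "H*"), (1.39, "L-L%")],
    breaks=[(0.41, "3"), (0.81, "1"), (1.41, "4")],
)
for time, tier, label in merged:
    print(f"{time:5.2f}  {tier:<6}  {label}")
```

      In this invented timing, each edge tone is placed just before its break index and each break index just after its word, in line with the placement conventions summarized above.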

    labelling_guide_v2.ASCII (augmented by some HTML)