    2. More on the tone tier

    1. Tones and fundamental frequency

      As noted above, one of the basic parts of a ToBI transcription for an utterance is an electronic or paper record of the fundamental frequency contour. The transcription of events on the tone tier is closely linked to this record. In the case of pitch accents, the labeller can make this link explicit by choosing to place the label for the pitch accent specifically at the f0 maximum or minimum that realizes the starred tone of the accent, if this f0 event is within the interval of the accented syllable nucleus. (If the maximum or minimum does not actually occur within the syllable nucleus, there are optional conventions for marking the maximum or minimum as well as the accented syllable using the symbols `<' and `>', for a late or early f0 event, respectively -- see Section 4.2 in "The ToBI Annotation Conventions".) There is a more practical connection, as well, inasmuch as most transcribers find the fundamental frequency contour an invaluable aid in making the analysis of the intonation pattern that is embodied by the transcription on the tone tier.

      In interpreting the f0 contour to make the tonal transcription, it is important to keep in mind that several non-tonal aspects of an utterance can also strongly influence the fundamental frequency pattern. One of the most ubiquitous of these influences is the way in which consonant segments in the utterance interrupt the smooth course of the f0. Voiceless stops such as [p] and [t] and voiceless fricatives such as [f] and [s] create `holes' in the f0 contour just by being voiceless. Moreover, it is usually not possible to read the intended pitch during a voiceless consonant by interpolating from the last f0 value before voice offset to the first f0 value after voice onset, because obstruent consonants (stops, fricatives, affricates) all cause dramatic perturbations in the fundamental frequency contour over and above any interruption due to voicelessness per se. As an `intrinsic' characteristic of its voiceless specification, a voiceless obstruent is usually associated with a dip into the consonant constriction and a dramatic fall starting from a much higher frequency just after the consonant release. Even voiced obstruents disturb the f0 contour; a voiced stop or fricative can be associated with a fall into and rise out of an often quite deep valley during the consonant's constriction. Utterance <<blond-baby1>> illustrates some of these effects. There is a dip in the f0 around 1.9 s into the file for the [d] at the beginning of "difference" and a sharp fall around 5.29 s right after the [p] in "pink". (To be sure, the perturbation caused by the [p] here is very small compared to many cases of voiceless obstruents that we have seen.)

      EXAMPLE <<blond-baby1>>: what's the difference among my long memory H* !H* L-L% L+H* !H* H-H% your blond baby and the pink carpeting L+H* *? !H* L-H% L* L* H* L-L% [GIF}
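
      One way to keep such segmental perturbations from being misread as tonal targets is to blank out the f0 frames in and around each obstruent before inspecting the contour. The following is only a minimal sketch of that idea, not part of the ToBI conventions: the function name, the frame format, and the 30 ms margin are all hypothetical, and the obstruent intervals are assumed to come from a segmental labelling that reflects the actual phonetic realization of each consonant.

```python
def mask_obstruent_perturbations(f0_frames, obstruents, margin=0.03):
    """Blank out f0 frames in or near obstruent intervals.

    f0_frames  -- list of (time_s, f0_hz) pairs; f0_hz is None if unvoiced
    obstruents -- list of (start_s, end_s) intervals for obstruent segments
    margin     -- hypothetical safety margin (seconds) on either side of
                  each obstruent, where dips and post-release falls occur
    """
    masked = []
    for time_s, f0_hz in f0_frames:
        near_obstruent = any(
            start - margin <= time_s <= end + margin
            for start, end in obstruents
        )
        # Keep the frame's time stamp, but drop the unreliable f0 value.
        masked.append((time_s, None if near_obstruent else f0_hz))
    return masked


frames = [(0.00, 180.0), (0.04, 178.0), (0.10, None),
          (0.15, 210.0), (0.22, 185.0)]
cleaned = mask_obstruent_perturbations(frames, obstruents=[(0.08, 0.13)])
```

      In this toy input, the raised value at 0.15 s just after the consonant release is discarded, while the frames well away from the obstruent are kept. A suitable margin would have to be chosen by inspecting real data.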

      In interpreting such `intrinsic' segmental effects, it is important to note the actual voicing of the consonant, and not simply its phonemic status. For example, for many speakers, phonemically voiced obstruents at the beginning of a stressed syllable are not always really voiced. Note, for example, the /b/ of "blond" at about 3.95 to 4.0 s in <<blond-baby1>>, which is voiceless unaspirated and has f0-perturbing characteristics more like those of the /p/ of "pink". Also, the consonant /t/ in American and Australian English is usually a voiced flap (a short [d]-like segment) when it begins an unstressed syllable, as in the /t/ of "carpeting" in example utterance <<flap2>>. Similarly, /h/ is often voiced between vowels. Thus, the perturbation caused by these two phonologically voiceless consonants is often like that for a /d/ or a /v/, rather than like that for a true [t], as shown by the /h/ in example utterance <<voiced-h>> (at around 3.04 s).

      EXAMPLE <<flap2>>: The pink carpeting. [GIF} H* H* L-L%
      EXAMPLE <<voiced-h>>: Give him a hand with that. H* L-L% [GIF}

      Example utterance <<flap>> gives another environment where flapping is common; see the flapped /t/ across the word boundary at around 1.34s. The flapping here is also important for transcribing break indices (see Section 3.2).

      EXAMPLE <<flap>>: Don't hit it to Joey. H* L*+!H L-L% [GIF}

      Another kind of problem in interpreting the f0 contour comes from shifts into voice qualities other than normal modal phonation. For example, for most speakers, subglottal pressure falls very sharply at the very end of an utterance. If the cross-glottal pressure difference becomes very weak, there may no longer be good glottal closings -- i.e. the phonation may become quite breathy -- so that even fairly robust pitch-tracking algorithms can easily fail. For some speakers this switch to breathy voice might happen even earlier if the utterance has a long low-pitched stretch corresponding to a L- phrase accent. Or, a speaker might break into creaky voice in such a region. In fact, many speakers break into creaky voice in almost any region with very low fundamental frequency. Since creaky voice is typically characterized by very irregular glottal periods (i.e., the fundamental frequency is physically not well-defined), pitch-tracking algorithms often do not do well during these portions of the utterance, creating a messy `spattering' of values, like that seen in the f0 trace between 4.95 and 5.08 seconds in <<jam2>>. Here the creak is due to the L*.

      EXAMPLE <<jam2>>: Will you have marmalade, or jam? L* H- L* H-H% [GIF}

      The pitch tracker can also completely fail, and give no f0 values at all, as in the region of the L- in the second production of <<made1>> after about 3.4s.

      EXAMPLE <<made1>>: Marianna made the marmalade. second production 2) L+H* L-L% [GIF}

      In these two examples (as in many other occurrences of the same tone types in many of the example utterances in this labelling guide), the creaky voice is reliably interpreted by native speakers as a very low pitch value for some low tone. However, creaky voice does not automatically mean a very low L tone. Creaky voice can also occur as one common manifestation of a glottal stop, a segment which in English often occurs phonetically as a way to set off a word beginning with a stressed syllable that has no onset consonant. For example, the word "airline" in <<glottal-stop>> begins phonetically with a glottal stop realized as creaky voice.

      EXAMPLE <<glottal-stop>>: And set training and experience standards H* H* H- H* H* L-L% for airline inspectors and mechanics. H* H* L- H* L-L% [GIF}

      Nor are breathy voice and creaky voice the only sources of pitch-tracking errors. Even in parts of the utterance with normal modal voicing, pitch-tracking algorithms can sometimes go wrong because of fluctuations of amplitude or because of the vowel's resonance characteristics. A perfectly ordinary period-to-period oscillation in amplitude can cause a halving of the estimated fundamental frequency value, as illustrated in the region between 4.8 and 4.93 seconds and again between 5.3 and 5.45 seconds in example utterance <<pitch-halving>>. (Compare this to <<no-pitch-halving>>, which is exactly the same utterance, pitch-tracked with somewhat different assumptions about the signal parameters which the pitch-tracking program uses in its consistency-checking algorithm.) Or, if the first formant is much higher than the fundamental, the pitch-tracking program might mistake the harmonics that the formant amplifies for intervening glottal pulses, effectively doubling the pitch, as in the region between 14.07 and 14.18 seconds in example utterance <<pitch-doubling>>. Transcribers must therefore learn when to trust their ears to catch such misparsings in the fundamental frequency track (or to use an alternate record of the fundamental frequency contour, such as the narrow-band spectrogram). When all of these perturbing effects are taken into account, however, the fundamental frequency contour becomes a valuable aid in transcribing the events on the tone tier.

      EXAMPLE <<pitch-halving>>: Jim builds a big daisy-chain. H* H* L-L% [GIF}
      EXAMPLE <<no-pitch-halving>>: Jim builds a big daisy-chain. H* H* L-L% [GIF}
      EXAMPLE <<pitch-doubling>>: Then I don't know if I can explain H* L+H* it to you. L-L% [GIF}
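
      Halving and doubling errors of the kind shown in <<pitch-halving>> and <<pitch-doubling>> can often be flagged automatically for closer listening, because the affected frames sit near one half or twice the value of their neighbours. The sketch below is a simple illustration of that check, not the consistency-checking algorithm of any particular pitch tracker; the function name, the window size, and the 15% tolerance are assumptions chosen for the example.

```python
from statistics import median


def flag_octave_errors(f0, window=5, tolerance=0.15):
    """Return (frame_index, kind) pairs for suspected octave errors.

    f0        -- list of f0 values in Hz, with None for unvoiced frames
    window    -- hypothetical number of frames inspected on each side
    tolerance -- hypothetical relative tolerance around the 0.5 and 2.0 ratios
    """
    flagged = []
    for i, value in enumerate(f0):
        if value is None:
            continue
        lo, hi = max(0, i - window), min(len(f0), i + window + 1)
        neighbours = [v for j, v in enumerate(f0[lo:hi], start=lo)
                      if v is not None and j != i]
        if len(neighbours) < 2:
            continue  # too little context to judge this frame
        ratio = value / median(neighbours)
        if abs(ratio - 0.5) <= tolerance * 0.5:
            flagged.append((i, "halved"))
        elif abs(ratio - 2.0) <= tolerance * 2.0:
            flagged.append((i, "doubled"))
    return flagged


track = [200, 202, 198, 100, 201, 199, 400, 203, 200]
suspects = flag_octave_errors(track)
```

      A check like this only catches isolated bad frames. A halved stretch longer than the window drags the local median down with it, so the flagged frames should be treated as prompts to listen again or to consult the narrow-band spectrogram, not as corrections.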

    2. Some familiar contours, and the contrast between H* and L+H*

      The inventory of events that are transcribed on the tone tier consists of five pitch accents, two phrase accents, and two boundary tones (plus downstepped counterparts of pitch accents and phrase accents with H tones). The summary statement in Appendix A lists the symbols for all of these tones and defines their use. In the previous sections, we already illustrated several familiar intonation patterns involving these tones. For example, the first productions of the sentences in example utterances <<made1>> and <<insert>> were instances of the `declarative contour', which is an intonation phrase containing one or more H* pitch accents and ending in a sequence of L- phrase accent and L% boundary tone -- i.e. (H*) H* L- L%. (When there is more than one accent, particularly when there is one relatively early and one relatively late accent, this contour is often called the `hat pattern'.) The last production in utterance <<made1>> illustrated a sequence of L- phrase accent followed by H% boundary tone that is sometimes called the `continuation rise'. The first production in utterance <<made4>> was an example of the `yes-no question contour', consisting of one or more L* accents followed by a H- phrase accent and H% boundary tone -- i.e. (L*) L* H- H%.
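
      For readers who check label files by script, the inventory can be written down as a small lookup. The sketch below is an illustration only, and Appendix A remains the authoritative list. The fifth pitch accent, H+!H*, and the downstepped forms are taken from the standard ToBI conventions rather than from this section; the uncertainty labels *? and X*? are the ones used in the example transcriptions; and the function and set names are hypothetical.

```python
PITCH_ACCENTS = {"H*", "L*", "L+H*", "L*+H", "H+!H*"}
DOWNSTEPPED_ACCENTS = {"!H*", "L+!H*", "L*+!H"}
PHRASE_ACCENTS = {
    "L-": "phrase accent",
    "H-": "phrase accent",
    "!H-": "downstepped phrase accent",
}
FINAL_BOUNDARY_TONES = {"L%", "H%"}
INITIAL_BOUNDARY_TONES = {"%H"}
UNCERTAIN_ACCENTS = {"*?", "X*?"}


def classify(label):
    """Return a list naming the kind(s) of tone in one tone-tier label."""
    label = label.replace(" ", "")  # accept both "L-L%" and "L- L%"
    if label in PITCH_ACCENTS:
        return ["pitch accent"]
    if label in DOWNSTEPPED_ACCENTS:
        return ["downstepped pitch accent"]
    if label in UNCERTAIN_ACCENTS:
        return ["uncertain accent label"]
    if label in INITIAL_BOUNDARY_TONES:
        return ["initial boundary tone"]
    if label in FINAL_BOUNDARY_TONES:
        return ["final boundary tone"]
    # A phrase accent may stand alone or be followed by a boundary tone.
    for prefix, kind in PHRASE_ACCENTS.items():
        if label.startswith(prefix):
            rest = label[len(prefix):]
            if rest == "":
                return [kind]
            if rest in FINAL_BOUNDARY_TONES:
                return [kind, "final boundary tone"]
    return ["unknown"]


declarative = [classify(label) for label in ["H*", "H*", "L- L%"]]
```

      Labels that come back as "unknown" are most likely typing errors in the label file and are worth checking by hand.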

      The productions in <<made1>> and the first two productions in <<made3>> also illustrate one of the more difficult contrasts in pitch accent type -- that between the two types of `peak accent' in which the peak is timed to occur on the accented syllable (H* versus L+H*). These two pitch accents are alike in that both have high fundamental frequency targets timed to occur on the accented syllable. They are alike also in that the actual timing of the f0 peak that realizes the high tone can vary depending on the phonetic length of the syllable and on the neighboring tones. In longer syllables just before a L- phrase accent, the peak tends to come fairly early in the syllable, whereas in short syllables with no immediately following tone target, the peak for the high tone can be quite late, sometimes after the actual acoustic end of the syllable. This is illustrated in the hat pattern utterance in <<word1>>. The peak for the high tone of the first H* on "word" comes rather late (in the last third of the syllable), whereas the peak for the high tone of the second H* comes very early in "word" before the L- low tone target (during the first quarter of the syllable). How then do the two pitch accents differ?

      EXAMPLE <<word1>>: Your word is your word. H* H* L-L% [GIF}

      The essential difference is what happens before the high tone. The leading L tone in L+H* is meant to transcribe a rise from a fundamental frequency value low in the pitch range that cannot be attributed to a L* pitch accent on the preceding syllable or to a L- phrase accent or L% boundary tone at a preceding intermediate-phrase or intonation-phrase boundary. For H*, by contrast, there is at most a small rise from the middle of the speaker's voice range (unless, of course, the H* follows soon after some low tone such as a L* pitch accent or L- phrase accent). Example utterance <<won>> is a minimal pair illustrating this contrast.

      EXAMPLE <<won>>: Marianna won it. in two productions 1) H* L- L% 2) L+H* L- L% [GIF}
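
      The decision just described can be summarized as a rule of thumb: transcribe L+H* only when the f0 before the peak is low in the speaker's range and no preceding low tone accounts for that low value, and otherwise transcribe H*. The sketch below encodes that rule under loudly stated assumptions. The function name is hypothetical, the speaker's range is assumed to have been measured separately, and the cut-off of one quarter of the range is an arbitrary illustrative value, not a figure from the ToBI conventions. It is no substitute for listening.

```python
def peak_accent_label(prepeak_min_hz, range_floor_hz, range_ceiling_hz,
                      preceding_low_tone, low_fraction=0.25):
    """Suggest H* or L+H* for a peak accent.

    prepeak_min_hz     -- lowest f0 just before the rise to the peak
    range_floor_hz     -- bottom of the speaker's pitch range
    range_ceiling_hz   -- top of the speaker's pitch range
    preceding_low_tone -- True if a nearby L*, L- or L% could explain the low f0
    low_fraction       -- hypothetical cut-off for "low in the pitch range"
    """
    span = range_ceiling_hz - range_floor_hz
    if span <= 0:
        raise ValueError("range ceiling must be above range floor")
    relative_height = (prepeak_min_hz - range_floor_hz) / span
    if relative_height <= low_fraction and not preceding_low_tone:
        return "L+H*"
    return "H*"  # the conservative default whenever the evidence is unclear


first = peak_accent_label(190, 100, 300, preceding_low_tone=False)
second = peak_accent_label(110, 100, 300, preceding_low_tone=False)
```

      Here the first call starts its rise from the middle of the range and is labelled H*, while the second starts from near the bottom of the range with no preceding low tone and is labelled L+H*.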

      In the English intonation system as described by Pierrehumbert & Hirschberg (1990), H* and L+H* have distinct meanings, which make the latter more likely to occur in a contrastive context such as the one evoked by the second production of the sentence in <<made1>>. In theory, this contrast between H* and L+H* can occur anywhere within a phrase. However, the distinction is difficult to make when the accented syllable is the first in the utterance, as in the second production of the sentence in <<anna>>. The three productions in <<anna>> are examples of almost exactly the same patterns as those exemplified by the three productions in <<made1>>. However, because the word "Anna" has no unstressed syllables before the main stressed one, it is difficult to realize the low tone for the nuclear accent on the first word in the second production. In cases such as this, where the evidence for L+H* comes from (theory-dependent) intuitions about meaning rather than from any clear low-pitched region in the fundamental frequency contour, the ToBI Annotation Conventions prescribe H* instead. (The *? on "married" in the first production illustrates a very common type of ambiguity about accent placement that is discussed below in Section 2.9.)

      EXAMPLE <<anna>>: Anna married Lenny. in three productions 1) H* *? H* L-L% 2) H* L-L% 3) H* L-H% L+H* L-L% [GIF}

      Even when there is a long enough stretch between the beginning of the utterance and the accent, L+H* can be difficult to distinguish from H* because the categorical distinction in meaning is not always matched by a categorical distinction in the f0 level of the low tone. (The mapping of phonetic continua onto discrete oppositions is a well-known problem in segmental phonology as well.) Utterance <<made2>> above illustrates this. The L tone of the L+H* in the third production is not so low as that in the second production. When such utterances are taken out of context, it is possible for even intonational experts to be confused, and in fact, another transcriber with long experience in transcribing English pitch accents questioned our transcription of this as L+H*. (We are confident in the transcription, and did not mark it as X*? -- see Section 2.9 below -- but only because we know the context.)

      The last productions in <<made1>> and <<anna>> are very similar to another type of contour where one needs to be especially careful in choosing between H* and L+H*. In both of these sentences, the nuclear stress for the second intonation phrase occurs late enough that the low-pitched region of the L+H* (nuclear) pitch accent could be distinguished even if there were no H% boundary tone intervening between the L+H* pitch accent and the L- phrase accent for the preceding phrase. In the very similar contours of example utterances <<noone>> and <<for-marianna>>, on the other hand, there is no H% boundary tone, and one must pay close attention to the timing in order to decide whether the accent in the second phrase should be transcribed as H* or L+H*. (Note that the first utterance in <<for-marianna>> also probably illustrates grouping at the level of the intermediate phrase and not a full intonation phrase; see Section 2.4 for the difficulty of telling these levels apart in this context.)

      EXAMPLE <<noone>>: But Marianna knows noone. L+H* L-L% L+H* L-L% [GIF}
      EXAMPLE <<for-marianna>>: 1) That one's for Marianna. H* L- L+H* L-L% 2) Give me the brown one for Marianna. H* H* L-L% H* L-L% [GIF}

      The first response alternative in example utterance <<mother4>> illustrates another idiomatic intonation contour which might be confused with L+H*. This is the `surprise-redundancy' contour described by Sag & Liberman (1975). Here the preceding low pitched region comes from a L* pitch accent on a prenuclear accented word. The second response alternative shows the subtle way in which this rising sequence differs from L+H*. The simple interpolation from the L* to the H* is more gradual than the steep rise within the L+H* accent, although the difference can be very subtle when there are only a few syllables between the two accents in the L* H* sequence, as it is here.

      EXAMPLE <<mother4>>: Who's it for? Mary's mother. It's for Mary's mother. L* H* L-L% L* H* L-L% *? L+H* L-L% [GIF}

    3. The timing of the phrase accent and "upstep"

      The examples so far also illustrate two important points about the phrase accent. First, the phrase accent is unlike the boundary tone in that it is not necessarily localized at the phrase edge. Rather, when the nuclear accent is far from the end of the intermediate phrase, the phrase accent fills in the space between it and the phrase edge, creating a long flat valley for L- realized over a long stretch, as in the first speaker's production in example utterance <<names>> (the region between 11.06 and 11.64 seconds), or a long plateau-like region for H- realized over a long stretch, as in the second speaker's production in example utterance <<names>> (the region between 13.3 and 13.9 seconds). Second, the H- phrase accent triggers an "upstep" (a local raising of the pitch range to the end of the phrase), so that a following H% boundary tone is realized as a second rise at the end of the plateau-like region. This second point can also be seen clearly in example utterance <<names>> starting at 13.9 seconds. Compare the much lower f0 target for the H% boundary tone in example utterance <<names>> at 11.78 seconds, where the H% occurs after a L- phrase accent and therefore is not realized in such an upstepped pitch range.

      EXAMPLE <<names>>: Anna may know my name, and yours too. Anna may know our names? H* L-H% H* H* L-L% L* H-H% [GIF}

      (Some experienced transcribers may want to call the pitch accent on "Anna" in the first sentence a L+H*. We remind them of the annotation guidelines that say in effect "Whenever there might be any doubt whatsoever, such as on the absolute utterance initial syllable, choose H* rather than L+H*.")

      The summary statement on ToBI conventions prescribes that, in a waves(tm) label file, the phrase accent (or phrase accent and following boundary tone) should be marked at a point at or just before the end of the last segment in the word ending the intermediate phrase (or full intonational phrase) and always before the related break-index mark. The conventions say that the phrase accent should be placed here even when the nuclear accent occurs quite early and the phrase tone is realized over a long period of time, as in these two example utterances.
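
      The placement rule lends itself to a mechanical check on a finished transcription. The sketch below is one possible check, not a tool distributed with the ToBI materials. It assumes that the tone tier and break-index tier have already been read into lists of (time, label) pairs, that break index 3 marks an intermediate phrase boundary and break index 4 a full intonation phrase boundary (as in the standard ToBI break-index conventions), and that "just before" means within a hypothetical 50 ms. The function name is likewise hypothetical.

```python
def check_phrase_tone_placement(tone_labels, break_labels, max_gap=0.05):
    """Return a list of (break_time, message) pairs describing problems.

    tone_labels  -- list of (time_s, label) pairs from the tone tier
    break_labels -- list of (time_s, index) pairs from the break-index tier
    max_gap      -- hypothetical limit on how far before the break-index
                    mark the phrase accent label may sit (seconds)
    """
    phrase_tones = [(t, label.replace(" ", "")) for t, label in tone_labels
                    if label.startswith(("L-", "H-", "!H-"))]
    problems = []
    for break_time, index in break_labels:
        if index < 3:
            continue  # no phrase accent is expected below break index 3
        nearby = [label for t, label in phrase_tones
                  if break_time - max_gap <= t <= break_time]
        if not nearby:
            problems.append((break_time,
                             "no phrase accent at or just before the break index"))
            continue
        has_boundary_tone = nearby[-1].endswith("%")
        if index == 4 and not has_boundary_tone:
            problems.append((break_time, "break index 4 without a boundary tone"))
        if index == 3 and has_boundary_tone:
            problems.append((break_time, "break index 3 with a boundary tone"))
    return problems


tones = [(1.00, "H*"), (1.48, "L-"), (2.10, "H*"), (2.59, "L-L%")]
breaks = [(1.50, 3), (2.60, 4)]
report = check_phrase_tone_placement(tones, breaks)
```

      The sample tiers above pass the check, so the report is empty. A phrase accent label placed after its break-index mark, or an intonation phrase boundary transcribed without a boundary tone, would each produce an entry.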

      Note that when the nuclear accent is close to the end of the intonation phrase, it is impossible to discern any inflection point between the high f0 target for the H- phrase accent and the even higher f0 target for the upstepped H% boundary tone. The upstepped boundary tone after "jam" in example utterance <<jam1>> illustrated the smooth single rise that results in this case.

      Example utterance <<money>> illustrates the full paradigm of combinations of phrase accent and following boundary tone that can occur at the end of an intonation phrase. Note that because of the upstep of the pitch range after the H- phrase accent, the L% boundary tone of a H- L% sequence does not have an absolutely low f0 target, just a lower one than that of the upstepped H% boundary tone. The contrast between a H- L% and a H- H% sequence is particularly salient when the preceding nuclear pitch accent is H*, as in the two sentences in example utterance <<name1>>.

      EXAMPLE <<money>>: 1) Is that Marianna's money? H* H* L-H% 2) That's Marianna's money. H* L-L% 3) That's Marianna's money. H* H-L% 4) Is that Marianna's money? L* L* H-H% [GIF} [GIF}
      EXAMPLE <<name1>>: My name is Marianna. in two productions 1) H* H-H% 2) H* H-L% [GIF}

      The productions in <<money>> of the two combinations with final L% boundary tone illustrate another potential difficulty, and highlight the importance of listening to the speech and not just looking at the f0 record when doing the intonational analysis for the tone tier transcription. When an intonation phrase is not the last one in an uninterrupted stretch of speech and it ends with a L% boundary tone, it is difficult to distinguish from an intermediate phrase ending with the corresponding phrase accent just by examining the f0 contour. That is, the pitch differences between a L- L% sequence and a mere L-, or between a H- L% sequence and a mere H-, are very subtle at best. Here the transcriber must rely on the subjective sense of degree of disjuncture, which is probably cued by such other things as the amount of preboundary lengthening or the degree of final lowering in the case of L- L% versus L-. (Note that the difference here must be transcribed also on the break index tier -- Section 3.) Example utterances <<park2>> and <<oregano>> illustrate this difficulty. In <<park2>>, looking just at the f0 contour, we might have L- L% and a full intonation phrase boundary or just L- and a mere intermediate phrase boundary between the nuclear H* pitch accents on "probably" and "pleasantest". The ambiguous durational cues (two experienced transcribers argued even over whether the boundary should be before or after the "the") support the notion that this must be the latter, a mere intermediate phrase boundary. By contrast, the strong sense of pause (caused by the lengthening on "and"?) in the tonally identical stretch between the accents on "shortest" and "probably" supports a full intonation phrase. In <<oregano>> the two productions contrast the L* H- intermediate phrases typical of a list in the first production with the ambiguous H* H- or H* H- L% plateaus of the second.

      EXAMPLE <<park2>>: Definitely the shortest and probably the pleasantest H* L- H* L-L% H* L- H* way to go is through the park. L- L+H* L-L% [GIF}
      EXAMPLE <<oregano>>: 1) Let's see I need oregano 'n marjoram 'n some H* H* L-L% L* H- L* H- fresh basil okay? L+H* !H* L- H* H-H% 2) Oh I don't know it's got oregano 'n marjoram H* !H* !H* L-L% H* H- H* H- 'n some fresh basil. H* H-L% [GIF}

      The f0 patterns on "oregano" and "marjoram" also illustrate another difference between the H- and L- phrase accents, particularly in the context of unlike tones on the preceding nuclear accent -- i.e. in the context of L* H- versus, say, H* L-. When the nuclear accent is on an early syllable in the last word in the phrase, the L- of a H* L- sequence seems to kick in immediately, with a sharp fall that typically begins during or just after the accented syllable. In the analogous situation, the rise from a L* to a H- begins as early, but the f0 change is much more gradual. Here, for example, the f0 seems to be rising continuously from about a third of the way into the accented syllable all the way to the end of the phrase. In this case, there is no real inflection point leveling out into a plateau.

      The last clause of the first production in <<oregano>> also illustrates anew the difficulty mentioned above in connection with utterance <<mother4>> in Section 2.2. What is the best analysis of the fall to a low level immediately after "marjoram" and subsequent rise to a high f0 on "fresh"? How can we distinguish, say, a sequence of L* H* from the L+H* that we have transcribed? One thing to note is that, since accented syllables must be stressed, other characteristics of a syllable must be compatible with a tonal analysis that puts a pitch accent on it. The words "and" ("'n") and "some" here do not sound stressed at all. Both have been reduced to the point that they have syllabic nasals as their nuclei. This supports the analysis of L+H* on "fresh" over an analysis of L* H*, even though the fall from "marjoram" looks so much steeper than the gradual rise from "and" back up to the H tone on "fresh" that the f0 pattern may seem more compatible with a L* on "and". Note, however, that there may be mistracking due to breathy voice on "and". Also, the "some" shows a strong perturbation from the initial voiceless [s] that obscures how low the intended f0 is later in the syllable.

    4. Difficult combinations of nuclear pitch accent and following phrase accent

      In most of the examples so far, nuclear H* has occurred before L-, where the following fall in pitch makes it easy to discern the pitch accent, and nuclear L* has occurred only before H-, where it was easy to spot from the immediately following rise in pitch. But the choice of pitch accent type is independent of the choice of following phrase accent, and there is nothing to preclude H* from occurring before H- or L* before L-. The second production in <<oregano>> illustrated the first of these cases, the `stylized high-rise' contour (Ladd, 1980), which is becoming more and more familiar to contemporary American English speakers.

      The combination of L* and following L- is also not rare. There are two situations where this sequence is typically encountered. The first is illustrated in <<nose>>, and the first sentence in <<tags>>. This L* L- H% pattern is typical of such vocative tags. The second sentence in <<tags>> shows that tag questions can have this contour too. However, tag questions can also take a H* L- L% intonation pattern (the third sentence in <<tags>>), which seems to be precluded on the vocative tag for pragmatic reasons (see Beckman & Pierrehumbert, 1986).

      EXAMPLE <<nose>>: Oh don't nuzzle me you marmalade-nose. X*? L- H* !H* L- L* L-H% [GIF}
      (Section 2.8 will explain the `!' diacritic in the second pitch accent, and Section 2.9 will explain the X*? accent on "Oh".)

      EXAMPLE <<tags>>: 1) Where are you going, Willy? H* L- L* L-H% 2) He won't be going, will he? H* H* L- L* L-H% 3) He won't be going, will he? H* H* L- H* L-L% [GIF} [GIF} [GIF}

      The f0 contour for the L* L- H% vocative tag contour can be confused with a longish postnuclear stretch in the sequence H* L- L%, as shown in example utterance <<vocative1>>. As with medial L- intermediate phrase boundary versus L- L% intonation phrase boundary discussed above, the transcriber may have to rely entirely on the subjective impression of greater versus lesser disjuncture to capture this difference between an intermediate phrase boundary at a vocative tag and no boundary. (Note again that the difference here must be accompanied by different symbols on the break index tier.)

      EXAMPLE <<vocative1>>: 1) Anna will win, Manny. H* L- L* L-H% 2) Anna will win Manny. (She won't lose him). H* H* L-L% [GIF}

      The other situation in which one often sees a L* nuclear accent and following L- phrase accent is in the `contradiction contour', an intonational idiom illustrated in <<gloria>> and <<elephant3>>. This contour is discussed at length in Sag & Liberman (1975) and chapter 3 of Ladd (1980). The L* L- H% sequence starting at the nuclear syllable is like the contour in the vocative tag, but this is not the only essential component of this intonational idiom. Crucially, there must be a fall from an early prenuclear H* pitch accent (or from an initial %H boundary tone -- see next section) onto a nearby L* accent. If the L* nuclear accented syllable is far from the beginning of the utterance (as is the case with the nuclear accent on "incurable" in <<elephant3>>), there might be another L* on some prenuclear syllable with relatively prominent secondary stress (e.g., the fourth syllable of "elephantiasis" in <<elephant3>>). Note that <<elephant3>> also illustrates the possibility of having two pitch accents on one word when there is more than one full stressed syllable (see Section 2.9 for more examples).

      EXAMPLE <<gloria>>: Ah Gloria you're not ugly. H* L* L-L% H* L* L-H% [GIF}
      EXAMPLE <<elephant3>>: Elephantiasis isn't incurable. H* L* L* L-H% [GIF}

    5. The initial %H boundary tone

      The contradiction contour also illustrates another phenomenon that we have not discussed so far -- namely, the possibility of a boundary tone marking the initial as well as the final boundary of an intonation phrase. Utterance <<bananas>> is an example. Here the event that provides the high pitch for the early fall onto the L* tone cannot be an accent, since the first syllable of "bananas" is reduced (i.e. completely unstressed and hence unaccentable).

      EXAMPLE <<bananas>>: Bananas aren't poisonous. %H L* L* L-H% [GIF}

      In the intonational analysis assumed in the ToBI system, the final boundary tone is mandatory, whereas an initial one is not. The initial boundary tone differs from the final ones also in that it seems to be limited to absolute utterance-initial position, and in that it is always high. Thus, unlike the final boundary, where there is a paradigmatic choice between L% and H%, the phrase-initial boundary tone contrasts merely with the absence of a boundary tone. That is, %H contrasts with the default (unmarked) initial pattern, which in absolute utterance-initial position tends to start in the middle part of the speaker's pitch range (as opposed to the beginning of utterance-medial intonation phrases, where the pitch simply continues from the value at which the previous phrase ended). This utterance-initial midrange pitch value is illustrated in <<loan1>>, where the first and second productions show a rise from the mid value to H* and a fall from the same default mid value to L*, respectively. The third production then contrasts with the second in that it has an initial %H boundary tone. These two examples also illustrate the typical effect of having an initial %H boundary tone in the surprise-redundancy contour. The one with the initial %H has a greater vividness, conveying either more surprise or more insistence that this is the information that the hearer should really already know.

      EXAMPLE <<loan1>>: You need a loan. In three productions 1) H* H* L-L% 2) L* H* L-L% 3) %H L* H* L-L% [GIF}

      The ToBI conventions prescribe that %H be an analysis of last resort. That is, like L+H*, which is used instead of H* only when there is no other possible explanation for the low pitch before the peak (see Section 2.2, above), %H is used only when there is no other plausible explanation for an initial high pitch. It should be marked only when a high-pitched beginning for an utterance cannot be attributed to a H* accent on the first few syllables in the utterance -- i.e., when the first word itself does not appear to be accented or when its accented syllable occurs too far into the word to account for the initial high target. Thus it should not be used in <<gloria>>, where the high pitch at the beginning of the phrase "You're not ugly" is attributed to a H* accent on the first syllable "You're". In <<elephant3>>, similarly, although the main stress of the word "elephantiasis" clearly is at the L* accent on the fourth syllable, we have the option of analyzing the earlier high pitch as another pitch accent earlier in the word, since the first syllable has a lexical `secondary stress' (i.e. is rhythmically more prominent than the surrounding syllables).

    6. Pitch accent timing, and the L*+H pitch accent

      The examples so far have illustrated both possible phrase accents, both boundary tones, and three of the five types of pitch accent -- the low accent (L*), the plain `peak accent' (H*), and the `rising peak accent' (L+H*). We have also discussed the timing of the f0 peak in the two types of peak accent, pointing out that it is somewhat variable; in particular, that it occurs somewhat later relative to the segments of the accented syllable when it is the accent at the beginning of a `hat pattern' contour and relatively earlier before L- (see Section 2.2). Such differences in timing are not distinctive, and seem to be related to the phonetics of pre-boundary lengthening. For example, we might think of the relatively earlier placement of the peak in the latter case as a matter of lengthening the part of the syllable after the nuclear pitch accent peak in order to accommodate the L- phrase accent within the intermediate phrase (see, e.g., Silverman & Pierrehumbert, 1989, for a discussion of such phonetic accounts). These phonetic differences in timing can be ignored in the transcription on the tone tier.

      There is another difference in the timing of apparent peak accents, however, that must not be ignored, because it is distinctive. Both the small rise from mid pitch that is usually seen with an utterance-initial H* accent and the definitive rise from low pitch that must be present in order to transcribe a L+H* accent contrast phonologically with another accent type that involves a rise from low pitch into a peak that occurs much later, making the low tone align with the accented syllable. This is the `scooped' accent L*+H, illustrated in the first production of <<millionaire>>. The second production in this example utterance is of the contrasting `rising peak' accent L+H*. These two pitch accents have very different meanings, as described by Ladd (1980) and Ward & Hirschberg (1985), and the difference in timing here is a phonological difference that is represented in the ToBI system by the contrasting specifications of L*+H versus L+H*. That is, phonologically, both of these accents are a L plus a H, but in the `scooped' accent, the L is the starred tone (associated to the accented syllable) rather than the H. The associated phonetic difference is that the rise is much later in the `scooped' accent, and it is the timing of the minimum f0 relative to the segments of the associated syllable that is salient.

      EXAMPLE <<millionaire>>: Only a millionaire. in two productions 1) H* L*+H L-H% 2) H* L+H* L-H%

      Because it is the L target in the `scooped' accent that is associated to the stressed syllable, and not the H, the high pitch target is specified only as occurring somewhat later than the L, and the timing of the peak f0 relative to the segments is not controlled. If the stressed syllable is long, the rise to the peak might be accomplished entirely within the accented syllable. But if the stressed syllable is short, the peak may occur one or more syllables later. This is illustrated in example utterance <<stein>>, which shows a relatively fixed rise relative to the low f0, which makes the peak occur within the last part of the long syllable "Stein" but two syllables later relative to the short accented first syllable in "rigamarole".

      EXAMPLE <<stein>>: Stein's not a bad man. L*+H L-H% Rigamarole is monomorphemic. L*+H L-H%

      Note that although the crucial difference between L+H* and L*+H is the timing of the low-pitched portion, some speakers produce a secondary difference, whereby the L of L*+H is consistently somewhat lower in pitch than the L of L+H*. This is particularly apparent in the first accents of the two productions in <<bloomingdales>>. This also means that L*+H is not nearly so confusable with H* as is L+H*. There are also considerable interspeaker differences. Some speakers have rather mid-level L tones even in L*+H. (This fact will be relevant when you transcribe <<noodle1>> and <<noodle2>> in PRACTICE THREE.) Other speakers have very low L tones even in L+H*. This does not affect the relative heights of the L's in L*+H versus L+H*.

      EXAMPLE <<bloomingdales>>: There's a lovely one in Bloomingdale's. in two productions: 1) L*+H L*+!H L-H% 2) L+H* L+!H* L-L%

      Another thing to note in the sequence of accents in these two productions is the introduction of another set of symbols for the second (the nuclear) accents. These new symbols are actually the same accent types as the first accent in their respective utterances. The extra `!' in the symbols for the nuclear accents is a diacritic to denote the way in which the second accent peak is lower than the preceding peak. This lowering of the second peak is due to a process called `downstep', which is defined as a categorical compression of the pitch range that reduces the f0 targets for any H tones subsequent to the specification of the downstep -- i.e. the counterpart of the `upstep' triggered by the H-. We will describe downstep and what triggers it in more detail in Section 2.8 below after introducing the last remaining pitch accent type in the ToBI analysis, transcribed as H+!H*.

      Finally, as in deciding how low the f0 must be to count as L+H* rather than H*, transcribers should be aware of slight interspeaker differences in the timing of the L tone in differentiating L+H* from L*+H. Our impression is that American speakers (such as the speaker of <<bloomingdales>>) do not always make L*+H rise as late as most RP British speakers do. The second (downstepped) L*+!H on the word "Bloomingdale's" in the first production, in particular, might seem quite early to a British transcriber. Note, however, that there is a very low pitch level throughout the [b] and the [l], and the f0 does not begin to rise until the voicing begins in the [u], making the peak occur considerably after the [m] release. This is quite late for a nuclear L+H* before a L- (cf. our comments above in Section 2.2), as can be seen by comparing this rise to the rise in the comparably downstepped nuclear L+H* in the second production. In the second production, the rise begins before the [b] and is completed well before the release of the [m].

    7. The H+!H* pitch accent

      The nuclear accent in the second production in example utterance <<theresa>> illustrates this pitch accent type. It is characterized by a fall from a preceding higher pitch onto a lower pitch level on the accented syllable. This accent type corresponds to the type called H+L* in Pierrehumbert's original system. The substitution of the letters `!H' for `L' in the name of the pitch accent reflects the fact that the pitch target on the accented syllable is only somewhat lower than the preceding H tone target; it is not so low as the f0 target for the plain `low' accent (L*) or for the L tone of the `scooped' accent (L*+H), or even for the L of the `rising peak' accent (L+H*). The renaming of this pitch accent type was intended to make the analysis somewhat more concrete and intuitive for the transcriber.

      EXAMPLE <<theresa>>: in two productions 1) You want an example? How about Mother Theresa? H* H* H-H% H* *? H* L-L% 2) You want an example? Mother Theresa. H* H* H-H% H+!H* L-L%

      (The *? in the first production indicates uncertainty about whether that word is accented -- see Section 2.9.)

    8. Downstep

      Downstep is a phonologically triggered compression of the pitch range that lowers the f0 targets for any H tones subsequent to a downstep trigger. In Pierrehumbert's model of intonation, downstep is said to be triggered by any bitonal pitch accent. In example utterance <<bloomingdales>> discussed above, for example, the progressive reduction of the second L+H* or L*+H peak relative to the preceding one would be analyzed as an automatic consequence of the fact that these two pitch accent types are composed of two tones, L plus H.

      In the ToBI system, this compression of the pitch range is marked by having alternative names for accents which are used for the first downstepped high tone target after the downstep trigger. Thus in the first production in example utterance <<bloomingdales>>, the second `scooped' accent is transcribed with L*+!H rather than L*+H to denote that a downstep has occurred. And similarly in the second production in this example utterance, the second `rising peak' accent is transcribed with L+!H* rather than L+H*. When there are more than two such bitonal accents in a row, each accent triggers another instance of downstep, so that each subsequent accent peak is reduced yet again relative to the immediately preceding one. This is illustrated in example utterance <<yellow2>>. Example utterance <<calling>> shows that it is not just pitch accents which are affected by downstep. The !H- phrase accent here is reduced to a mid level by the downstep triggered by the preceding L+H* nuclear pitch accent. (Note the characteristic mid-tone tail, as the downstepped !H- phrase accent triggers a subsequent upstep of the L% boundary tone.)

      EXAMPLE <<yellow2>>: There's a lovely yellowish old one. H* L+H* L+!H* L+!H* L-L%
      EXAMPLE <<calling>>: Marianna. L+H* !H-L%

      Transcribers who are familiar with Pierrehumbert's system will recognize that ToBI differs in this explicit marking of the reduced pitch range directly on the first H tone affected by the downstep. In Pierrehumbert's system, downstep is not explicitly marked because it is redundant to the specification of the trigger in the preceding bitonal accent.

      Pierrehumbert's system differs from ToBI in yet another way; it includes a sixth pitch accent type, H*+L, which bears the same relationship to H+L* (ToBI's H+!H*) as L*+H does to L+H*. That is, the fall to a slightly lower pitch target occurs after the accented syllable instead of into the accented syllable. Typically, the endpoint of this fall is no lower than the pitch target of a subsequent downstepped H tone, and the contrast between H* and H*+L thus hinges on recognizing the downstep triggered by the H*+L.

      Many first-time transcribers find this comparatively abstract analysis unintuitive and therefore difficult. In the ToBI system, therefore, we have eliminated H*+L in favor of marking the downstep directly on the first reduced H tone. Thus H* in ToBI corresponds to both plain H* and the downstep-triggering H*+L. Users of databases transcribed with the ToBI system who need to analyze the data in terms of the intonational categories in Pierrehumbert's system can recover each H*+L tone by searching for a downstepped !H* or !H- marked immediately after a H* (or !H*) accent. For example, in utterance <<really1>> the second production is a plain `hat pattern' (H* H* L- L%) whereas the first is a `downstepped hat', which would be transcribed as H*+L H* L- L% in Pierrehumbert's system. The second production in utterance <<calling2>> illustrates another very familiar intonation pattern, the `calling contour', which in Pierrehumbert's system would be transcribed with H*+L H- L%.

      EXAMPLE <<really1>>: That's really illuminating. in three productions 1) H* !H* L-L% 2) H* H* L-L% 3) Transcribe this one in PRACTICE THREE
      EXAMPLE <<calling2>>: Anna. in two productions 1) L* H-H% 2) H* !H-L%
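
      The search for H*+L described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_pierrehumbert` and the representation of a transcription as a list of separate tone labels are hypothetical, and the sketch handles only the H* / !H* / !H- case discussed here, leaving downstepped bitonal accents such as L+!H* untouched.

```python
def to_pierrehumbert(tones):
    """Rewrite ToBI tone labels so that a H* (or !H*) immediately
    followed by a downstepped !H* or !H- becomes H*+L.

    `tones` is a list of separate labels, e.g. ["H*", "!H*", "L-", "L%"].
    Hypothetical helper for illustration; not part of any ToBI tool.
    """
    out = []
    for i, tone in enumerate(tones):
        nxt = tones[i + 1] if i + 1 < len(tones) else None
        plain = tone.lstrip("!")  # "!H*" -> "H*", "!H-" -> "H-"
        if plain == "H*" and nxt in ("!H*", "!H-"):
            # This accent is the downstep trigger.
            out.append("H*+L")
        elif tone in ("!H*", "!H-"):
            # The downstep mark is redundant once the trigger is explicit.
            out.append(plain)
        else:
            out.append(tone)
    return out


# First production of <<really1>>, the `downstepped hat':
print(to_pierrehumbert(["H*", "!H*", "L-", "L%"]))
# Second production of <<calling2>>, the `calling contour':
print(to_pierrehumbert(["H*", "!H-", "L%"]))
```

      The two calls print the sequences H*+L H* L- L% and H*+L H- L%, matching the Pierrehumbert-style transcriptions given in the text.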

      A fact to note about downstep is that it is local to an intermediate phrase. Each new intermediate phrase represents a new paradigmatic choice of pitch range, at which downstep can be reset. This is illustrated in <<yellow3>>, where the intermediate phrase boundary after the "yellowish" allows a new choice of pitch range, so that the peak on "old" is not downstepped relative to that on "yellowish", unlike in <<yellow2>>. (See Section 2.9 to read about the X*? symbol marking the peak on "old". It indicates that there is an accent on "old" but we are not completely certain which type of accent it is. It could be simply H* after an unexpectedly steep rise from the preceding L-. Or it could be L+H*, with a less steep rise than expected.)

      EXAMPLE <<yellow3>>: It's lovely and yellowish, and it's an old one. L+H* L+!H* L- X*? L-L%
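
      As a rough illustration of how downstep accumulates within an intermediate phrase and is reset at the next one, the following Python sketch counts the `!' marks over a list of tone labels. The function name `downstep_levels` and the list-of-strings input are assumptions made for this illustration. The sketch also assumes that the `!' inside H+!H* belongs to the name of that accent and is not counted as a separate downstep, and that any label containing `-' closes the intermediate phrase.

```python
def downstep_levels(labels):
    """Return (accent, level) pairs for the pitch accents in `labels`.

    `level` is the number of downsteps marked so far in the current
    intermediate phrase; it is reset after each phrase accent.
    Hypothetical helper for illustration only.
    """
    level = 0
    out = []
    for label in labels:
        # Assumption: the "!" in H+!H* is part of the accent name.
        if "!" in label and label != "H+!H*":
            level += 1
        if "*" in label:  # pitch accents (and *? marks) carry a star
            out.append((label, level))
        if "-" in label:  # a phrase accent ends the intermediate phrase
            level = 0
    return out


yellow2 = ["H*", "L+H*", "L+!H*", "L+!H*", "L-L%"]
yellow3 = ["L+H*", "L+!H*", "L-", "X*?", "L-L%"]
print(downstep_levels(yellow2))
print(downstep_levels(yellow3))
```

      For <<yellow2>> the last two accents come out at levels 1 and 2, whereas in <<yellow3>> the accent after the L- is back at level 0.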

      Note, however, that the peak on "yellowish" looks downstepped relative to that on "lovely". This is due to the relationship between the pitch ranges chosen for the two intermediate phrases. The topic structure of a discourse is marked in part by the choice of pitch range for the succession of intermediate phrases; large topics begin with expanded pitch range and end with very reduced pitch range (see, e.g., Brown, Currie, & Kenworthy, 1980; Hirschberg & Pierrehumbert, 1986). This often creates an effect of `paragraph intonation' in which the relationship among successive phrasal pitch ranges mimics the phrase-internal relationship between preceding peaks and subsequent lower downstepped peaks. The successive pitch ranges in utterances <<park1>> through <<park5>> in PRACTICE TWO above illustrate `paragraph intonation' over a longer time frame.

      Sometimes it is not easy to tell the difference between two phrase-internal accents with the second downstepped relative to the first and two intermediate phrases with the second phrase in a lower pitch range relative to the first. Example utterance <<levels>> illustrates such a difficult case.

      EXAMPLE <<levels>>: There are many intermediate levels. L+H* L+!H* L+!H* L-L%

    9. Uncertainty about accent placement and accent type

      In addition to conveying topic structure, pitch range variation is used for many discourse purposes. For example, a lower (or higher) pitch range than surrounding phrases can be used to set off a stretch of speech as an aside or a parenthetical. This is illustrated in example utterance <<capote>>. Also, expanded pitch range can convey extra liveliness or involvement, as illustrated in the much larger pitch range on the phrase "Now be careful" in <<onions>>.

      EXAMPLE <<capote>>: Capote died Saturday at the Bellaire home of
                          L+H* !H* L- H* H* L+H* L-
                          Joanne Carson (estranged wife of talkshow host
                          L+H* L-L% L+H* L- H* *?
                          Johnny Carson), and she was among those who
                          H* L-L% L+H* !H* !H*
                          eulogised him.
                          H+!H* L-L%
      EXAMPLE <<onions>>: Okay now chop the onions... Now be careful.
                          H* L- H* H+!H* H- L* L+H* L-H%
                          Okay, chop the onions, and put them into that bowl.
                          L+H* L-L% H+!H* H- H* H+!H* L-L%

      Because speakers can vary their pitch ranges seemingly without limit to convey discourse organization or degree of involvement, and because downstep can happen many times within a single phrase, sometimes it is difficult to tell whether a tone is H or L, even when one is sure a tone is there. The accents on "smoke" and "yeah" in the asides in example utterance <<smoke>> illustrate this. Or, in very reduced pitch ranges, it can be difficult to tell whether a syllable is accented or not. Utterance <<sold3>> illustrates this. In the very reduced pitch range after the second downstep on "else", it is impossible to know how many accents there are.

      EXAMPLE <<smoke>>: Can I smoke? <<interviewer says "You can smoke.">>
                         X*? H-H%
                         Yeah? <<interviewer: "Does this door have to stay open?" If it>> No, it doesn't have to be; you can close it.
      EXAMPLE <<sold3>>: He sold it to somebody else, they bought the
                         H* !H* !H* *?
                         whole company, and he made lots of money on
                         *? -X? *? *?
                         the business...
                         H* L-L%

      In the first type of uncertainty, ToBI prescribes that the transcriber use the notation X*? to simply mark the clear presence of an accent, without forcing an arbitrary commitment to the accent's type. Thus, the accent on "smoke" should be transcribed as X*? rather than as L* or H*. (X*? should not be used to mark uncertainty between the L- and H- phrase accents or between the L% and H% boundary tones. There the transcriber should instead mark X-? for the phrase tones and X%? for the boundary tones.) In the second type of uncertainty, when the transcriber is not certain even that there is a pitch accent (as, for example, on the "bought" in utterance <<sold3>>), the mark *? should be used instead of X*?.
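
      The four uncertainty marks just described can be summarized as a small lookup table. The Python sketch below is only an illustration: the names `UNCERTAINTY_MEANINGS` and `uncertain_labels` are hypothetical, and a transcription is assumed to be a plain list of label strings.

```python
# Meanings paraphrased from the paragraph above.
UNCERTAINTY_MEANINGS = {
    "X*?": "accent clearly present, accent type uncertain",
    "*?": "uncertain whether there is a pitch accent at all",
    "X-?": "phrase accent uncertain (L- versus H-)",
    "X%?": "boundary tone uncertain (L% versus H%)",
}


def uncertain_labels(labels):
    """List (position, label, meaning) for every uncertainty mark.

    Hypothetical helper for illustration; `labels` is a list of
    tone-tier label strings in the order they were transcribed.
    """
    return [
        (i, label, UNCERTAINTY_MEANINGS[label])
        for i, label in enumerate(labels)
        if label in UNCERTAINTY_MEANINGS
    ]


for position, label, meaning in uncertain_labels(["*?", "X*?", "H-H%"]):
    print(position, label, meaning)
```

      A listing of this kind can help when reviewing a transcription for places where a second opinion is wanted.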

      In addition to very compressed pitch ranges, there are several particular tone sequences which are prone to inducing uncertainty about the presence of accent. One such case is the downstepped H* !H* !H* ... sequence just illustrated. In many cases, words after the first H* in such sequences are ambiguous between being accented with !H* and being `deaccented' (i.e. being in the postnuclear low stretch in a H* L- L% sequence). This is not always ambiguous, however. Utterance <<anna2>> illustrates a clear contrast between downstepped and deaccented.

      EXAMPLE <<anna2>>: Anna married Lenny. in two productions 1) H* L-L% 2) H* !H* !H* L-L%

      Another case of inherently ambiguous tone sequences which occurs very commonly is when there is a long stretch of speech in a `hat pattern' contour. Utterance <<peel>> illustrates this. The word "off" sounds very prominent, giving a strong subjective impression of accent. However, because the word lies in the plateau between the first H* on "peeled" and the nuclear H* pitch accent on "Hawaii", it is difficult to tell whether "off" also bears a H* accent, or just the preceding "peeled". The word "host" in the phrase "talkshow host Johnny Carson" in <<capote>>, the word "married" in the first production of <<anna>>, and the word "Mother" in the first production in <<theresa>> in Section 2.7 above are three more illustrations of this very common ambiguity. The first production in example utterance <<made4>> (given in Section 1.4 above) illustrates the analogous situation with a L*; "Marianna" probably has a L* accent (note the dip down into it from the mid pitch level that begins the sentence) and "marmalade" clearly has a L* accent, but what about the "made" in between?

      EXAMPLE <<peel>>: [Ever since the roof of a 19-year old Aloha Airlines Boeing 737] peeled off over Hawaii last April, ... H* *? H* L- H* L-L%

      (See example utterance <<older-aircraft>> in the next PRACTICE for the whole context of this phrase.)

      In cases such as these it is better to err on the side of conservatism and mark the word with *? or nothing. In particular, the transcriber should take care not to let grammatical expectations guide the marking of accents. If we find ourselves giving in to such thoughts as "This is a content word and therefore probably is accented", we preclude the use of our transcriptions to test whether content words are indeed likely to be accented.

      A final source of uncertainty arises particularly when transcribing sentences in isolation, extracted from the context in which they originally occurred. This is uncertainty about accent type due to unfamiliarity with a particular speaker's normal speaking range for a particular style of speech. For example in utterance <<hurt>>, the nuclear pitch accent on "hurt" is probably L*; the pitch is lower than the "neutral" value at the beginning of the utterance. However, 200 Hz is very high for a low, and unless one knows from experience that this speaker has a very high-pitched voice, one might be tempted to transcribe this utterance with a H* nuclear accent.

      EXAMPLE <<hurt>>: But would it hurt you? *? X*? H-H%

      Utterance <<beef>> also illustrates this point. Here we have transcribed each of the accents in the second speaker's response as L*, since we know from many other examples in this labelling guide that this speaker's normal range is higher than the first speaker's. This utterance also exemplifies an intonation pattern we have not shown before: a sequence of all low tones, for all of the accents, the phrase accent, and the boundary tone.

      EXAMPLE <<beef>>: Here's your Chateaubriand, ma'am.
                        H* L+H* L- L* L-H%
                        I don't eat beef.
                        L* L* L* L-L%

    10. When something is accented that you would not expect

      It is also important not to fall into the opposite reasoning and hesitate to mark an accent simply because the accented word is not a content word. Example utterance <<AND1>> is a nice illustration of this point. Here, the pitch pattern unequivocally supports a nuclear `peak' accent on the "and". There is a suggestion of accent on the same word in <<hennessy>>, but less clearly; here the glottalization at the beginning of the word is the main cue that it is accented (recall that stressed vowel-initial syllables are set off by glottal stops -- Section 2.1).

      EXAMPLE <<AND1>>: improvements, and a schedule... H* H* L-H% H* L- H* L-L%
      EXAMPLE <<hennessy>>: Hennessy is widely respected for his legal
                            H* L- H* !H* L-L% L+H* !H-
                            scholarship and his administrative abilities.
                            H* H- H* *? H* L-L%

      Another thing to watch out for is that in very emphatic speech, a word with two fairly strong syllables can bear two pitch accents. Utterance <<understand>> is such a case, where our normal expectation is that only the most stressed final syllable in "understand" will be accented. Here, however, the first syllable is also accented, so that the phrase "to understand" can realize the `surprise-redundancy' contour L* H* L- L%.

      EXAMPLE <<understand>>: I'm simply trying to get you to understand. H* L* H- L* H- L* H- L* H* L-L%

      Example <<philadelphia>> gives another case of such double accents, apparently without the impetus of realizing any particular intonational idiom.

      EXAMPLE <<philadelphia>>: from Philadelphia to Dallas
                                    L+H* !H*   L-    H*   L-L%

      This phenomenon of double accents, and the apparently related phenomenon of `stress shift', have been examined extensively by Stefanie Shattuck-Hufnagel and her colleagues (see Ross, Ostendorf, & Shattuck-Hufnagel, 1992) in a corpus of newscasts which they have transcribed using something like the ToBI system. Some of the exercises involving this phenomenon are from this study. It is apparently fairly common.

    11. The point of highest f0

      The only other label on the tone tier which we have not discussed is the transcription of HiF0, the point of highest fundamental frequency associated with a pitch accent within an intermediate phrase. HiF0 is used currently as a rough measure of the phrase's pitch range. It is transcribed only for intermediate phrases in which there is an accent with a H component -- i.e. a H*, L+H*, L*+H, or H+!H* accent. Thus, for example, in <<made4>> (given above) one should not transcribe HiF0.
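
      The condition for transcribing HiF0 can be expressed as a simple check over the labels of one intermediate phrase. The following Python sketch is an illustration only; the function name `has_h_accent` and the list-of-strings input are assumptions. It treats any starred label containing an H as an accent with a H component, so downstepped variants such as !H* count, while uncertain marks such as X*? and *? do not.

```python
def has_h_accent(labels):
    """Return True if any pitch accent in the phrase has a H component.

    Hypothetical helper for illustration. Phrase accents and boundary
    tones (e.g. "H-H%") carry no star and are therefore ignored.
    """
    return any("*" in label and "H" in label for label in labels)


# Second speaker's response in <<beef>>: all low tones, so no HiF0.
print(has_h_accent(["L*", "L*", "L*", "L-L%"]))
# <<yellow2>>: several accents with a H component, so HiF0 is transcribed.
print(has_h_accent(["H*", "L+H*", "L+!H*", "L+!H*", "L-L%"]))
```

      The first call prints False and the second True.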

      The summary statement in the Annotation Conventions (Appendix A) offers the following advice about HiF0:

        Transcribers should take reasonable care to choose a point in time that reflects the target of the H for the accent. In several cases this will mean choosing some point other than the actual f0 maximum. For example, sometimes the highest f0 value in an accented syllable reflects the `intrinsic' effect of a voiceless consonant and will thus be a poor estimate of the speaker's choice of pitch range. More seriously, in a phrase where the highest accent-related f0 occurs in a H* H- H% sequence, choosing the absolutely highest value for HiF0 will artifactually inflate the pitch range estimate by the amount of the upstep on the H%. In such cases, we recommend that the syllable's amplitude contour be used to pinpoint HiF0 within the candidate region.
      For an example, see <<good1>> from PRACTICE TWO. HiF0 would be at the amplitude peak for "good" at about time 2.61.
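
      The recommendation to use the amplitude contour can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the function name `pick_hif0`, the parallel-list representation of the f0 and amplitude tracks, and the sample values are all invented for the example and do not come from any utterance in this guide.

```python
def pick_hif0(times, f0, amplitude, region):
    """Pick the HiF0 point inside `region` at the amplitude peak.

    times, f0, amplitude: parallel lists of frame times (s), f0 values
    (Hz, or None for unvoiced frames) and amplitude values.
    region: (start, end) of the candidate region chosen by the labeller.
    Returns (time, f0) or None if the region has no voiced frame.
    Hypothetical helper for illustration only.
    """
    start, end = region
    voiced = [
        i for i, t in enumerate(times)
        if start <= t <= end and f0[i] is not None
    ]
    if not voiced:
        return None
    best = max(voiced, key=lambda i: amplitude[i])
    return times[best], f0[best]


# Invented frames: f0 keeps rising after the amplitude peak,
# as it might in a H* H- H% sequence.
times = [0.10, 0.11, 0.12, 0.13, 0.14]
f0 = [210.0, 225.0, 230.0, 240.0, None]
amplitude = [0.3, 0.6, 0.9, 0.5, 0.2]
print(pick_hif0(times, f0, amplitude, (0.10, 0.14)))
```

      With these invented values the function returns the frame at 0.12 s (230 Hz) rather than the later f0 maximum at 0.13 s, which is the behaviour the quoted advice asks for.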

