Speech Link Protocol Specification

Version 1.0


November 23, 2000






Copyright © 2000-2001 SpeechWorks International, Inc.

This work may only be distributed under the terms of the SpeechWorks Open Document License v1.0.

Table of Contents

Introduction

Overview of Speech Link Functionality

Definitions

Caller

Initiator

Recipient

Setup

PSTN

Termination

Transfer

Transfer Target

Transferee

Transferor

Overview of Speech Link Operation

Speech Link Protocol

Speech link addresses

Call Setup - Simple

Limitations

Call Setup – With Handshake

Limitations

Call Termination

Call Transfer

Limitations

Call Transfer with Return

Message Formats

Message Body

Speech Link Headers

Sample Call Setup Session

Summary of Recipient Requirements

Call Setup

Call Termination

Transferring a Call

Summary of Initiator Requirements

Call Setup

Call Termination

Call Transfer


Overview of Speech Link Functionality

The speech link protocol is an application-layer control protocol for transferring callers between cooperating speech applications, pre-existing interactive voice response (IVR) applications based on dual-tone multi-frequency (DTMF, or TouchTone™) input, and human agents both in call centers and in other locations.  The voice path of the call can be carried via the Public Switched Telephone Network (PSTN), a Voice over IP (VoIP) network, or a combination of the two.  The control of the call is handled via an IP network, usually the Internet.  A connection managed via the speech link protocol is referred to as a “speech link”.

Speech links use the Session Initiation Protocol (SIP, IETF RFC 2543) to implement call control.  Information about the call is exchanged between cooperating applications using MIME-encoded user data in the SIP message body.  This message body is identified by the MIME type “application/speechlink”.  The speech link protocol uses speech link specific headers that are described later in this document.  Applications can exchange application-specific data through the body of the speech link message.  The body of a speech link message is itself a MIME message whose type is application-dependent.



Definitions

Caller

Typically a person communicating with a speech application using a telephone or similar device.


Initiator

The party initiating a speech link.  The initiator must already be in communication with the caller and know the speech link address of the recipient.  The initiator must be able to transfer calls on behalf of a recipient.


Recipient

The party receiving a call from an initiator.  Recipients can become initiators in their own right, bridging the call between the party that initiated the call with them and the new recipient.  Recipients can also request that the initiator transfer the call on their behalf.


Setup

The process of an initiator and a recipient exchanging SIP messages and PSTN signaling to establish a voice path between the caller and the recipient.


PSTN

The Public Switched Telephone Network.


Termination

The process of concluding a call between an initiator and a recipient.


Transfer

The process of transferring a caller from one recipient to another, via the initiator.

Transfer Target

The party to whom a caller is being transferred.


Transferee

The party transferring a caller on behalf of a previous recipient (the transferor) to a new one (the transfer target).


Transferor

The recipient requesting that the initiator act as a transferee in transferring a caller to a new recipient (the transfer target).

Overview of Speech Link Operation

Speech links require that a caller (usually a person) be connected to or through a voice network connection that is managed by an intelligent application.  This application can act as the initiator of a speech link to connect the caller to another application or person.

To establish a speech link, the initiator must know the “speech link address” of the intended recipient, identified by a speech link URI, which may be the same as the recipient’s SIP URI.

By exchanging IP messages in the speech link protocol with each other, the initiator and the recipient establish a voice connection between the caller and the recipient.  In the case of a PSTN connection, this is usually “bridged” (or “hairpinned”, or “tromboned”) between the initiator and both the caller and recipient.

Recipients can ask the initiator to transfer the caller to another recipient via a speech link address, and optionally can request the initiator to return the caller if the second recipient terminates the call.

Speech Link Protocol

While SIP is responsible for the session management, it is not responsible for the actual transmission of the voice data. The initial implementation of speech link uses the PSTN (Public Switched Telephone Network) for carrying the voice data.

The following diagram shows the connections between two speech link sites and their components.


The calling speech link site (the “initiator”) uses SIP messages to reach the called speech link site (the “recipient”) over IP.  In the process of establishing the SIP connection, the called site sends a PSTN phone number to the calling site.  To do this, the called site sends a SIP provisional response with the phone number to dial included in the body of the message.  Once the calling site has received the phone number and has established the connection with the called site through the PSTN, the session is established.

This is a major difference from regular SIP-to-SIP dialing.  In a regular SIP-to-SIP call, the media path is established through IP when the call is answered (200 OK response sent by the callee).  Also, with regular SIP-to-SIP dialing, there usually is an IP data path created between the calling User Agent (UA) and the called UA.  In the speech link framework using the PSTN, no IP data path is created when the connection is established.

Speech link addresses

To establish a speech link, the initiator must know the “speech link address” of the intended recipient, identified by a speech link URI, which may be the same as the recipient’s SIP URI.  Typically, this is of the form sip:user@host, or sip:speechlink@host, where the first part is either an individual or group name or telephone number, and the second part is either a domain name or a numeric network address.
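As a rough illustration of the address form above, the following Python sketch splits a speech link address into its two parts.  The names SpeechLinkAddress and parse_speech_link_address are hypothetical and not part of the protocol, and the check is much looser than full SIP URI parsing (ports, parameters and passwords are not handled).

```python
from typing import NamedTuple


class SpeechLinkAddress(NamedTuple):
    user: str  # individual or group name, or telephone number
    host: str  # domain name or numeric network address


def parse_speech_link_address(uri: str) -> SpeechLinkAddress:
    """Split a URI of the form sip:user@host into its two parts."""
    scheme, colon, rest = uri.partition(":")
    if not colon or scheme.lower() != "sip":
        raise ValueError("not a sip: URI: " + uri)
    user, at, host = rest.partition("@")
    if not at or not user or not host:
        raise ValueError("expected sip:user@host, got: " + uri)
    return SpeechLinkAddress(user, host)
```

For example, parse_speech_link_address("sip:UserB@there.com") yields the user part "UserB" and the host part "there.com".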

Call Setup - Simple

The simplest form of call setup can be used to establish a speech link to any directly dialed telephone number on the PSTN that has been associated with a specific speech link address.

A normal call follows the sequence of messages shown in the state transition diagram below.


1.        The initiator sends a SIP INVITE, optionally including application data in the message body of the SIP message. 

2.        The recipient returns a provisional SIP response (183 Session Progress).  The recipient must return an E.164 dial number (DN) in the body of the response.

3.        When the initiator receives a provisional return-code that contains the DN, it dials the number specified in the body of the message and connects the voice path of this call leg to the caller.

4.        The recipient detects ring and seizes the line.  Before or after seizing the line it must send the SIP 200 OK response, indicating the success result for SIP. 

5.        The initiator waits for both a return-code 200 and detection of the line being answered using call progress detection. After the initiator detects both events, it immediately sends an ACK to acknowledge the call set-up.  When the recipient receives the ACK the call is fully set-up.
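The condition in step 5, where the initiator acknowledges only after both the 200 OK and the answer detection have occurred, can be sketched as follows.  This is a minimal illustration in Python, not a reference implementation: the class and method names are hypothetical, and the SIP stack and telephony hardware are assumed to call these methods when the corresponding events occur.

```python
class SimpleSetupInitiator:
    """Tracks the two events the initiator must see before sending ACK."""

    def __init__(self):
        self.got_200_ok = False
        self.line_answered = False
        self.ack_sent = False

    def on_200_ok(self):
        # Called when the SIP 200 OK response arrives from the recipient.
        self.got_200_ok = True
        return self._maybe_ack()

    def on_line_answered(self):
        # Called when call progress detection reports that the line was answered.
        self.line_answered = True
        return self._maybe_ack()

    def _maybe_ack(self):
        # The ACK is produced exactly once, after both events have been seen.
        if self.got_200_ok and self.line_answered and not self.ack_sent:
            self.ack_sent = True
            return "ACK"
        return None
```

Because the recipient may send the 200 OK before or after seizing the line, the sketch accepts the two events in either order.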


Limitations

This simple protocol does not ensure that the user data in the message body is associated with the PSTN call that the recipient answers.  A caller who is not using speech link might dial the DN and ring the line between steps 2 and 4 of the sequence above.

The caller will also hear all the progress signaling of the call leg being set up – busy signals, ringing, Special Information Tones, etc.

In some network environments, call progress detection is dependent on the initiator recognizing something that sounds like speech or an answering machine on the recipient side.  Should it fail to detect this, it may not be able to determine when, or even if the call has definitely been completed successfully.  This may affect the accuracy of billing and reporting, and may even leave a caller listening to “dead air” indefinitely.

Call Setup – With Handshake

This more complex form of call setup eliminates the limitations of the simple method described above, but requires coordination by the recipient of the SIP and PSTN connections.  It also supports managing a group of incoming lines with a single dialed number.  A normal call using this method looks like this:



1.         The initiator sends a SIP INVITE, optionally including application data in the message body of the SIP message (same as above)

2.        The recipient returns a provisional SIP response (183 Session Progress).  The recipient must return an E.164 dial number (DN) in the body of the response, and must also supply a sequence number (SN) that is temporally unique for the recipient speech link address.  The presence or absence of the SN header distinguishes the type of call setup the recipient supports.

3.        The initiator dials the DN specified in the body of the message, but does NOT connect the voice path of this call leg to the caller yet.

4.        The recipient detects ring and seizes the line, and sends a DTMF # on the line.

5.        The initiator waits for detection of the line being answered, using call progress detection, and for the DTMF #.  After the initiator detects the #, it sends the SN followed by # (SN#) as DTMF down the telephone line.

6.        After receiving a valid SN#, the recipient must send the SIP 200 OK response, indicating the success result for SIP.

7.        When it receives the 200 OK, the initiator connects the caller voice path and immediately sends an ACK to acknowledge the call set-up.

Initiators must support both the simple call setup, and the handshake version.
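The recipient side of the handshake can be sketched as a small state machine.  This is only an illustration under stated assumptions: the class name, state names and return values are hypothetical, DTMF keys are assumed to arrive one at a time from the telephony hardware, and the time limits a real recipient would need are omitted.

```python
class HandshakeRecipient:
    """Recipient side of the DTMF handshake for one pending call setup."""

    def __init__(self, expected_sn: str):
        self.expected_sn = expected_sn  # the SN returned in the 183 response
        self.digits = []
        self.state = "WAIT_RING"

    def on_ring(self) -> str:
        # Seize the line and signal readiness with a DTMF #.
        self.state = "WAIT_SN"
        return "#"

    def on_dtmf(self, key: str):
        # Collect digits until the terminating #, then compare with the SN.
        if self.state != "WAIT_SN":
            return None
        if key != "#":
            self.digits.append(key)
            return None
        if "".join(self.digits) == self.expected_sn:
            self.state = "CONNECTED"
            return "200 OK"  # a valid SN# was received
        self.state = "ABORTED"  # wrong or missing sequence number
        return None
```

Keeping one such object per outstanding SN lets a recipient serve a group of incoming lines behind a single dialed number, since the SN rather than the line identifies the call.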


Limitations

This protocol is only suitable for automated systems at both the initiator and recipient.  Should a human caller reach the designated DN directly, they will hear only a DTMF #.  The recipient could combine this signal with a voice prompt message, and perhaps transfer the call if no SN# is received within a specific time limit.

Call Termination

During an active speech link call, if the caller terminates the connection (hangs up), the initiator must disconnect the PSTN link to the recipient, and send an SIP BYE request.  The voice path disconnect and the BYE request may occur in either order.  The recipient must respond with a 200 OK message, and of course terminate the PSTN connection on its side as well. If the recipient wishes to return application information, it must issue a BYE of its own prior to sending the 200 OK response; the initiator must in turn respond to this BYE with a 200 OK.

The recipient can also send a BYE request to the initiator at any time during an active speech link, and drop its PSTN connection.  The BYE and the PSTN disconnect can occur in either order.  The initiator must respond with a 200 OK, but need not send its own BYE.

Call Transfer

After a call has been set up and is active, a speech link recipient (the Transferor) can request the initiator (the Transferee) to transfer the call to another speech link recipient (the Transfer Target).  This is an alternate way of terminating the call leg to the Transferor.  It will often be more cost-effective than the recipient setting up a second call as an initiator and bridging the two call legs.

Transferring a call from one speech link enabled site to another speech link enabled site is accomplished using the REFER, INVITE and ACK messages, as defined in the July 2000 issue of the SIP Call Control Transfer Internet Draft.  The steps in a successful transfer are shown below:

1.        The Transferor starts the transfer by putting the Transferee on hold.  It does so by sending an SIP INVITE request with a modified session description.  In keeping with the SIP convention (B.6 of RFC 2543), the session description is the same as in the original invitation, but the DN is set to 0.  It may optionally include application data that will be forwarded on to the Transfer Target.

2.        The Transferee responds with the SIP 200-OK Response and suspends the voice path to the Transferor.

3.        The Transferor sends an ACK to complete the INVITE sequence, and then sends a REFER request specifying the speech link address of the Transfer Target.

4.        The Transferee sends an SIP 100-Trying Response to the Transferor, and sends an INVITE to the Transfer Target to begin a Call Setup.  The INVITE must include all call data provided by the Transferor in Step 1, and optionally additional data provided by the Transferee.  The rest of the Call Setup sequence proceeds as described above.

5.        When the Call Setup is complete, the Transferee sends a 200-OK Response to the Transferor, and initiates a Call Termination sequence with a BYE Request.  The rest of this sequence continues as described above.
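The Transferee's part in the steps above can be summarized in a short sketch.  This is an assumption-laden illustration rather than a SIP implementation: the class and method names are hypothetical, messages are represented as plain (destination, message, body) tuples, the ACK in step 3 is not modeled, and the Call Setup with the Transfer Target is reduced to a single INVITE.

```python
class TransfereeSketch:
    """Records what the Transferee sends at each step of a transfer."""

    def __init__(self):
        self.on_hold = False
        self.transfer_data = ""

    def on_hold_invite(self, app_data: str = ""):
        # Steps 1-2: hold INVITE received; suspend the voice path and answer 200 OK.
        self.on_hold = True
        self.transfer_data = app_data
        return [("transferor", "200 OK", "")]

    def on_refer(self, target_address: str, extra_data: str = ""):
        # Steps 3-4: REFER received; acknowledge it and invite the Transfer Target,
        # forwarding all data from step 1 plus any data added by the Transferee.
        if not self.on_hold:
            raise RuntimeError("REFER received before the hold INVITE")
        body = self.transfer_data + extra_data
        return [("transferor", "100 Trying", ""), (target_address, "INVITE", body)]

    def on_target_setup_complete(self):
        # Step 5: report success to the Transferor, then terminate that call leg.
        return [("transferor", "200 OK", ""), ("transferor", "BYE", "")]
```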


Limitations

Only a speech link recipient currently active on a call can initiate a Transfer.

Speech links do not support the Consultation Hold or Protected Transfer Target methods of using the SIP Transfer Request.

Call Transfer with Return

The Transferor may optionally request that the call be returned when the Transfer Target would otherwise terminate the call.   To do so, it simply specifies a speech link address to return the call to as part of the INVITE (hold) Request that initiates the transfer.  [Although it might be clearer to specify this option as part of the REFER Request, the current RFC does not support a message body as part of this message.]

This speech link “Return Address” must be passed to the Transfer Target by the Transferee to inform it that a transfer has been requested; the rest of the protocol for setting up the transfer is the same as the simple case above.

The Transferee is responsible for initiating a Call Setup with the Transferor when the call with the Transfer Target is terminated, even if the PSTN connection to the caller has also been disconnected.  The SIP Call-ID remains the same as in the initial Call Setup, and additional or changed application data may be sent in the INVITE, including all application data provided by the Transfer Target as part of its BYE Request.

In the case of reporting a disconnect, the Transferee sets the CO (connected) header in the message body of the return INVITE to “FALSE”.  If the header is absent, the value “TRUE” is assumed.

Should the returning Call Setup fail, the Transferee may continue the call with the caller, or may disconnect the caller.

Message Formats

In SIP, data that is standard to SIP, such as the Call-ID, is sent in the message header.  Speech link specific data, such as the DN or the SN, and application data are passed in the message body.

Message Body

SIP uses MIME to describe the message body.  In speech links, the message body is communicated using the MIME type “application/speechlink”.  So, following SIP, speech link messages always define the message body with the following header fields:


                Content-Type: application/speechlink

                Content-Length: <length>


An empty line (CR LF) indicates the end of the SIP header fields, which is followed by <length> bytes of message body.
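The framing described above can be sketched in a few lines of Python.  The function name frame_speechlink_body is hypothetical; the sketch only produces the two content header fields and the separating empty line, and assumes the remaining SIP header fields are written by the SIP stack.

```python
def frame_speechlink_body(body: bytes) -> bytes:
    """Prefix a speech link message body with its SIP content header fields."""
    headers = (
        b"Content-Type: application/speechlink\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    )
    # The empty line (CR LF) ends the SIP header fields; the body follows.
    return headers + b"\r\n" + body
```

Computing Content-Length from the encoded bytes, rather than from a character count, keeps the value correct if the application data contains non-ASCII text.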

Speech Link Headers


ANI

Optional E.164 format number identifying the caller’s originating telephone number.  Sent in: INVITE Request.

DN

E.164 format telephone number to dial to establish the voice path.  Sent in: 183-Session Progress Response.

SN

Sequence number.  Should be as few digits as possible to minimize call setup time, but allow enough values to avoid having multiple calls being set up with the same sequence number.  Sent in: 183-Session Progress Response.

RT

Optional speech link address to which the call is to be returned following termination by the Transfer Target.  Sent in: Transfer request.

CO

Indicates whether the caller is connected or not on a return from a transfer; value is TRUE (assumed if missing), or FALSE.  Sent in: INVITE Request (back to initial Transferor).

SpeechCookie

Optional SpeechCookie header reserved for use by the Speech Media Alliance.  Sent in: INVITE, BYE Requests.


Speech link silently ignores other headers.  This allows applications built on top of speech link to communicate information through the same mechanism.

An empty line (CRLF) indicates the end of the speech link header fields, which is then followed by application-specific data.  The format of this data is application-specific.

A typical INVITE message body would look like:

Content-Type: application/speechlink
Content-Length: 97

ANI: 16174284444
Content-Type: text/plain
Content-Length: 34

This is application specific data.


A typical message body of a Response to an INVITE would look like:

Content-Type: application/speechlink
Content-Length: 23

DN: 12125156368
SN: 123
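A receiver could separate the speech link headers from the application-specific data along the following lines.  This is a minimal sketch, not a definitive parser: the function name is hypothetical, it assumes CR LF line endings and exact-case header names (the text above does not say whether names are case-sensitive), and it keeps unrecognized headers aside for the application instead of discarding them.

```python
SPEECH_LINK_HEADERS = {"ANI", "DN", "SN", "RT", "CO", "SpeechCookie"}


def parse_speechlink_body(body: str):
    """Return (speech link headers, other headers, application data)."""
    # The first empty line ends the header fields; the rest is application data.
    head, _, app_data = body.partition("\r\n\r\n")
    known, other = {}, {}
    for line in head.split("\r\n"):
        name, colon, value = line.partition(":")
        if not colon:
            continue  # not a header line; skipped in this sketch
        name, value = name.strip(), value.strip()
        if name in SPEECH_LINK_HEADERS:
            known[name] = value
        else:
            other[name] = value  # ignored by speech link, kept for the application
    return known, other, app_data
```

Applied to the INVITE body above, the sketch would place ANI among the speech link headers and the inner Content-Type and Content-Length among the other headers, which describe the application data that follows the empty line.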

Sample Call Setup Session

The following shows a simple example of the complete SIP messages used to set up a call.

1.         Initiator invites Recipient:

INVITE sip:UserB@there.com SIP/2.0
From: User A <sip:UserA@here.com>
To: User B <sip:UserB@there.com>
Call-ID: 12345600@here.com
Cseq: 1 INVITE
Contact: User A <sip:UserA@here.com>
Content-Type: application/speechlink
Content-Length: 107

ANI: 15143234567
Content-Type: text/html
Content-Length: 45

<body>Invited by User A</body>

2.        Recipient sends a provisional response with DN and SN:

SIP/2.0 183 Session Progress
From: User A <sip:UserA@here.com>
To: User B <sip:UserB@there.com>
Call-ID: 12345600@here.com
Cseq: 1 INVITE
Content-Type: application/speechlink
Content-Length: 23

SN: 123
DN: 16173436781

3.        Initiator dials the provided DN.

4.        Recipient answers the phone call and pulses a DTMF #.

5.        Initiator pulses the SN#.

6.        Recipient sends a final response:

SIP/2.0 200 OK
From: User A <sip:UserA@here.com>
To: User B <sip:UserB@there.com>
Call-ID: 12345600@here.com
Cseq: 1 INVITE
Contact: User B <sip:UserB@there.com>
Content-Length: 0

7.        Initiator sends the ACK request.

ACK sip:UserB@there.com SIP/2.0
From: User A <sip:UserA@here.com>
To: User B <sip:UserB@there.com>
Call-ID: 12345600@here.com
Cseq: 1 ACK
Content-Length: 0


Summary of Recipient Requirements

The speech link protocol is designed to keep the implementation requirements for a recipient as simple as possible to encourage widespread adoption.  This section summarizes these minimum requirements, which should be possible with a small application using a minimally compliant SIP server (A.2 of RFC 2543).

In the following diagrams, each state transition is labeled by an event that causes the transition (above the horizontal line), and the message that is transmitted in response (below the line).

Call Setup

The recipient is responsible for aborting the call setup if it does not receive an incoming call within timer T3 of the INVITE Request.

In the following case, where the recipient chooses to use the sequence number, it must also abort the setup if the sequence number is not provided in-band via DTMF, or if it does not match the one the recipient supplied in its 183-Session Progress Response.
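One way a recipient might keep track of outstanding call setups is sketched below.  It is an illustration under several assumptions: the class name is hypothetical, the value of T3 and the number of SN digits are left as parameters because this document does not fix them, and the current time is passed in explicitly so that the logic is easy to test.

```python
import itertools


class PendingSetups:
    """Hands out sequence numbers and matches them against DTMF input."""

    def __init__(self, t3_seconds: float, digits: int = 3):
        self.t3 = t3_seconds
        self.digits = digits
        self._counter = itertools.count()
        self._pending = {}  # SN -> (Call-ID, deadline)

    def allocate(self, call_id: str, now: float) -> str:
        # Pick an SN that no other pending setup is using (temporally unique).
        self._expire(now)
        limit = 10 ** self.digits
        for _ in range(limit):
            sn = str(next(self._counter) % limit).zfill(self.digits)
            if sn not in self._pending:
                self._pending[sn] = (call_id, now + self.t3)
                return sn
        raise RuntimeError("no free sequence numbers")

    def match(self, sn: str, now: float):
        # Return the Call-ID for a valid, unexpired SN; otherwise None (abort).
        self._expire(now)
        entry = self._pending.pop(sn, None)
        return entry[0] if entry else None

    def _expire(self, now: float) -> None:
        # Setups whose incoming call did not arrive within T3 are dropped.
        for sn in [s for s, (_, deadline) in self._pending.items() if deadline <= now]:
            del self._pending[sn]
```

A small digit count keeps the DTMF exchange short, while the pending table prevents two simultaneous setups from sharing a sequence number.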


Call Termination

Call Termination is straightforward.  The recipient may optionally supply application data with the BYE Request; if it chooses not to do so, it need only respond to a BYE request from the initiator.  For a simple implementation, always generate a BYE in response to a detected line drop, or in response to a BYE from the initiator.  In either case, the recipient needs to clean up a call if a 200-OK Response is not received from the initiator in a reasonable time.




Transferring a Call

Receiving a call transfer as the Transfer Target is the same as being the recipient of a Call Setup.  By examining the RT header in the message body, the recipient can determine if a return was requested, but need not do anything with this information.

Initiating a transfer as the Transferor is an optional feature for a recipient.  If implemented with the Return capability, the recipient must maintain a list of Call-IDs for which returns are expected, and match these when invitations are received.  The recipient is also responsible for “timing these out” after some extended period of time when it concludes that the required INVITE has been lost by the Transferee.  It must also detect the CO header in the return INVITE, and if it is set to FALSE, immediately refuse the connection.
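The bookkeeping a Transferor needs for the Return capability might look like the following sketch.  The class name, method names and the "accept", "refuse" and "unknown" results are hypothetical, and the timeout is a parameter because this document does not specify its length.

```python
class ExpectedReturns:
    """Call-IDs for which the Transferor expects a return INVITE."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self._deadlines = {}  # Call-ID -> time after which the return counts as lost

    def expect(self, call_id: str, now: float) -> None:
        self._deadlines[call_id] = now + self.timeout

    def expire(self, now: float) -> list:
        # Drop and report returns whose INVITE is presumed lost by the Transferee.
        lost = [cid for cid, deadline in self._deadlines.items() if deadline <= now]
        for cid in lost:
            del self._deadlines[cid]
        return lost

    def on_return_invite(self, call_id: str, co_header, now: float) -> str:
        self.expire(now)
        if self._deadlines.pop(call_id, None) is None:
            return "unknown"  # not an expected return
        # CO is assumed to be TRUE when the header is missing.
        connected = (co_header or "TRUE").strip().upper() != "FALSE"
        return "accept" if connected else "refuse"
```

An INVITE reported as "unknown" is not necessarily an error; in this sketch it simply means the Call-ID is not on the return list, so the recipient would presumably treat it as an ordinary Call Setup.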


Summary of Initiator Requirements

Call Setup

The initiator must support call progress detection, implementing at a minimum a timer T2 that expires after it has placed a call but not yet received indication of an answer from the recipient.  In addition, it must time out its initial INVITE after T1.  In the diagram below, if the “SN” header is not present in the message body, the “Connecting” state proceeds directly to “Completed” upon receipt of the 200-OK Response.



Call Termination

Call termination requirements for the initiator are the same as for the recipient.

Call Transfer

The initiator has significant additional responsibilities as the Transferee in supporting Call Transfer.  In addition to supporting the protocol described above, the Transferee must also maintain a return stack of at least 32 nested return locations for Transferors who request returns.  In case of a normal termination by a Transfer Target, it must set up a call to the top of this return stack.  In the case of a disconnect by the caller, it must loop through all return locations, and execute a Call Setup with the value of the CO header set to FALSE.
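The return stack can be sketched as follows.  This is a minimal illustration, not a prescribed design: the class and method names are hypothetical, each result is an (address, CO value) pair describing a Call Setup to perform, and visiting the return locations from the top of the stack downward on a caller disconnect is an assumption, since the order is not specified here.

```python
class ReturnStack:
    """Return locations requested by Transferors, most recent on top."""

    MIN_DEPTH = 32  # the minimum nesting the Transferee must support

    def __init__(self, max_depth: int = MIN_DEPTH):
        if max_depth < self.MIN_DEPTH:
            raise ValueError("at least 32 nested return locations are required")
        self.max_depth = max_depth
        self._addresses = []

    def push(self, return_address: str) -> None:
        if len(self._addresses) >= self.max_depth:
            raise OverflowError("return stack is full")
        self._addresses.append(return_address)

    def on_target_terminated(self):
        # Normal termination: set up a call to the top of the stack, caller connected.
        if not self._addresses:
            return None
        return (self._addresses.pop(), "TRUE")

    def on_caller_disconnected(self):
        # Caller hung up: notify every return location with CO set to FALSE.
        setups = [(address, "FALSE") for address in reversed(self._addresses)]
        self._addresses.clear()
        return setups
```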