Internet Engineering Task Force                            H. Schulzrinne
INTERNET-DRAFT                                      AT&T Bell Laboratories
                                                           October 27, 1992
Expires: 4/1/93
A Transport Protocol for Audio and Video Conferences and other
Multiparticipant Real-Time Applications
Status of this Memo
This document is an Internet Draft. Internet Drafts are working documents
of the Internet Engineering Task Force (IETF), its Areas, and its Working
Groups. Note that other groups may also distribute working documents as
Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six months.
Internet Drafts may be updated, replaced, or obsoleted by other documents
at any time. It is not appropriate to use Internet Drafts as reference
material or to cite them other than as a "working draft" or "work in
progress."
Please check the I-D abstract listing contained in each Internet Draft
directory to learn the current status of this or any other Internet Draft.
Distribution of this document is unlimited.
Abstract
This draft discusses aspects of transporting real-time services
such as voice and video over the Internet. It compares and
evaluates design alternatives for a proposed real-time transport
protocol. Appendices touch on issues of port assignment and
multicast address allocation.
Acknowledgments
This draft is based on discussion within the AVT working group chaired by
Stephen Casner. Eve Schooler and Stephen Casner provided valuable comments.
This work was supported in part by the Office of Naval Research under
contract N00014-90-J-1293, the Defense Advanced Research Projects Agency
under contract NAG2-578 and a National Science Foundation equipment grant,
CERDCR 8500332.
Contents
   1  Introduction
   2  Goals
   3  Services
      3.1   Framing
      3.2   Version Identification
      3.3   Conference Identification
            3.3.1  Demultiplexing
            3.3.2  Aggregation
      3.4   Media Encoding Identification
            3.4.1  Audio Encodings
            3.4.2  Video Encodings
      3.5   Playout Synchronization
            3.5.1  Synchronization Method
            3.5.2  End-of-talkspurt indication
            3.5.3  Recommendation
      3.6   Segmentation and Reassembly
      3.7   Source Identification
            3.7.1  Gateways, Reflectors and End Systems
            3.7.2  Address Format Issues
      3.8   Energy Indication
      3.9   Error Control
      3.10  Security
            3.10.1 Encryption
            3.10.2 Authentication
      3.11  Quality of Service Control
   4  Conference Control Protocol
   5  Packet Format
      5.1   Data
      5.2   Control Packets
   A  Port Assignment
   B  Multicast Address Allocation
   C  Glossary
   D  Address of Author
1 Introduction
The real-time transport protocol (RTP) discussed in this draft aims to
provide services commonly required by interactive multimedia conferences, in
particular playout synchronization, demultiplexing, media identification and
active-party identification. However, RTP is not restricted to multimedia
conferences; it is anticipated that other real-time services such as remote
data acquisition and control may find its services of use.
In this context, a conference describes associations that are characterized
by the participation of two or more agents, interacting in real time
with one or more media of potentially different types. The agents are
anticipated to be human, but may also be measurement devices, remote media
servers, simulators and the like. Both two-party and multiple-party
associations are to be supported, where one or more agents can take active
roles, i.e., generate data. Thus, applications not commonly considered a
conference fall under our wider definition, for example, one-way media such
as the network equivalent of closed-circuit television or radio, traditional
two-party telephone conversations or real-time distributed simulations.
Even though intended for real-time interactive applications, the use of
RTP for the storage and transmission of recorded real-time data should be
possible, with the understanding that the interpretation of some fields such
as timestamps may be affected by this off-line mode of operation.
RTP uses the services of an end-to-end transport protocol such as UDP,
TCP, OSI TPx, ST-II [1, 2] or the like(1). The services used are:
end-to-end delivery, framing, demultiplexing and multicast. The underlying
network is not assumed to be reliable and can be expected to lose, corrupt,
arbitrarily delay and reorder packets. However, the use of RTP within
quality-of-service (e.g., rate) controlled networks is anticipated to be of
particular interest. Network layer support for multicasting is desirable,
but not required. RTP is supported by a real-time control protocol (RTCP)
in a relationship similar to that between IP and ICMP. However, RTP can
function, with reduced functionality, without a control protocol. The control
protocol provides the minimum functionality needed to maintain conference
state for a single medium. It is not guaranteed to be reliable and is assumed
to be multicast to all participants of a conference.
Conferences encompassing several media are managed by a (reliable)
conference control protocol, whose definition is outside the scope of this
note. Some aspects of its functionality, however, are described in
Section 4.
Within this working group, some common encoding rules and algorithms for
media should be specified, keeping in mind that this aspect is largely
independent of the remainder of the protocol. Without this specification,
interoperability cannot be achieved. It is suggested, however, to keep
the two aspects as separate RFCs as changes in media encoding should be
independent of the transport aspects. The encoding specification should
include things such as byte order for multi-byte samples, sample order
for multi-channel audio, the format of state information for differential
encodings, the segmentation of encoded video frames into packets, and the
like.
As part of this working group (or the conference architecture BOF/working
group), some number assignment issues will have to be addressed, in
particular for encoding formats, port and address usage. The issue of
port assignment will be discussed in more detail in Appendix A. It should
be emphasized, however, that UDP port assignment does not imply that all
underlying transport mechanisms share this or a similar port mechanism.
This draft aims to summarize some of the discussions held within the AVT
working group chaired by Stephen Casner, but the opinions are the author's
own. Where possible, references to previous work are included, but
the author realizes that the attribution of ideas is far from complete.
------------------------------
1. ST-II is not properly a transport protocol, as it is visible to
intermediate nodes, but it provides services such as process demultiplexing
commonly associated with transport protocols.
The draft builds on operational experience with Van Jacobson's and Steve
McCanne's vat audio conferencing tool as well as implementation experience
with the author's Nevot network voice terminal. This note will frequently
refer to NVP [3], the network voice protocol, the only such protocol
currently specified through an RFC within the Internet. CCITT has
standardized as recommendations G.764 and G.765 a packet voice protocol
stack for use in digital circuit multiplication equipment.
The name RTP was chosen to reflect the fact that audio-visual conferences
may not be the only applications employing its services, while the
real-time nature of the protocol is important, setting it apart from other
multimedia-transport mechanisms, such as the MIME multimedia mail effort
[4].
The remainder of this draft is organized as follows. Section 2 summarizes
the design goals of this real-time transport protocol. Then, Section 3
describes the services to be provided in more detail. Section 4 briefly
outlines some of the services added by the conference control protocol; a
more detailed description is outside the scope of this document. Given
the required services and design goals, Section 5 outlines possible packet
formats for RTP and RTCP. Two appendices discuss the issues of port
assignment and multicast address allocation, respectively. A glossary
defines terms and acronyms, providing references for further detail.
2 Goals
Design decisions should be measured against the following goals, not
necessarily listed in order of importance:
media flexibility: While the primary applications that motivate the
protocol design are conference voice and video, it should be
anticipated that other applications may also find the services provided
by the protocol useful. Some examples include distribution audio/video
(for example, the ``Radio Free Ethernet'' application by Sun) and some
forms of (loss-tolerant) remote data acquisition. Note that it may be
possible that different media interpret the same packet header field in
different ways (e.g., a synchronization bit may be used to indicate
the beginning of a talkspurt for audio and the beginning of a frame
for video). Also, new formats of established media, for example,
high-quality multi-channel audio, should be anticipated where possible.
extensible: Researchers and implementors within the Internet community
are currently only beginning to explore real-time multimedia services
such as audio-visual conferences. Thus, the RTP should be
able to incorporate additional services as operational experience
with the protocol accumulates and as applications not originally
anticipated find its services useful. The same mechanisms
should also allow experimental applications to exchange application-
specific information without jeopardizing interoperability with other
applications. Extensibility is also desirable as it will hopefully
speed along the standardization effort, making the consequences of
leaving out some group's favorite fixed header field less drastic.
It should be understood that extensibility and flexibility may conflict
with the goals of bandwidth and processing efficiency.
independent of lower-layer protocols: RTP should make as few assumptions
about the underlying transport protocol as possible. It should, for
example, work reasonably well with UDP, TCP, ST-II, OSI TP, VMTP and
experimental protocols, for example, protocols that support resource
reservation and quality-of-service guarantees. Naturally, not all
transport protocols are equally suited for real-time services; in
particular, TCP may introduce unacceptable delays over anything but
low-error-rate LANs. Also, protocols that deliver streams rather than
packets need additional framing services, as discussed in Section 3.1.
It remains to be discussed whether RTP may use services provided by the
lower-layer protocols for its own purposes (time stamps and sequence
numbers, for example).
The goal of independence from lower-layer considerations also affects
the issue of address representation. In particular, anything too
closely tied to the current IP 4-byte addresses may face early
obsolescence. However, the charter of the working group is short term,
so that longer term changes in the host addressing can legitimately be
ignored.
gateway-compatible: Operational experience has shown that RTP-level
gateways are necessary and desirable for a number of reasons. First,
it may be desirable to aggregate several media streams into a single
stream and then retransmit it with possibly different encoding, packet
size or transport protocol. A reflector that achieves multicasting
by user-level copying may be needed where multicast tunnels are
unavailable or the end-systems are not multicast-capable.
bandwidth efficient: It is anticipated that the protocol will be used in
networks with a wide range of bandwidths and with a variety of media
encodings. Despite increasing bandwidths within the national backbone
networks, bandwidth efficiency will continue to be important for
transporting conferences across 56 kb links, office-to-home high-speed
modem connections and international links. To minimize end-to-end
delay and the effect of lost packets, packetization intervals have to
be limited, which, in combination with efficient media encodings, leads
to short packet sizes. Generally, packets containing 16 to 32 ms of
speech are considered optimal [5, 6, 7]. For example, even with a
65 ms packetization interval, a 4800 b/s encoding produces 39 byte
packets. Current Internet voice experiments use packets containing
between 20 and 22.5 ms of audio, which translates into 160 to 180 bytes
of audio information coded at 64 kb/s. Video packets are typically
much longer, so that header overhead is less of a concern.
For UDP multicast (without counting the overhead of source routing as
currently used in tunnels or a separate IP encapsulation as planned),
IPv4 incurs 20 bytes and UDP an additional 8 bytes of header overhead,
not counting any datalink layer headers of at least 4 bytes. With
RTP header lengths between 4 and 8 bytes, the total overhead amounts
to between 36 and 40 (or more) bytes per audio or video packet. For
160-byte audio packets, the overhead of 8-byte RTP headers together
with UDP, IP and PPP headers is 25%. For low bitrate coding, packet
headers can easily double the necessary bit rate.
Thus, it appears that any fixed headers beyond eight bytes would
have to make a significant contribution to the protocol's capabilities
to outweigh the cost of standing in the way of running RTP applications over
low-speed links. The current fixed header lengths for NVP and vat are
4 and 8 bytes, respectively. It is interesting to note that G.764 has
a total header overhead, including the LAPD data link layer, of only 8
bytes, as the voice transport is considered a network-layer protocol.
The overhead is split evenly between layer 2 and 3.
Bandwidth efficiency can be achieved by transporting non-essential or
slowly changing protocol state in optional fields or in a separate
low-bandwidth control protocol. Also, header compression [8] may be
used.
international: Even now, audio and visual conferencing tools are used far
beyond the North American continent. It would seem appropriate to give
considerations to internationalization concerns, for example to allow
for the European A-law audio encoding and non-US-ASCII character sets
in textual data such as site identification.
processing efficient: At packet arrival rates on the order of 40 to 50
per second for a single voice or video source, per-packet processing
overhead may become a concern, particularly if the protocol is to
be implemented on other than high-end platforms. Multiplication and
division operations should be avoided where possible and fields should
be aligned to their natural size, i.e., an n-byte integer is aligned on
an n-byte multiple, where possible.
implementable: Given the anticipated lifetime and experimental nature of
the protocol, it must be implementable with current hardware and
operating systems. That does not preclude that hardware and OS geared
towards real-time services may improve the performance or capabilities
of the protocol, e.g., allow better intermedia synchronization.
3 Services
The services that may be provided by RTP are summarized below. Note that
not all services have to be offered. Services anticipated to be optional
are marked with an asterisk.
o framing (*)
o demultiplexing by conference/association (*)
o demultiplexing by media source
o demultiplexing by media encoding
o synchronization between source(s) and destination(s)
o error detection (*)
o encryption (*)
o quality-of-service monitoring (*)
In the following sections, we will discuss how these services are reflected
in the proposed packet header. Information to be conveyed within the
conference can be roughly divided into information that changes with every
data packet and other information that stays constant for longer time
periods. State information that does not change with every packet can be
carried in several different ways:
as a fixed part of the RTP header: This method is easiest to decode and
ensures state synchronization between sender and receiver(s), but can
be bandwidth inefficient or restrict the amount of state information to
be conveyed.
as a header option: The information is only carried when needed. It
requires more processing by the sending and receiving application. If
contained in every packet, it is also less bandwidth-efficient than the
first method.
within RTCP packets: This approach is roughly equivalent to header options
in terms of processing and bandwidth efficiency. Some means of
identifying when a particular option takes effect within the data
stream may have to be provided.
within conference control: The state information is conveyed when the
conference is established or when the information changes. As for RTCP
packets, a synchronization mechanism between data and control may be
required for certain information.
through a conference directory: This is a variant of the conference control
mechanism, with a (distributed) directory at a well-known location
maintaining state information about on-going or scheduled conferences.
Changing state information during a conference is probably more
difficult than with conference control as participants need to be told
to look at the directory for changed information. Thus, a directory
is probably best suited to hold information that will persist through
the life of the conference, for example, its multicast group, title and
organizer.
The first two methods are examples of in-band signaling, the others of
out-of-band signaling.
3.1 Framing
To satisfy the goal of transport independence, we cannot assume that the
lower layer provides framing. (Consider TCP as an example; it would
probably not be used for real-time applications except possibly on a local
network, but it may be used to distribute recorded audio or video
segments.) Thus, if and only if the underlying protocol does not provide
framing, the RTP packet is prefixed by a 16-bit byte count. The byte count
could also be used by mutual agreement if it is deemed desirable to carry
several RTP packets in a single TPDU for increased efficiency.
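As an illustration (not part of the proposed packet format; the helper
names are made up), a minimal sketch in C of such length-prefix framing
over a stream transport:

   /* Minimal sketch: 16-bit length-prefix framing for RTP packets carried
    * over a stream transport such as TCP.  Each packet is preceded by its
    * byte count in network byte order. */
   #include <stdint.h>
   #include <string.h>
   #include <stdio.h>

   /* Prepend a 16-bit byte count to one packet; returns bytes placed in 'out'. */
   size_t frame_packet(const uint8_t *pkt, uint16_t len, uint8_t *out)
   {
       out[0] = (uint8_t)(len >> 8);           /* high-order byte first */
       out[1] = (uint8_t)(len & 0xff);
       memcpy(out + 2, pkt, len);
       return (size_t)len + 2;
   }

   /* Extract the next packet from a stream buffer; returns its length, or 0
    * if the buffer does not yet hold a complete packet. */
   uint16_t deframe_packet(const uint8_t *buf, size_t avail, const uint8_t **pkt)
   {
       if (avail < 2)
           return 0;
       uint16_t len = (uint16_t)((buf[0] << 8) | buf[1]);
       if (avail < (size_t)len + 2)
           return 0;
       *pkt = buf + 2;
       return len;
   }

   int main(void)
   {
       uint8_t rtp[4] = { 0x01, 0x02, 0x03, 0x04 };  /* stand-in RTP packet */
       uint8_t stream[64];
       size_t n = frame_packet(rtp, (uint16_t)sizeof rtp, stream);

       const uint8_t *payload;
       uint16_t len = deframe_packet(stream, n, &payload);
       printf("framed %zu bytes, recovered a %u-byte packet\n",
              n, (unsigned)len);
       return 0;
   }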
3.2 Version Identification
Humility suggests that we anticipate that we may not get the first
iteration of the protocol right. In order to avoid ``flag days'' where
everybody shifts to a new protocol, a version identifier could ensure
continued interoperability. This is particularly important since UDP, for
example, does not carry a ``next protocol'' identifier. The difficulty in
interworking between the current vat and NVP protocols further affirms the
necessity of a version identifier. However, the version identifier can be
anticipated to be the most static of all proposed header fields. Since the
length of the header and the location and meaning of the option length field
may be affected by a version change, encoding the version within an optional
field is not feasible.
Putting the version number into the control protocol packets would make RTCP
mandatory and would make rapid scanning of conferences significantly more
difficult.
vat currently offers a 2-bit version field, while this capability is missing
from NVP. Given the low bit usage and their utility in other contexts (IP,
ST-II), it may be prudent to include a version identifier.
3.3 Conference Identification
A conference identifier (conference ID) could serve two mutually exclusive
functions: providing another level of demultiplexing or a means of
logically aggregating flows with different network addresses and port
numbers. vat specifies a 16-bit conference identifier.
3.3.1 Demultiplexing
Demultiplexing by RTP allows one association characterized by destination
address and port number to carry several distinct conferences. However,
this appears to be necessary only if the number of conferences exceeds the
demultiplexing capability available through (multicast) addresses and port
numbers.
Efficiency arguments suggest that combining several conferences or media
within a single multicast group is not desirable. Combining several
conferences or media within a single multicast address negates the bandwidth
efficiency afforded by multicasting. Also, applications that are not
interested in a particular conference or capable of dealing with particular
medium are still forced to handle the packets delivered for that conference
or medium. Consider as an example two separate applications, one for
audio, one for video. If both share the same multicast address and port,
being differentiated only by the conference identifier, the operating system
has to copy each incoming audio and video packet into two application
buffers and perform a context switch to both applications, only to have one
immediately discard the incoming packet.
Given that application-layer demultiplexing has strong negative efficiency
implications and given that multicast addresses are not an extremely
scarce commodity, there seems to be no reason to burden every application
with maintaining and checking conference identifiers for the purpose of
demultiplexing.
It is also not recommended to use this field to distinguish between
different encodings, as it would be difficult for the application to decide
whether a new conference identifier means that a new conference has arrived
or simply all participants should be moved to the new conference with a
different encoding. Since the encoding may change for some but not all
participants, we could find ourselves breaking a single logical conference
into several pieces, with a fairly elaborate control mechanism to decide
which conferences logically belong together.
3.3.2 Aggregation
Particularly within a network with a wide range of capacities, using
separate multicast groups for each media component of a conference allows
the media distribution to be tailored to the network bandwidths and end-system
capabilities. It appears useful, however, to have a means of identifying
groups that logically belong together, for example for purposes of time
synchronization.
A conference identifier used in this manner would have to be globally
unique. It appears that such logical connections would better be identified
as part of the control protocol by identifying all multicast addresses
belonging to the same logical conference, thereby avoiding the assignment of
globally unique identifiers.
3.4 Media Encoding Identification
This field plays a similar role to the protocol field in data link
or network protocols, indicating the next higher layer (here, the media
decoder) that the data is meant for. For RTP, this field would indicate the
audio or video or other media encoding. In general, the number of distinct
encodings should be kept as small as possible to increase the chance that
applications can interoperate. A new encoding should only be recognized
if it significantly enhances the range of media quality or the types of
networks conferences can be conducted over. The unnecessary proliferation
of encodings can be reduced by making reference implementations of standard
encoders and decoders widely available.
It should be noted that encodings may not be enumerable as easily as, say,
transport protocols. A particular family of related encoding methods may
be described by a set of parameters, as discussed below in the sections on
audio and video encoding.
Encodings may change during the duration of a conference. This may be
due to changed network conditions, changed user preference or because the
conference is joined by a new participant that cannot decode the current
encoding. If the information necessary for the decoder is conveyed
out-of-band, some means of indicating when the change is effective needs to
be incorporated. Also, the indication that the encoding is about to change
must reach all receivers reliably before the first packet employing the new
encoding. Each receiver needs to track pending changes of encodings and
check for every incoming packet whether an encoding change is to take effect
with this packet.
Conveying media encodings rapidly is also important to allow scanning of
conferences or broadcast media. A directory service could provide encoding
information for on-going conferences. This may not be sufficient, however,
unless all participants within a conference use the same encoding. Also,
the usual synchronization problems between transmitted data and directory
information apply.
There are at least two approaches to indicating media encoding, either
in-band or out-of-band:
conference-specific: Here, the media identifier is an index into a table
designating the approved or anticipated encodings (together with any
particular version numbers or other parameters) for a particular
conference or user community. The table can be distributed through
RTCP, a conference control protocol or some other out-of-band means.
Since the number of encodings used during a single conference is likely
to be small, the field width in the header can likewise be small.
Also, there is no need to agree on an Internet-wide list of encodings.
It should be noted that conveying the table of encodings through RTCP
forces the application to maintain a separate mapping table for each
sender as there can be no guarantee that all senders will use the same
table.
global: Here, the media identifier is an index into a global table
of encodings. A global list reduces the need for out-of-band
information. Transmitting the parameters associated with an encoding
may be difficult, however, if it has to be done within the header space
constraints of per-packet signaling.
To make detecting coder mismatches easier, encodings for all media should
be drawn from the same numbering space. To facilitate experimentation with
new encodings, a part of any global encoding numbering space should be
set aside for experimental encodings, with numbers agreed upon within the
community experimenting with the encoding, with no Internet-wide guarantee
of uniqueness.
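As an illustration of the conference-specific approach (the structure,
names and table entries below are assumptions for illustration, not part
of this draft), a minimal sketch of a per-sender encoding table:

   /* Minimal sketch: a per-sender table mapping a small encoding index
    * carried in each packet to a conference-specific encoding description,
    * announced out-of-band (e.g., via RTCP or conference control). */
   #include <stdio.h>

   struct encoding {
       const char *name;        /* e.g. "G.711 mu-law" */
       int sample_rate;         /* samples per second */
       int channels;
   };

   /* Table for one particular sender; the packet header then needs only a
    * small index into this table. */
   static const struct encoding sender_table[] = {
       { "G.711 mu-law PCM", 8000, 1 },
       { "G.721 ADPCM",      8000, 1 },
       { "LPC-10E",          8000, 1 },
   };

   const struct encoding *lookup_encoding(unsigned idx)
   {
       if (idx >= sizeof sender_table / sizeof sender_table[0])
           return NULL;                 /* unknown index: drop or report */
       return &sender_table[idx];
   }

   int main(void)
   {
       const struct encoding *e = lookup_encoding(1);
       if (e)
           printf("encoding 1: %s, %d Hz, %d channel(s)\n",
                  e->name, e->sample_rate, e->channels);
       return 0;
   }

Since each sender may announce a different table, a receiver has to keep
one such mapping per sender, as noted above.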
3.4.1 Audio Encodings
Audio data is commonly characterized by three independent descriptors:
encoding (the translation of one or more audio samples into a channel
symbol), the number of channels (mono, stereo) and the sampling rate.
Theoretically, sampling rate and encoding are (largely) independent. We
could, for example, apply µ-law encoding to any sampling rate even though
it is traditionally used with a rate of 8,000 Hz. In practical terms, it may
be desirable to limit the combinations of encoding and sampling rate to the
values the encoding was designed for.(2)
------------------------------
2. Given the wide availability of µ-law encoding and its low overhead,
using it with a sampling rate of 16,000 or 32,000 Hz might be quite
appropriate for high-quality audio conferences, even though there are other
encodings, such as G.722, specifically designed for such applications. Note
that the signal-to-noise ratio of µ-law encoding is about 38 dB, equivalent
to an AM receiver. The ``telephone quality'' associated with G.711 is due
primarily to the limitation in frequency response to the 200 to 3500 Hz
range.
Channel counts between 1 and 4 should be sufficient and can be encoded into
2 bits by encoding the channel count minus one.
The audio encodings listed in Table 1 appear particularly interesting,
even though the list is by no means exhaustive and does not include some
experimental protocols currently in use, for example a non-standard form
of LPC. The bit rate is shown per channel. ks/s, b/sample and kb/s
denote kilosamples per second, bits per sample and kilobits per second,
respectively. If sampling rates are to be specified separately, the values
of 8, 16, 32, 44.1, and 48 kHz suggest themselves, even though other
values (11.025 and 22.05 kHz) are supported on some workstations (the
Silicon Graphics audio hardware and the Apple Macintosh, for example).
Clearly, little is to be gained by allowing arbitrary sampling rates, as
conversion particularly between rates not related by simple fractions is
quite cumbersome and processing-intensive.
    Org.       Name      ks/s   b/sample   kb/s    description
    CCITT      G.711      8.0      8         64     µ-law PCM
    CCITT      G.711      8.0      8         64     A-law PCM
    CCITT      G.721      8.0      4         32     ADPCM
    Intel      DVI        8.0      4         32     ADPCM
    CCITT      G.723      8.0      3         24     ADPCM
    CCITT      G.726                                ADPCM
    CCITT      G.727                                ADPCM
    NIST/GSA   FS 1015    8.0                 2.4   LPC-10E
    NIST/GSA   FS 1016    8.0                 4.8   CELP
    NADC       IS-54      8.0                 7.95  VSELP
    CCITT      G.7xy      8.0                16     LD-CELP
               GSM        8.0                13     RPE-LPC
    CCITT      G.722      8.0                64     7 kHz, SB-ADPCM
                                             256    MPEG audio
                         32.0     16         512    DAT
                         44.1     16         705.6  CD, DAT playback
                         48.0     16         768    DAT record

            Table 1: Standardized and common audio encodings
3.4.2 Video Encodings
Common video encodings are listed in Table 2. Encodings with tunable rate
can be configured for different rates, but produce a fixed-rate stream.
The average bit rate produced by variable-rate codecs depends on the source
material.
    Org.          Name     rate                remarks
    CCITT         JPEG     tunable
    CCITT         MPEG     variable, tunable
    CCITT         H.261    tunable
    Bolter                 variable, tunable
    PictureTel             ??
    BBN           DVC      variable, tunable   block differences

                    Table 2: Common video encodings
3.5 Playout Synchronization
A major purpose of RTP is synchronization between the source and sink(s) of
a single medium. Note that this is to be distinguished from synchronization
between different media such as audio and video (lip sync). Sometimes the
two forms are referred to as intra-media and inter-media synchronization.
RTP concerns itself only with intra-media or playout synchronization,
although mechanisms such as timestamps may be necessary for inter-media
synchronization.
In connection with playout synchronization, we can group packets into
playout units, a number of which in turn form a synchronization unit. More
specifically, we define:
synchronization unit: A synchronization unit consists of one or more
playout units (see below) that, as a group, share a common fixed delay
between generation and playout of each part of the group. The delay
may change at the beginning of such a synchronization unit. The most
common synchronization units are talkspurts for voice and frames for
video transmission.
playout unit: A playout unit is a group of packets sharing a common
timestamp. (Naturally, packets whose timestamps are identical due
to timestamp wrap-around are not considered part of the same playout
unit.) For voice, the playout unit would typically be a single voice
segment, while for video a video frame could be broken down into
subframes, each consisting of packets sharing the same timestamp and
ordered by some form of sequence number.
All proposed synchronization methods require a timestamp. The timestamp
has to have a sufficient range that wrap-arounds are infrequent. It
is desirable that the range exceeds the maximum expected inactive (e.g.,
silence) period. Otherwise, special handling may be necessary in the case
of the sequence number/time stamp combination as the beginning of the next
active period could have a time stamp one greater than the last one, thus
masking the beginning of the talkspurt. The 10-bit timestamp used by NVP is
generally agreed to be too small as it wraps around after only 20.5 s (for
20 ms audio packets), while a 32-bit timestamp should serve all anticipated
needs, even if the timestamp is expressed in units of samples or other
sub-packet entities.
Three proposals as to the interpretation of the timestamp have been
advanced:
packet/frame: Each packetization or (video/audio) frame interval increments
the timestamp. This approach is very efficient in terms of processing
and bit-use, but cannot be used without out-of-band information if
the time interval of media ``covered'' by a packet varies from packet
to packet. This occurs for example with variable-rate encoders or
if the packetization interval is changed during a conference. This
interpretation of a timestamp is assumed by NVP, which defines a frame
as a block of PCM samples or a single LPC frame. Note that there
is no inherent necessity that all participants within a conference use
the same packetization interval. Local implementation considerations
such as available clocks may suggest other intervals. As another
example, consider a conference with feedback. For the lecture audio, a
long packetization interval may be desirable to better amortize packet
headers. For side chats, delays are more important, thus suggesting a
shorter packetization interval.(3)
sample: This method simply counts samples, allowing a direct translation
between time stamp and playout buffer insertion point. It is just
as easily computable as the per-packet timestamp. However, for some
media and encodings(4), it may not be quite clear what a sample is.
Also, some care must be taken at the receiver if incoming streams use
different sampling rates. This method is currently used by vat.
------------------------------
3. Nevot for example, allows each participant to have a different
packetization interval, independent of the packetization interval used by
Nevot for its outgoing audio. Only the packetization interval for outgoing
audio for all conferences must be the same.
4. Examples include frame-based encodings such as LPC and CELP. Here, given
that these encodings are based on 8,000 Hz input samples, the preferred
interpretation would probably be in terms of audio samples, not frames, as
samples would be used for reconstruction and mixing.
subset of NTP timestamp: 16 bits encode seconds relative to 0 o'clock,
January 1, 1900 (modulo 65536) and 16 bits encode fractions of a
second, with a resolution of approximately 15.2 µs, which is smaller
than any anticipated audio sampling or video frame interval. This
timestamp is the same as the middle 32 bits of the 64-bit NTP
timestamp [9]. It wraps around every 18.2 hours. If it should be
desirable to reconstruct absolute transmission time at the receiver for
logging or recording purposes, it should be easy to determine the most
significant 16 bits of the timestamp. Otherwise, wrap-arounds are not
a significant problem as long as they occur 'naturally', i.e., at a 16
or 32 bit boundary, so that explicit checking on arithmetic operations
is not required. Also, since the translation mechanism would probably
treat the timestamp as a single integer without accounting for its
division into whole and fractional part, the exact bit allocation
between seconds and fractions thereof is less important. However, the
16/16 approach simplifies extraction from a full NTP timestamp.
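As an illustration of this 16/16 format (a minimal sketch, not part of
the proposed protocol; the function name is made up), such a timestamp
can be derived from the Unix wall clock as follows:

   /* Minimal sketch: build the 32-bit "middle of NTP" timestamp (16 bits
    * of seconds modulo 65536, 16 bits of fractional seconds) from the Unix
    * clock.  2208988800 is the number of seconds between the NTP epoch
    * (1 Jan 1900) and the Unix epoch (1 Jan 1970). */
   #include <stdint.h>
   #include <stdio.h>
   #include <sys/time.h>

   uint32_t rtp_ntp16_16_now(void)
   {
       struct timeval tv;
       gettimeofday(&tv, NULL);

       uint32_t secs = (uint32_t)(tv.tv_sec + 2208988800UL);  /* NTP seconds */
       uint32_t frac = (uint32_t)(((uint64_t)tv.tv_usec << 16) / 1000000);

       /* low 16 bits of the seconds in the high half, fraction in the low half */
       return (secs << 16) | (frac & 0xffff);
   }

   int main(void)
   {
       printf("timestamp = 0x%08lx\n", (unsigned long)rtp_ntp16_16_now());
       return 0;
   }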
The NTP-like timestamp has the disadvantage that its resolution does
not map into any of the common sample intervals. Thus, there is
a potential uncertainty of one sample at the receiver as to where to
place the beginning of the received packet, resulting in the equivalent
of a one-sample slip. CCITT recommendation G.821 postulates a mean
slip rate of less than 1 slip in 5 hours, with degraded but acceptable
service for less than 1 slip in 2 minutes. Tests with appropriate
rounding conducted by the author showed that this most likely does
not cause problems. In any event, a double-precision floating
point multiplication is needed to translate between this timestamp and
the integer sample count available on transmission and required for
playout.(5)
------------------------------
5. The multiplication with an appropriate factor can be approximated
to the desired precision by an integer multiplication and division, but
multiplication by a floating point value is generally much faster on modern
processors.
It has been suggested to use timestamps relative to the beginning of
first transmission from a user. This makes correlation between media
from different participants difficult and seems to have no technical or
implementation advantages, except for avoiding wrap-around during most
conferences. As pointed out above, that seems to be of little benefit.
Clearly, the reliability of wallclock-synchronized timestamps depends
on how closely the system clocks are synchronized, but that does not
argue for giving up potential real-time synchronization in all cases.
It also needs to be decided whether the time stamp should reflect
real time or sample time. A real time timestamp is defined to track
wallclock time plus or minus a constant offset. Sample time increases
by the nominal sampling interval for each sample. The two clocks in
general do not agree since the clock source used for sampling will in
all likelihood be slightly off the nominal rate. For example, typical
crystals without temperature control are only accurate to 50 -- 100
ppm (parts per million), yielding a potential drift of 0.36 seconds per
hour between the sampling clock and wallclock time.
Using real time rather than sample time allows for easier
synchronization between different media and to compensate for slow or
fast sample clocks. Note that it is neither desirable nor necessary
to obtain the wall clock time when each packet was sampled. Rather,
the sender determines the wallclock time at the beginning of each
synchronization unit (e.g., a talkspurt for voice and a frame for
video) and adds the nominal sample clock duration for all packets
within the talkspurt to arrive at the timestamp value carried in
packets. The real time at the beginning of a talkspurt is determined
by estimating the true sample rate for the duration of the conference.
The sample rate estimate has to be accurate enough to allow placing
the beginning of a talkspurt, say, to within at most 50 to
100 ms, otherwise the lack of synchronization may be noticeable,
delay computations are confused and successive talkspurts may be
concatenated.
Estimating the true sampling instant to within a few milliseconds is
surprisingly difficult for current operating systems. The sample rate
r can be estimated as

    r = (s + q) / (t - t0)

Here, t is the current time, t0 the time the first sample was acquired,
s is the number of samples read and q is the number of samples ready to
be read (queued) at time t. Then, the timestamp to be inserted into the
synchronization packet is computed as t0 + s/r. Unfortunately, only s is
known precisely. The accuracy of the estimates for t0 and t depends on
how accurately the beginning of
sampling and the last reading from the audio device can be measured.
There is a non-zero probability that the process will get preempted
between the time the audio data is read and the instant the system
clock is sampled. It remains unclear whether indications of current
buffer occupancy, if available, can be trusted. Experiments with
the SunOS audio driver showed significant variations of the estimated
sample rate, with discontinuities of the computed timestamps of up to
25 ms. Kernel support is probably required for meaningful real time
measurements.
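A minimal sketch (illustration only, not part of the draft) of this
estimation; the talkspurt timestamp is taken here as t0 + s/r, one
plausible reading of the computation above, and all measurement values
are hypothetical:

   /* Minimal sketch: estimate the true sample rate from wallclock
    * measurements and use it to place the start of a talkspurt in real
    * time.  t0 is the wallclock time sampling started, t the current
    * time, s the samples read so far and q the samples still queued in
    * the device at time t. */
   #include <stdio.h>

   /* Estimated samples per second: r = (s + q) / (t - t0). */
   double estimate_rate(double t0, double t, long s, long q)
   {
       return (double)(s + q) / (t - t0);
   }

   /* Wallclock time corresponding to sample number s (e.g., the first
    * sample of a talkspurt): t0 + s / r. */
   double talkspurt_start(double t0, long s, double rate)
   {
       return t0 + (double)s / rate;
   }

   int main(void)
   {
       /* Hypothetical measurements: nominal 8000 Hz device, 10 s of capture,
        * with a slightly fast sample clock. */
       double t0 = 0.0, t = 10.0;
       long   s = 80060, q = 120;

       double r = estimate_rate(t0, t, s, q);
       printf("estimated rate: %.1f Hz\n", r);
       printf("talkspurt starting at sample 40000 began at t = %.4f s\n",
              talkspurt_start(t0, 40000, r));
       return 0;
   }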
Sample time increments with the sampling interval for every sample
or (sub)frame received from the audio or video hardware. It is
easy to determine, as long as care is taken to avoid cumulative
round-off errors incurred by simply repeatedly adding the approximate
packetization interval. However, synchronization between media and
end-to-end delay measurements are then no longer feasible. (Example:
Consider an audio and video stream. If the audio sample clock is
slightly faster than the real clock and the video sampling clock, a
video and audio frame belonging together would be marked by different
timestamps, thus played out at different instants.)
If we are forced to use sample time, the advantage of using an NTP
timestamp disappears, as the receiver can easily reconstruct an NTP
sample-based timestamp from the sample count if needed, but would not
have to if no cross-media synchronization is required. RTCP could
relate the time increment per sample in full precision.
It should be noted that it may not be possible to associate a
meaningful notion of time with every packet. For example, if a
video frame is broken into several fragments, there is no natural
timestamp associated with anything but the first fragment, particularly
if there is not even a sequential mapping from screen scan location
into packets. Thus, any timestamp used would be purely artificial.
A synchronization bit could be used in this particular case to mark
beginning of synchronization units. For packets within synchronization
units, there are two possible approaches: first, we can introduce an
auxiliary sequence number that is only used to order packets within a
frame. Secondly, we could abuse the timestamp field by incrementing
it by a single unit for each packet within the frame, thus allowing a
variable number of packets per frame. The latter approach is barely
workable and rather kludgy.
3.5.1 Synchronization Method
Timestamp/sequence number: This method is currently used by NVP. The
sequence number is incremented with every transmitted packet. For
audio, the beginning of a talkspurt is indicated when successive
packets differ in timestamp more than they differ in sequence number.
As long as packets are not reordered, determination of the beginning of
a talkspurt is generally easy, except for the unlikely case where a new
talkspurt has a time stamp that, due to timestamp wrap-around, is one
greater than the last packet of the previous talkspurt.
However, if packets are reordered, delay adaptation at the beginning
of a talkspurt becomes unreliable. Consider the scenario laid out in
Table 3. For convenience, the example assumes that clocks at the
transmitter and receiver are perfectly synchronized; also, timestamps
are expressed in wallclock time, increasing by 20 time units for each
packet. The current playout delay, that is, the jitter estimate, is
set at 50 time units and is assumed to stay constant throughout the
example. In the table, packet 210 is recognized as the beginning
of a new talkspurt if there has been no reordering. If packets 210
and 211 arrive in reverse transmission order, the receiver can only
conclude that packet 211 introduces a new talkspurt. Because the
wrong packet is treated as the beginning of a talkspurt, the playout
delay is really one packetization interval too short for the remainder
of the talkspurt. In the example, packet 212 arrives too late
and misses its playout time, even though it would have made playout
without reordering. This scenario assumes that packets are mixed in
at the time of arrival so that their playout time cannot be changed.
It is possible to relax that assumption and reschedule packets after
discovering that the wrong packet was used as the talkspurt beginning;
this, however, would seem to complicate the implementation greatly, as
determining how long the mixing is to be delayed cannot be readily
decided. Unfortunately, reordering at the beginning of a talkspurt is
particularly likely since common silence detection algorithms send a
group of packets to prevent front clipping.
                         no reordering           with reordering
    seq.   timestamp    arrival   playout       arrival   playout
    200      1020        1520      1570          1520      1570
    201      1040        1530      1590          1530      1590
    210      1220        1720      1790          1725      1770
    211      1240        1725      1810          1720      1790
    212      1260        1825      1830          1825      1810

     Table 3: Example where out-of-order arrival leads to packet loss
timestamp/synchronization bit: This method is currently used by vat.
Here, the beginning of a talkspurt is indicated by setting the
synchronization bit. A sequence number is not required. This
synchronization method is unaffected by out-of-order packet delivery.
If the first packet of a talkspurt is lost, two talkspurts are
simply merged, without dire consequences except for a missed chance
to have the playout delay reflect the delay jitter estimate. The
synchronization bit has to be ignored if a packet with a larger
timestamp has already arrived.
The insertion rule can thus be expressed as

    l1 = p + dmax                for n = 1
    ln = l1 + (tn - t1)          for n > 1                         (1)

where ln denotes the location within the playout buffer for packet n
within a talkspurt, tn the timestamp of packet n within a talkspurt,
p the current playout location (the read pointer) and dmax the
current estimated playout delay, that is, the estimated maximum delay
variation. All quantities are measured in appropriate units (time,
samples, or bytes). Addition is performed modulo the buffer size. (A
small illustration of this rule appears at the end of this section.)
The role of the synchronization bit for packet video remains to be
defined. It does not have to bear any relationship to the content,
e.g., frame structure of a packet video source, as it merely indicates
where delay can be varied without affecting perceived quality.
The disadvantage of this scheme is that it is impossible for the
receiver to get an accurate count of the number of packets that
it should have received. While gaps within a talkspurt give some
indication of packet loss, we cannot tell what part of the tail of a
talkspurt has been transmitted. (Example: consider the talkspurts
with time stamps 100, 101, 102, 110, 111, where packets with timestamp
100 and 110 have the synchronization bit set. At the receiver, we
have no way of knowing whether we were supposed to have received two
talkspurts with a total of five packets, or two or more talkspurts with
up to 12 packets.) We can overcome this difficulty by enhancing RTCP
as discussed in Section 3.11.
synchronization bit/sequence number within talkspurt: G.764 implements this
method. The sequence number zero is reserved for the first packet of
a talkspurt, while sequence numbers 1 through 15 are used for the
remaining packets within the talkspurt, wrapping around from 15 to 1,
if necessary. This is equivalent to the synchronization bit described
earlier. A sequence number gap also triggers a new talkspurt. The
scheme is designed for networks that cannot reorder packets. With
reordering, packets may easily be played out in the wrong order.
Consider, for example, packets with sequence numbers 0, 1, and 2. If
the packets arrive in the order 1, 2, 0, the receiver interprets
this as two talkspurts and plays the packets in the order received.
From the example, we can generalize that sequence numbers that number
packets within a talkspurt are not suitable for networks that can
reorder packets if used without timestamps.
G.764 also features a delay accumulator field, into which each node
adds the queueing and processing delay accumulated at that node. A
one-byte field is used to encode delays between 0 and 200 ms with a
resolution of 1 ms. The resolution of 1 ms suffices since the delay
estimate affects only the placement of the beginning of a talkspurt.
Note that the synchronization mechanism does not depend on this delay
value. The delay value does, however, allow the application to gauge
how congested the underlying network is. With a delay estimate,
equation (1) changes so that
    l1 = p + dmax - d1
Thus, the end-to-end delay is the maximum variable delay plus the fixed
delay, rather than the sum of estimated maximum variable delay, the
fixed delay and the variable delay experienced by the first packet in
the talkspurt. Thus, the end-to-end delay is lower without affecting
the late loss probability. The delay accumulator could be used for any
of the synchronization schemes described here.
Despite this benefit, its use within the Internet appears impossible,
as we cannot expect routers to update a field in an application layer
protocol like RTP.
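As referenced above, a minimal sketch (illustration only, not part of
the proposed protocol) of the insertion rule of equation (1) together
with the G.764 variant that subtracts the accumulated delay of the first
packet; the buffer size and the numeric values are assumptions:

   /* Minimal sketch: placement of packets in a circular playout buffer.
    * The first packet of a talkspurt is delayed by the playout-delay
    * estimate dmax; later packets are placed relative to the first by
    * their timestamp difference.  The G.764 variant also subtracts the
    * accumulated network delay d1 reported for the first packet. */
   #include <stdio.h>

   #define BUF_SIZE 8000   /* playout buffer length in samples (assumed) */

   /* p: current playout (read) pointer; dmax: estimated maximum delay
    * variation; l1: location of the first packet; t1, tn: timestamps of
    * the first and the current packet (all in samples). */
   long insert_first(long p, long dmax)        { return (p + dmax) % BUF_SIZE; }
   long insert_later(long l1, long t1, long tn){ return (l1 + (tn - t1)) % BUF_SIZE; }
   long insert_first_g764(long p, long dmax, long d1)
                                               { return (p + dmax - d1) % BUF_SIZE; }

   int main(void)
   {
       long p = 1000, dmax = 400;
       long l1 = insert_first(p, dmax);
       printf("first packet  -> %ld\n", l1);
       printf("second packet -> %ld\n", insert_later(l1, 1020, 1180));
       printf("G.764 variant -> %ld\n", insert_first_g764(p, dmax, 80));
       return 0;
   }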
3.5.2 End-of-talkspurt indication
An end-of-talkspurt indication is useful to distinguish silence from lost
packets. The receiver would want to replace silence by an appropriate
background noise level to avoid the ``noise-pumping'' associated with
silence detection. On the other hand, missing packets should be
reconstructed from previous packets. If the silence detector makes use
of hangover, the transmitter can easily set the end-of-talkspurt indicator
in the last hangover packet. If talkspurts follow each other back to
back, the end-of-talkspurt indicator has no effect except in the
case where the first packet of a talkspurt is lost. In that case, the
indicator would erroneously trigger noise fill instead of loss recovery.
The end-of-talkspurt indicator is implemented in G.764 as a ``more'' bit
which is set to one for all but the last packet within a talkspurt.
3.5.3 Recommendation
Given the ease of cross-media synchronization, the media independence
(except for the sub-frame aspect mentioned), the use of 32-bit 16/16
timestamps representing the middle part of the NTP timestamp is suggested.
Generally, a real-time based timestamp appears to be preferable to a
sample-based one, but it may not be realizable for some current operating
systems. Inter-media synchronization has to await mechanisms that can
accurately determine when a particular sample was actually received by
the A/D converter. Given the lower overhead and the ease of playout
reconstruction, a synchronization bit appears preferable to the sequence
number/time stamp combination. Since sequence numbers are useful for cases
where packets do not carry meaningful timing information and also ease loss
detection, they should be provided for, space permitting.
3.6 Segmentation and Reassembly
For high-bandwidth video, a single frame may not fit into the maximum
transmission unit (MTU). Thus, some form of frame sequence number is needed.
If possible, the same sequence number should be used for synchronization and
fragmentation. Four possibilities suggest themselves:
overload timestamp: No sequence number is used. Within a frame, the
timestamp has no meaning. Since it is used for synchronization only
when the synchronization bit is set, the other timestamps can just
increase by one for each packet. However, as soon as the first
frame gets lost or reordered, determining positions and timing becomes
difficult or impossible.
continuous: The sequence number is incremented without regard to frame
boundaries. If a frame consists of a variable number of packets,
it may not be clear what position the packet occupies within the
frame if packets are lost or reordered. Continuous sequence numbers
make it possible to determine if all packets for a particular frame
have arrived, but only after the first packet of the next frame,
distinguished by a new timestamp, has arrived.
within frame: Naturally, this approach has properties complementary to the
first.
continuous with first-packet option: Packets use a continuous sequence
number plus an option in every packet indicating the initial sequence
number within the playout unit(6). Carrying both a continuous and
packet-within-frame count achieves the same effect.
continuous with last-packet option: Packets carry a continuous sequence
number plus an option in every packet indicating the last sequence
number within the playout unit. This has the advantage that the
receiver can readily detect when the last packet for a playout
unit has been received. The transmitter may not know, however, at the
beginning of a playout unit how many packets it will comprise. Also,
the position within the playout unit is more difficult to determine if
the initial packet is lost.
It could be argued that encoding-specific location information should be
contained within the media part, as it will likely vary in format and use
from one medium to the next.
3.7 Source Identification
3.7.1 Gateways, Reflectors and End Systems
It is necessary to be able to identify the origin of the real-time data in
terms meaningful to the application. First, this is required to demultiplex
sites (or sources) within the same conference. Secondly, it allows an
indication of the currently active source.
Currently, NVP makes no explicit provisions for this, assuming that the
network source address can be used. This may fail if intermediate agents
intervene between the media source and final destination. Consider the
example in Fig. 1. An RTP-level gateway is defined as an entity that
transforms either the RTP header or the RTP media data or both. Such
a gateway could for example merge two successive packets for increased
------------------------------
6. suggested by Steve Casner
transport efficiency or, probably the most common case, translate media
encodings for each stream, say from PCM to LPC (called transcoding).
A synchronizing gateway is defined here as a gateway that recreates a
synchronous media stream, possibly after mixing several sources. An
application that mixes all incoming streams for a particular conference,
recreates a synchronous audio stream and then forwards it to a set of
receivers is an example of a synchronizing gateway. A synchronizing gateway
could be built from two end system applications, with the first application
feeding the media output to the media input of the second application and
vice versa.
In figure 1, the gateways are used to translate audio encodings, from PCM
and ADPCM to LPC. The gateway could be either synchronizing or not. Note
that a resynchronizing gateway is only necessary if audio packets depend on
their predecessors and thus cannot be transcoded independently. It may be
advantageous if the packetization interval can be increased. Also, for
connections that are barely able to handle one active source at a time,
mixing at the gateway avoids excessive queueing delays when several sources
are active at the same time. A synchronizing gateway has the disadvantage
that it always increases the end-to-end delay.
We define reflectors as transport-level entities that translate between
transport protocols, but leave the RTP protocol unit untouched. In the
figure, the reflector connects a multicast group to a group of hosts that
are not multicast capable by performing transport-level replication.
We define an end system as an entity that receives and generates media
content, but does not forward it.
We define three types of sources: the media source is the actual origin of
the media, e.g., the talker in an audiocast; a synchronization source is the
combination of several media sources with its own timing; the network source
is the network-level origin as seen by the end system receiving the media.
The end system has to synchronize its playout with the synchronization
source, indicate the active party according to the media source and return
media to the network source. If an end system receives media through a
resynchronizing gateway, the end system will see the gateway as the network
and synchronization source, but the media sources should not be affected.
The reflector does not affect the media or synchronization sources, but
the reflector becomes the network source. (Note that having the reflector
change the IP source address is not possible since the end systems need to
be able to return their media to the reflector.)
vat audio packets include a variable-length list of at most 64 4-byte
identifiers containing all media sources of the packet. However, there is
no convenient way to distinguish the synchronization source from the network
source. The end system needs to be able to distinguish synchronization
sources because jitter computation and playout delay differ for each
synchronization source.
     /-------\            +------+
    |         |   ADPCM   |      |--\  LPC
    |  group  |<--------->|  GW  |   \          /------ end system
    |         |           |      |    \        /
     \-------/            +------+     \      /
                                    reflector >--------- end system
     /-------\            +------+     /      \
    |         |    PCM    |      |    /        \
    |  group  |<--------->|  GW  |---/  LPC     \------ end system
    |         |           |      |
     \-------/            +------+

                        <---> multicast

                   Figure 1: Gateway topology
Rather than having the gateway (which may be unaware of the existence of
reflectors downstream) insert a synchronization source identifier or
having the reflector know about the internal structure of RTP packets, the
current ad-hoc encapsulation solution used by Nevot may be sufficient: the
reflector simply prefixes the true network address (and port?) of the
last source (either the gateway or media source, i.e., the synchronization
source) to the RTP packet. Thus, each end system and gateway has to
be aware whether it is being served by a reflector. Also, multiple
concatenated reflectors are difficult to handle.
3.7.2 Address Format Issues
The limitation to four bytes of addressing information may not be desirable
for a number of reasons. Currently, it is used to hold an IP address. This
works as long as four bytes are sufficient to hold an identifier that is
unique throughout the conference and as long as there is only one media
source per IP address. The latter assumption tends to be true for many
current workstations, but it is easy to imagine scenarios where it might not
be, e.g., a system could hold a number of audio cards, could have several
audio channels (Silicon Graphics systems, for example) or could serve as a
multi-line telephone interface.(7)
The combination of IP address and source port can identify multiple sources
per site if each media source uses a different network source port. It
does not seem appropriate to force applications to allocate ports just to
distinguish sources. In the PBX example a single output port would appear
------------------------------
7. If we are willing to forego the identification with a site, we could
have a multiple-audio channel site pick unused IP addresses from the local
network and associate them with the second and following audio ports.
------------------------------
to be the appropriate method for sending all incoming calls across the
network.
Given the discussion of longer address formats, it seems appropriate, at
least in the longer term, to consider allowing for variable-length identifiers.
Ideally, the identifier would identify the agent, not a computer or network
interface.(8) A currently viable implementation is the concatenation of
the IP address and some locally unique number. The meaning of the local
discriminator is opaque to the outside world; it appears to be generally
easier to have a local unique id service than a distributed version thereof.
Possibilities for the local discriminator include the numeric process
identifier (plus some distinguishing information within the application),
the network source port number or a numeric user identifier.
For efficiency in the common case of one source per workstation, the
convention (used in vat) of using the network source address, possibly
combined with the user id or source port, as media and synchronization
source should be maintained.
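As an illustration of this convention, the sketch below (a hedged example;
the function name is not part of this proposal) concatenates a two-byte
local discriminator with the four-byte IP address to form the six-byte
identifier later used by the options of Section 5.1:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: build the six-byte source identifier from a
     * two-byte local discriminator (e.g. numeric user id or source port)
     * and the four-byte IPv4 address, in the order used by the MSRC/SSRC
     * options.  Both inputs are assumed to be in network byte order. */
    static void make_source_id(uint8_t out[6], uint16_t local_net,
                               uint32_t ip_net)
    {
        memcpy(out,     &local_net, 2);
        memcpy(out + 2, &ip_net,    4);
    }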
3.8 Energy Indication
G.764 contains a 4-bit noise energy field, which encodes the white noise
energy to be played by the receiver in the silences between talkspurts.
Playing silence periods as white noise reduces noise pumping, the effect
where background noise that is audible during a talkspurt is conspicuously
absent at the receiver during silence periods. Substituting white noise for
silence periods at the receiver is not recommended for multi-party
conferences, as the summed background noise from all silent parties would be
distracting. Determining the proper noise level appears to be difficult. It
is suggested that the receiver simply take the energy of the last packet
received before the beginning of a silence period as an indication of the
background noise.
With this mechanism, an explicit indication in the packet header is not
required.
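A minimal sketch of this receiver-side heuristic follows; it assumes the
samples of the last packet have already been expanded to 16-bit linear form,
and the function name is illustrative only:

    #include <stdint.h>
    #include <stdlib.h>

    /* Estimate the background-noise level from the last packet received
     * before a silence period, using the mean sample magnitude. */
    static double noise_level_estimate(const int16_t *samples, size_t n)
    {
        double sum = 0.0;
        size_t i;

        for (i = 0; i < n; i++)
            sum += abs(samples[i]);
        return n > 0 ? sum / (double)n : 0.0;
    }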
3.9 Error Control
It remains to be decided whether the header, the whole packet or neither
should be protected by checksums. NVP protects its header only, while G.764
------------------------------
8. In the United States, a one-way encryption function applied to
the social security number would serve to identify human agents without
compromising the SSN itself, given that the likelihood of identical SSNs is
sufficiently small. The use of a telephone number may be less controversial
and is applicable world-wide, but may require some local coordination if
numbers are shared.
------------------------------
has a single 16-bit check sequence covering both datalink and packet voice
header. However, if UDP is used as the transport protocol, a checksum over
the whole packet is already computed by the receiver. (Checksumming for UDP
can typically be disabled by the sending or receiving host.) ST-II does
not compute checksums for either header or data. Many data link protocols
already discard packets with bit errors, so that packets are rarely rejected
due to higher-layer checksums.
Bit errors within the data part are probably easier to tolerate than a lost
packet, particularly since some media encoding formats may provide built-in
error correction. The impact of bit errors within the header can vary;
for example, errors within the timestamp may cause the audio packet to be
played out at the wrong time, probably much more noticeable than discarding
the packet. Other noticeable effects are caused by a wrong conference ID
or false encoding (if present). If a separate checksum is desired for the
cases where the underlying protocols do not already provide one, it should
be optional. Once optional, it would be easy to define several checksum
options, covering just the header, the header plus a certain part of the
body or the whole packet.
A checksum can also be used to detect whether the receiver has the correct
decryption key, avoiding noise or (worse) denial-of-service attacks. For
that application, the checksum should be computed across the whole packet,
before encrypting the content. Alternatively, a well-known signature could
be added to the packet and included in the encryption, as long as known
plaintext does not weaken the encryption security.
Recommendation: optional for header; if not used, 4-byte signature in data.
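The signature alternative could be realized along the following lines; the
4-byte value and the helper names are illustrative, since neither is fixed
by this draft:

    #include <stdint.h>
    #include <string.h>

    /* Illustrative 4-byte signature; the draft does not specify a value. */
    static const uint8_t RTP_SIGNATURE[4] = { 0x52, 0x54, 0x50, 0x00 };

    /* Sender side: append the signature before encryption (the caller
     * must leave four bytes of room at the end of the buffer). */
    static size_t append_signature(uint8_t *buf, size_t len)
    {
        memcpy(buf + len, RTP_SIGNATURE, sizeof RTP_SIGNATURE);
        return len + sizeof RTP_SIGNATURE;
    }

    /* Receiver side: after decryption, a wrong key will almost certainly
     * garble the signature, so the packet can be discarded silently. */
    static int key_looks_correct(const uint8_t *buf, size_t len)
    {
        if (len < sizeof RTP_SIGNATURE)
            return 0;
        return memcmp(buf + len - sizeof RTP_SIGNATURE,
                      RTP_SIGNATURE, sizeof RTP_SIGNATURE) == 0;
    }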
3.10 Security
3.10.1 Encryption
Only encryption can provide privacy as long as intruders can monitor the
channel. It is desirable to specify an encryption algorithm and provide
implementations without export restrictions. DES is widely available
outside the United States and could easily be added even to binary-only
applications by dynamic linking.
We have the choice of either encrypting both the header and data or only
the data. Encrypting the header denies the intruder knowledge about some
conference details (for example, who the participants are, although this is
only true as long as the UDP source address does not already reveal that
information). It also allows some heuristic detection of key mismatches, as
the version identifier, timestamp and other header information are somewhat
predictable. However, header encryption makes packet traces and debugging
by external programs difficult.
Public key cryptography does not work for true multicast systems since the
public encoding key for every recipient differs, but it may be appropriate
when used in two-party conversations or application-level multicast. In
that case, mechanisms similar to privacy enhanced mail will probably be
appropriate. Key distribution for non-public key encryption is beyond the
scope of this recommendation.
For one-way applications, it may be desirable to prohibit listeners from
interrupting the broadcast. (After all, since live lectures on campus
get disrupted fairly often, there is reason to fear that a sufficiently
controversial lecture carried on the Internet would suffer a similar fate.)
Again, asymmetric encryption can be used. Here, the decryption key is
made available to all receivers, while the encryption key is known only
to the legitimate sender. Current public-key algorithms are probably too
computationally intensive for all but low-bit-rate voice. In most cases,
filtering based on sources will be sufficient.
3.10.2 Authentication
The usual message digest methods are applicable if only the integrity of the
message is to be protected against spoofing.
3.11 Quality of Service Control
Because real-time services cannot afford retransmissions, they are
immediately affected by packet loss and delays. For debugging and
monitoring purposes, it is useful to know exactly where and why losses
occur. Losses occur either within the network or because of excessive
delay within the application. To determine the fraction of losses and the
amount of network loss, knowledge about the number of frames transmitted
is required. A packet sequence number with sufficient range provides the
most reliable and easiest to implement method of gauging packet loss. If
a sequence number is not available, it is difficult, if not impossible, for the
receiver to get an accurate count of the packets transmitted. Thus, the
following RTCP service is suggested for that case.
An RTCP message of type PC (packet count) contains two 32-bit integers, one
containing the timestamp when the measurement was taken, the second the
number of transmitted samples, bytes, packets, or the amount of audio/video
measured in seconds, expressed as a 16/16 timestamp. To make it easier
for the receiver to use that information, the sample should be taken at
a synchronization point, indicated by the synchronization bit in the data
packet (see Section 3.5.1). Since this field is intended to measure network
packet loss, a packet or byte count would be the simplest to maintain, as
the meaning of sample depends on the packet content, for example the number
of channels, the encoding, whether it's audio or video and so on.
The receiver simply stores the number of received samples at each
synchronization point and then, after receiving the PC packet, can determine
the fraction of packets lost so far. Packet reordering may introduce
a slight inaccuracy if the packet sent before the synchronization point
arrives afterwards. Given that there typically is a gap between that
last packet and the synchronization point, this occurrence should be
sufficiently unlikely as to leave the loss measurement accurate enough for
QOS monitoring. This method avoids cumulative errors inherent in estimates
based purely on timestamps.
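The receiver-side bookkeeping could look as follows; the structure and
function names are illustrative, not part of the proposal:

    /* Count packets as they arrive and latch the count at each
     * synchronization point; when the matching PC item arrives, the
     * fraction of packets lost so far follows directly. */
    struct pc_state {
        unsigned long received_total;    /* packets received so far        */
        unsigned long received_at_sync;  /* count latched at last sync point */
    };

    static double fraction_lost(const struct pc_state *s,
                                unsigned long sent_at_sync)
    {
        if (sent_at_sync == 0)
            return 0.0;
        /* sent_at_sync comes from the PC item; received_at_sync was
         * recorded when the packet with the synchronization bit arrived. */
        return 1.0 - (double)s->received_at_sync / (double)sent_at_sync;
    }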
4 Conference Control Protocol
Currently, only conference control functions used for loose conferences
(open admission, no explicit conference set-up) have been considered in
depth. Support for the following functionality needs to be specified:
o authentication
o floor control, token passing
o invitations, calls
o discovery of conferences and resources (directory service)
o media, encoding and quality-of-service negotiation
o voting
o conference scheduling
The functional specification of a conference control protocol is beyond the
scope of this draft.
5 Packet Format
Given the above technical justifications, the following packet formats are
proposed:
5.1 Data
The data packet header format is shown in Figure 2. The optional 16-bit
framing field and the optional 32-bit IP address designating the network
source are not shown. All integer fields are in network byte order (most
significant byte first).
The content of the fields is defined as follows:
protocol version: two-bit version identifier. The initial version number
is one. The value of zero is reserved for the current vat protocol.
sync (S): synchronization bit, described in Section 3.5.1.
media: media encoding. The five bits form an index into a table of
encodings defined out-of-band. If no mapping has been defined, a
standard mapping to be specified by the IANA is used. The value of
zero is reserved and indicates that the encoding is carried as an
option of type MEDIA. The value of one is reserved and indicates that
the encoding is specified in RTCP packets or the conference control
protocol. If a packet with a media field value of one arrives and no
encoding is known from the conference control protocol, the receiver
should defer playing these packets until a control packet has been
received. If the packet does not contain a MEDIA option, the last
defined encoding is used.
option length: number of 32-bit long words contained within the options
immediately following the header.
sequence number: 16-bit sequence number counting packets.
timestamp: timestamp, reflecting real time. The timestamp consists of the
middle 32 bits of an NTP timestamp.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Ver|S|  media  | option length |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      timestamp (seconds)      |     timestamp (fraction)      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Figure 2: RTP data packet format
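For illustration, the fixed header of Figure 2 could be laid out and filled
as sketched below; the names are not part of this proposal, and the first
byte is packed by hand because C bit-field layout is implementation-dependent:

    #include <stdint.h>
    #include <arpa/inet.h>   /* htons, htonl */

    /* Sketch of the fixed eight-byte header of Figure 2. */
    struct rtp_header {
        uint8_t  ver_s_media;   /* 2-bit version, 1-bit sync, 5-bit media  */
        uint8_t  option_length; /* number of 32-bit option words following */
        uint16_t sequence;      /* packet sequence number                  */
        uint16_t ts_seconds;    /* 16/16 timestamp: integer seconds        */
        uint16_t ts_fraction;   /* 16/16 timestamp: fraction of a second   */
    };

    static void rtp_fill_header(struct rtp_header *h, int sync,
                                unsigned media, uint16_t seq,
                                uint32_t unix_secs, uint32_t usecs)
    {
        /* Version 1 in the top two bits, sync bit, 5-bit media index. */
        h->ver_s_media   = (uint8_t)((1u << 6) |
                                     ((sync ? 1u : 0u) << 5) |
                                     (media & 0x1f));
        h->option_length = 0;
        h->sequence      = htons(seq);
        /* Middle 32 bits of an NTP timestamp: NTP seconds are Unix seconds
         * plus the 1900-1970 epoch offset; the fraction is the top 16 bits
         * of the microsecond count scaled by 2^16 / 10^6. */
        h->ts_seconds  = htons((uint16_t)((unix_secs + 2208988800UL) & 0xffff));
        h->ts_fraction = htons((uint16_t)(((uint64_t)usecs << 16) / 1000000));
    }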
The packet header is followed by options, if any, and the media data.
Optional fields are summarized in Table 4. Unless otherwise noted, each
option may appear only once per packet. Each packet may contain any number
of options. Each option consists of the one-byte option type designation,
followed by a one-byte length field denoting the total number of 32-bit
words comprising the option, followed by any option-specific data. Options
are aligned to the natural length of the field, i.e., 16-bit words are
aligned on even addresses, 32-bit words are aligned at addresses divisible
by four, etc. Options unknown to the application are to be ignored. The
MEDIA option, if present, must precede all other options whose interpretation
depends on the current encoding. Currently, no such options are defined.
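An option list could be traversed as sketched below; the function is
illustrative and omits per-option processing:

    #include <stdint.h>
    #include <stddef.h>

    /* Walk the option list: 'buf' points to the first option, 'words' is
     * the option length field from the fixed header.  Each option starts
     * with a one-byte type and a one-byte length in 32-bit words; unknown
     * types are simply skipped. */
    static int walk_options(const uint8_t *buf, unsigned words)
    {
        size_t total = (size_t)words * 4;
        size_t off = 0;

        while (off + 2 <= total) {
            uint8_t type = buf[off];
            size_t  len  = (size_t)buf[off + 1] * 4;

            if (len == 0 || off + len > total)
                return -1;                 /* malformed option list */

            /* switch (type) { case MSRC: ... case MEDIA: ... } goes here;
             * options the application does not understand are ignored. */
            (void)type;

            off += len;
        }
        return 0;
    }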
type___description______________________________________
MSRC Globally unique media source identifier. A
packet may contain multiple options of this
type, indicating all contributors. A source
is identified by a globally unique six-byte
string. The concatenation of a two-byte
numeric user id unique within the system
followed by a four-byte Internet address is
used(9). If missing, the network source is
considered the media source.
SSRC Globally unique synchronization source identi-
fier. The format is the same as for the MSRC
option. If missing, the network source is
considered the synchronization source.
MEDIA media encoding identification, as discussed
in Section 3.4. The first byte designates
the encoding, with values of 128 through 255
reserved for experimental encodings. Values
of 0 through 127 are assigned by the IANA.
Encoding-specific parameters follow. The
parameter string is padded with zeros until
the option has a length divisible by four.
For audio encodings, a single byte contains a
two-bit channel count in the most significant
bits and a six-bit index into an IANA-defined
table of sampling frequencies in the least
significant bits. An index value of zero
designates the natural sampling frequency
defined for each encoding.
ENERG Energy indication. The length and interpre-
tation of this field are media-dependent and
specified for each encoding. The ENERG field
must follow the MEDIA field, if present.
BOP (beginning of playout unit) 16-bit sequence
number designating the first packet within the
current playout unit.
Table 4: Optional fields
5.2 Control Packets
The scope of RTCP is meant to be limited to a single medium, conveying
minimal out-of-band state information during a conference. Thus, any means
of providing reliability are beyond its scope. A version field is not
needed since new control message types can be defined readily. Control
packets are sent periodically to the same multicast group as data packets,
using the same time-to-live value. The period should be varied randomly
to avoid synchronization of all sources. The period determines how long a
new receiver has to wait, in the worst case, until it can identify the
source. The control packets defined here extend the functionality found in
vat session packets.
Control packets consist of one or more items using the same format and
alignment as options within the data packet. Non-overlapping type numbers
for data packet options and control message items are to be assigned, so
that control information could be carried in data packets if so desired.
The packet format is shown in Fig. 3, while the item types are defined in
Table 5. Padding is used to align fields to multiples of four bytes. The
value used for padding is undefined.
A Port Assignment
Since it is anticipated that UDP and similar port-oriented protocols will
play a major role in carrying RTP traffic, the issue of port assignment
needs to be addressed. The way ports are assigned mainly affects how
applications can extract the packets destined for them. For each medium,
there also needs to be a mechanism for distinguishing data from control
packets.
For unicast UDP, only the port number is available for demultiplexing.
Thus, each medium will need a separate port number pair unless a separate
demultiplexing agent is used. However, for one-to-one connections,
dynamically negotiating a port number is easy. If several UDP streams are
used to provide multicast, the port number issue becomes more thorny.
For connection-oriented protocols like ST-II or TCP, only packets for a
particular connection reach the application.
For UDP multicast, an application can select to receive only packets with a
particular port number and multicast address by binding to the appropriate
multicast address. Thus, for UDP multicast, there is no need to distinguish
media by port numbers, as each medium is assumed to have its designated and
unique multicast group. Any dynamic port allocation mechanism would fail
for large, dynamic multicast groups, but might be appropriate for small
conferences and two-party conversations.
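A receiver could therefore obtain exactly the packets of one medium as
sketched below; the group address and port are placeholders, not assignments
made by this draft:

    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    /* Bind a UDP socket to the medium's data port and join its multicast
     * group, so that only packets for that (group, port) pair arrive. */
    static int open_rtp_data_socket(const char *group, uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            close(fd);
            return -1;
        }

        struct ip_mreq mreq;
        mreq.imr_multiaddr.s_addr = inet_addr(group);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                       &mreq, sizeof mreq) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

On systems that permit it, binding to the multicast address itself instead
of INADDR_ANY restricts delivery further to that single group.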
Data and control packets for a single medium can either share a single port
or use two different port numbers. (Currently, two adjacent port numbers
are used.) A single port for data and control simplifies the receiver code
and conserves port numbers. It requires some other means of identifying
control packets, for example as a special media code, and does not allow the
sharing of a single control port by several applications.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  type=ENERG   |   length=1    |         energy level          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=MSRC   |   length=2    |            user id            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  IP address of media source                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=SSRC   |   length=2    |            user id            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|             IP address of synchronization source              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  type=MEDIA   |   length=2    |   encoding    |ch#|sampling f.|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 encoding-specific parameters                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   type=BOP    |   length=1    |  first seq.# in playout unit  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 3: RTCP packet format
type___description______________________________________
ID The media or synchronization source identi-
fier, using the same 6-byte format as the
MSRC and SSRC options. This identifier
applies to all following fields until the
next ID field. This results in more compact
coding when application gateways are used and
allows aggregation of several sources into one
control message.
ALIAS A variable-length string padded with zeros so
that the total length of the item, including
the type and length bytes, is a multiple
of four bytes. The content of the field
describes the media source identified by the
most recent ID item, for example, by giving
the name and affiliation of the talker or
the call letters of the radio station being
rebroadcast. The content is not specified or
authenticated. The text is encoded as 7-bit
US ASCII values from 32 to 127 (decimal). The
escape mechanism for character sets other than
US-ASCII text remains to be defined
(ISO 2022?).
DESC Media content description, with the same
format as ALIAS. The field describes the
current media content. Example applications
include the session title for a conference
distribution, or the current program title
for radio or television redistribution through
packet networks.
BYE The site specified by the most recent ID field
requests to be dropped from the conference.
No further data. Padded to 32 bit word
length.
PC 16 bits of padding are followed by a
16/16 32-bit timestamp (same format as the
synchronization timestamp) and a 32 bit packet
count. The item specifies the number of
packets transmitted by the sender of this RTCP
message up to the time specified.
TIME 16 bits of padding are followed by the wallclock
time and the media clock, both expressed as
16/16 timestamps.
MEDIA media description (see Table 4)
Table 5: The RTCP message types
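As an illustration of the item format, the following sketch appends an ID
item followed by an ALIAS item to a control packet; the numeric type codes
are placeholders (the assignments are not given here) and bounds checking is
omitted for brevity:

    #include <stdint.h>
    #include <string.h>

    /* Placeholder type codes; the numeric assignments are not given here. */
    enum { ITEM_ID = 32, ITEM_ALIAS = 33 };

    /* Append an ID item (six-byte source identifier) and an ALIAS item
     * (zero-padded text) to an RTCP buffer; returns the new length.
     * Item layout: one-byte type, one-byte length in 32-bit words, data. */
    static size_t add_id_and_alias(uint8_t *buf, size_t off,
                                   const uint8_t id[6], const char *alias)
    {
        /* ID item: 2 bytes of type/length + 6 bytes of identifier = 2 words. */
        buf[off]     = ITEM_ID;
        buf[off + 1] = 2;
        memcpy(buf + off + 2, id, 6);
        off += 8;

        /* ALIAS item: pad the text with zeros so the total item length,
         * including type and length bytes, is a multiple of four. */
        size_t text  = strlen(alias);
        size_t total = ((2 + text) + 3) & ~(size_t)3;
        buf[off]     = ITEM_ALIAS;
        buf[off + 1] = (uint8_t)(total / 4);
        memset(buf + off + 2, 0, total - 2);
        memcpy(buf + off + 2, alias, text);
        off += total;
        return off;
    }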
Using a single RTCP stream for several media may be advantageous to
avoid duplicating, for example, the same identification information for
voice, video and whiteboard streams. This works only if there is one
multicast group that all members of a conference subscribe to. Given
the relatively low frequency of control messages, the coordination effort
between applications and the necessity to designate control messages for a
particular medium are probably reasons enough to have each application send
control messages to the same multicast group as the data.
In conclusion, for multicast UDP, two assigned port numbers, one for data
and one for control, seem to offer the most flexibility.
B Multicast Address Allocation
A fixed allocation of network multicast addresses to conferences is clearly
not feasible, since the lifetime of conferences is unknown, the potential
number of conferences is rather large and the available number space is
limited to about 2^28 addresses, of which 2^16 have been assigned to conferences.
Dynamic allocation of addresses without intervention of some centralized
clearing house mechanism appears to be difficult. One approach would be
akin to carrier sense multiple access: A conference originator would listen
on a randomly selected multicast address using the session port (it is left
as an exercise to the reader to imagine what happens if a data port is
used). Within a small multiple of the session announcement interval (with
vat, this interval averages six seconds), we would have some indication
of whether the address is in use. This technique may fail for a number
of reasons. First, collisions are possible if the same multicast address is
checked nearly simultaneously, although this is unlikely as long as the
number space is only sparsely utilized. More seriously, it is quite possible that
multicast islands using the same multicast group are unaware of each other
as they are isolated due to time-to-live restrictions or temporary network
interruptions. It is clearly undesirable to be forced to renegotiate a new
multicast address in the middle of a conference because time-to-live values
or network connectivity have changed.
It appears to the author that since multicasting takes place at the
IP-level, we would have to check all potential ports to avoid drawing
multicast traffic with the same group but different destination port towards
us. Some IP-level mechanism would have to be added to the kernel to avoid
having to scan all ports.
A probe packet sent with maximum time-to-live to the desired address would
avoid missing time-to-live-isolated islands and detect temporarily idle
multicast groups, but would impose a rather severe load on the network,
without solving temporary network splittings. Probe packets and responses
could also get lost. Using probe packets also requires an agreement that
all potential users of the range of multicast addresses would indeed respond
to a probe packet.
Using the conference identifier at the RTP level to detect collisions may
have severe performance consequences for both the network and the receiving
host if the conference sharing the same multicast group happens to send
high-bandwidth data.
One solution would be to provide a hierarchical allocation of addresses.
Here, the originator of a conference asks the nearest address provider for
an available address. The provider in turn asks the next level up (for
example, the regional network) or a peer if it had temporarily run out of
addresses. The conference originator would be responsible for returning the
address after use. The return of addresses after use raises the issue of
what happens if either the requesting agent or the issuer of the address
crashes. A timeout mechanism is probably most robust. Addresses could be
issued for a certain number of hours. If the original requester renews the
request before the expiration of the timeout period, it is guaranteed to
have the request granted. With that policy, requester or issuer crashes can
be handled gracefully under most circumstances. It remains to be decided
what a conference originator is supposed to do if an address renewal request
fails because the address provider has crashed or connectivity has been
lost.
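A minimal sketch of such a timeout-based lease follows; the structure and
the lease-period handling are illustrative only:

    #include <stdint.h>
    #include <time.h>

    /* The provider keeps an expiry time per issued multicast address; a
     * renewal before expiry is always granted, and an expired lease may be
     * reissued even if the original requester has crashed. */
    struct addr_lease {
        uint32_t group;     /* multicast address, network byte order */
        time_t   expires;   /* end of the current lease period       */
    };

    static int lease_expired(const struct addr_lease *l, time_t now)
    {
        return now >= l->expires;
    }

    static int renew_lease(struct addr_lease *l, time_t now, unsigned hours)
    {
        if (lease_expired(l, now))
            return -1;             /* too late: address may have been reissued */
        l->expires = now + (time_t)hours * 3600;
        return 0;
    }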
It is imaginable that each site would pay an access fee for a block of
addresses, similar to the access-speed dependent fee charged for network
connectivity within the Internet. This would provide local incentives
for each administrative domain (AD) to recoup unused addresses. Trading
of smaller address blocks between friendly ADs could accommodate peak
demands or clearing-house failures, similar to the mutual support agreements
between electrical utilities. For increased reliability, each AD could
offer multiple clearing-houses, just as it typically maintains several name
servers.
As an extension, it may be desirable to distinguish multicast addresses with
different reach. A local address would be given out with the restriction of
a maximum time-to-live value and could thus be reused at an AD sufficiently
removed, akin to the combination of cell reuse and power limitation in
cellular telephony. Given that many conferences will be local or regional
(e.g., broadcasting classes to nearby campuses of the same university or a
regional group of universities, or an electronic town meeting), this should
allow significant reuse of addresses. Reuse of addresses requires careful
engineering of thresholds and would probably only be useful for very small
time-to-live values that restrict reach to a single local area network.
The proposed allocation mechanism has no single point of failure, scales
well and conserves the addressing resources by providing appropriate
incentives, combined with local control. It requires sufficient address
space to supply the hierarchy.(10) The address allocation may or may not be
------------------------------
10. The ideas presented here are compatible with the more general proposals
handled by the same authority that provides conference naming and discovery
services.
C Glossary
The glossary below briefly defines the acronyms used within the text.
Further definitions can be found in the Internet draft
draft-ietf-userglos-glossary-00.txt
available for anonymous ftp from nnsc.nsf.net and other sites. Some of the
general Internet definitions below are copied from that glossary.
16/16 timestamp: A 32-bit integer timestamp consisting of a 16-bit field
containing the number of seconds followed by a 16-bit field containing
the binary fraction of a second. This timestamp can measure about 18.2
hours with a resolution of approximately 15 µs.
ADPCM: Adaptive differential pulse code modulation. Rather than
transmitting ! PCM samples directly, the difference between the
estimate of the next sample and the actual sample is transmitted. This
difference is usually small and can thus be encoded in fewer bits than
the sample itself. The ! CCITT recommendations G.721, G.723, G.726 and
G.727 describe ADPCM encodings.
CCITT: Comite Consultatif International de Telegraphique et Telephonique
(CCITT). This organization is part of the United Nations International
Telecommunication Union (ITU) and is responsible for making technical
recommendations about telephone and data communications systems. X.25
is an example of a CCITT recommendation. Every four years CCITT holds
plenary sessions where they adopt new recommendations. Recommendations
are known by the color of the cover of the book they are contained in.
CELP: code-excited linear prediction; audio encoding method for low-bit
rate codecs.
CD: compact disc.
codec: short for coder/decoder; device or software that ! encodes and
decodes audio or video information.
companding: reducing the dynamic range of audio or video by a non-linear
transformation of the sample values. The best known methods for audio
are µ-law, used in North America, and A-law, used in Europe and Asia.
!G.711 [10]
DAT: digital audio tape.
encoding: transformation of the media content for transmission, usually to
save bandwidth, but also to decrease the effect of transmission errors.
Well-known encodings are G.711 (µ-law PCM) and ADPCM for audio, JPEG
and MPEG for video. ! encryption
encryption: transformation of the media content to ensure that only the
intended recipients can make use of the information. ! encoding
end system: host where conference participants are located. RTP packets
received by an end system are played out, but not forwarded to other
hosts (in a manner visible to RTP).
frame: unit of information. Commonly used for video to refer to a single
picture. For audio, it refers to the data that forms an encoding unit.
For example, an LPC frame consists of the coefficients necessary to
generate a specific number of audio samples.
G.711: ! CCITT recommendation for ! PCM audio encoding at 64 kb/s using
µ-law or A-law companding.
G.764: ! CCITT recommendation for packet voice; specifies both ! HDLC-like
data link and network layer. In the draft stage, this standard was
referred to as G.PVNP. The standard is primarily geared towards digital
circuit multiplication equipment used by telephone companies to carry
more voice calls on transoceanic links.
G.PVNP: designation of CCITT recommendation ! G.764 while in draft status.
GSM: Groupe Special Mobile. In general, designation for the European mobile
telephony standard. In particular, often used to describe the 13 kb/s
audio coding used.
H.261: ! CCITT recommendation for the compression of motion video at rates
of p x 64 kb/s. Originally intended for narrowband !ISDN.
hangover: Audio data transmitted after the silence detector indicates that
no audio data is present. Hangover ensures that the ends of words,
important for comprehension, are transmitted even though they are often
of low energy.
HDLC: high-level data link control; standard data link layer protocol
(closely related to LAPD and SDLC).
ICMP: Internet Control Message Protocol; ICMP is an extension to the
Internet Protocol. It allows for the generation of error messages,
test packets and informational messages related to ! IP.
in-band: signaling information is carried together (in the same channel or
packet) with the actual data. ! out-of-band.
IP: internet protocol; the Internet Protocol, defined in RFC 791, is the
network layer for the TCP/IP Protocol Suite. It is a connectionless,
best-effort packet switching protocol [11].
IP address: four-byte binary host interface identifier used by !IP for
addressing. An IP address consists of a network portion and a
host portion. RTP treats IP addresses as globally unique, opaque
identifiers.
IPv4: current version (4) of ! IP.
ISDN: integrated services digital network; refers to an end-to-end circuit
switched digital network intended to replace the current telephone
network. ISDN offers circuit-switched bandwidth in multiples of 64
kb/s (B or bearer channel), plus a 16 kb/s packet-switched data (D)
channel.
JPEG: joint photographic experts group. Designation of a variable-rate
compression algorithm using discrete cosine transforms for still-frame
color images.
LPC: linear predictive coder. Audio encoding method that models speech as
the parameters of a linear filter; used for very low bit rate codecs.
loosely controlled conference: Participants can join and leave the
conference without connection establishment or notifying a conference
moderator. The identity of conference participants may or may not be
known to other participants. See also: tightly controlled conference.
MPEG: motion picture experts group. Designates a variable-rate compression
algorithm for full motion video at low bit rates; uses both intraframe
and interframe coding.
media source: entity (user and host) that produced the media content.
It is the entity that is shown as the active participant by the
application.
MTU: maximum transmission unit; the largest frame length which may be sent
on a physical medium.
Nevot: network voice terminal; application written by the author.
network source: entity denoted by address and port number from which the !
end system receives the RTP packet and to which the end system sends any
RTP packets for that conference in return.
NVP: network voice protocol, original packet format used in early packet
voice experiments; defined in RFC 741 [3].
OSI: Open System Interconnection; a suite of protocols, designed by
ISO committees, to be the international standard computer network
architecture.
out of band: signaling and control information is carried in a separate
channel or separate packets from the actual data. For example, ICMP
carries control information out-of-band, that is, as separate packets,
for IP, but both ICMP and IP usually use the same communication channel
(in band).
PCM: pulse-code modulation; speech coding where speech is represented by a
given number of fixed-width samples per second. Often used for the
coding employed in the telephone network: 64,000 eight-bit samples per
second.
playout: Delivery of the medium content to the final consumer within the
receiving host. For audio, this implies digital-to-analog conversion,
for video, display on a screen.
PVP: packet video protocol; extension of ! NVP to video data [12]
SB: subband; as in subband codec. Audio or video encoding that splits the
frequency content of a signal into several bands and encodes each band
separately, with the encoding fidelity matched to human perception for
that particular frequency band.
RTCP: real-time control protocol; adjunct to ! RTP.
RTP: real-time transport protocol; discussed in this draft.
ST-II: stream protocol; connection-oriented unreliable, non-sequenced
packet-oriented network and transport protocol with process demulti-
plexing and provisions for establishing flow parameters for resource
control; defined in RFC 1190 [1].
TCP: transmission control protocol; An Internet Standard transport layer
protocol defined in RFC 793. It is connection-oriented and
stream-oriented, as opposed to UDP [13].
TPDU: transport protocol data unit.
tightly controlled conference: Participants can join the conference only
after an invitation from a conference moderator. The identity of all
conference participants is known to the moderator. !loosely controlled
conference.
transcoder: device or application that translates between several
encodings, for example between ! LPC and ! PCM.
UDP: user datagram protocol; unreliable, non-sequenced connectionless
transport protocol defined in RFC 768 [14].
vat: Visual audio application (voice terminal) written by Steve McCanne and
Van Jacobson.
vt: Voice terminal software written at the Information Sciences Institute.
VMTP: Versatile message transaction protocol; defined in RFC 1045 [15].
D Address of Author
Henning Schulzrinne
AT&T Bell Laboratories
MH 2A244
600 Mountain Avenue
Murray Hill, NJ 07974
telephone: 908 582-2262
electronic mail: hgs@research.att.com
References
[1] S. Casner, C. Lynn, Jr., P. Park, K. Schroder, and C. Topolcic,
``Experimental internet stream protocol, version 2 (ST-II),'' Network
Working Group Request for Comments RFC 1190, Information Sciences
Institute, Oct. 1990.
[2] C. Topolcic, ``ST II,'' in First International Workshop on Network and
Operating System Support for Digital Audio and Video, no. TR-90-062 in
ICSI Technical Reports, (Berkeley, CA), 1990.
[3] D. Cohen, ``Specification for the network voice protocol (nvp),''
Network Working Group Request for Comment RFC 741, ISI, Jan. 1976.
[4] N. Borenstein and N. Freed, ``MIME (multipurpose internet mail
extensions) mechanisms for specifying and describing the format of
internet message bodies,'' Network Working Group Request for Comments
RFC 1341, Bellcore, June 1992.
[5] J. G. Gruber and L. Strawczynski, ``Subjective effects of variable
delay and speech clipping in dynamically managed voice systems,'' IEEE
Transactions on Communications, vol. COM-33, pp. 801--808, Aug. 1985.
[6] N. S. Jayant, ``Effects of packet losses in waveform coded speech and
improvements due to an odd-even sample-interpolation procedure,'' IEEE
Transactions on Communications, vol. COM-29, pp. 101--109, Feb. 1981.
[7] D. Minoli, ``Optimal packet length for packet voice communication,''
IEEE Transactions on Communications, vol. COM-27, pp. 607--611, Mar.
1979.
[8] V. Jacobson, ``Compressing TCP/IP headers for low-speed serial
links,'' Network Working Group Request for Comments RFC 1144, Lawrence
Berkeley Laboratory, Feb. 1990.
[9] D. L. Mills, ``Network time protocol (version 2) --- specification and
implementation,'' Network Working Group Request for Comments RFC 1119,
University of Delaware, Sept. 1989.
[10] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood
Cliffs, NJ: Prentice Hall, 1984.
[11] J. Postel, ``Internet protocol,'' Network Working Group Request for
Comments RFC 791, Information Sciences Institute, Sept. 1981.
[12] R. Cole, ``PVP - a packet video protocol,'' W-Note 28, Information
Sciences Institute, University of Southern California, Los Angeles,
CA, Aug. 1981.
[13] J. B. Postel, ``DoD standard transmission control protocol,'' Network
Working Group Request for Comments RFC 761, Information Sciences
Institute, Jan. 1980.
[14] J. B. Postel, ``User datagram protocol,'' Network Working Group
Request for Comments RFC 768, ISI, Aug. 1980.
[15] D. R. Cheriton, ``VMTP: Versatile Message Transaction Protocol
specification,'' in Network Information Center RFC 1045, (Menlo Park,
CA), pp. 1--123, SRI International, Feb. 1988.