By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”
The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.
This Internet-Draft will expire on April 11, 2009.
This document specifies the payload format for packetization of the G.719 full-band codec encoded audio signals into the Real-time Transport Protocol (RTP). The payload format supports transmission of multiple channels, multiple frames per payload, and interleaving.
1. Introduction
2. Definitions and Conventions
3. G.719 Description
4. Payload format Capabilities
    4.1. Multi-rate Encoding and Rate Adaptation
    4.2. Support for Multi-Channel Sessions
    4.3. Robustness against Packet Loss
        4.3.1. Use of Forward Error Correction (FEC)
        4.3.2. Use of Frame Interleaving
5. Payload format
    5.1. RTP Header Usage
    5.2. Payload Structure
        5.2.1. Basic ToC element
    5.3. Basic mode
    5.4. Interleaved mode
    5.5. Audio Data
    5.6. Implementation Considerations
        5.6.1. Receiving Redundant Frames
        5.6.2. Interleaving
        5.6.3. Decoding Validation
6. Payload Examples
    6.1. 3 mono frames with 2 different bitrates
    6.2. 2 stereo frame-blocks of the same bitrate
    6.3. 4 mono frames interleaved
7. Payload Format Parameters
    7.1. Media Type Definition
    7.2. Mapping to SDP
        7.2.1. Offer/Answer Considerations
        7.2.2. Declarative SDP Considerations
8. IANA Considerations
9. Congestion Control
10. Security Considerations
    10.1. Confidentiality
    10.2. Authentication and Integrity
11. Acknowledgements
12. References
    12.1. Informative References
    12.2. Normative References
§ Authors' Addresses
§ Intellectual Property and Copyright Statements
1. Introduction
This document specifies the payload format for packetization of the G.719 full-band (FB) codec encoded audio signals into the Real-time Transport Protocol (RTP) [RFC3550] (Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” July 2003.). The payload format supports transmission of multiple channels, multiple frames per payload, and packet loss robustness methods using redundancy or interleaving.
This document starts with conventions, a brief description of the codec, and the payload format's capabilities. The payload format is specified in Section 5 (Payload format). Examples can be found in Section 6 (Payload Examples). The media type, its mapping to SDP, and its usage in SDP offer/answer are then specified. The document ends with considerations around congestion control and security.
2. Definitions and Conventions
The term "frame-block" is used in this document to describe the time-synchronized set of audio frames in a multi-channel audio session. In particular, in an N-channel session, a frame-block will contain N audio frames, one from each of the channels, and all N audio frames represent exactly the same time period.
This document contains depictions of bit fields. The most significant bit is always leftmost in the figure on each row and has the lowest enumeration. For fields that are depicted over multiple rows, the upper row is more significant than the next.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 (Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” March 1997.) [RFC2119].
3. G.719 Description
The ITU-T G.719 full-band codec is a transform coder based on Modulated Lapped Transform (MLT). G.719 is a low complexity full bandwidth codec for conversational speech and audio coding. The encoder input and decoder output are sampled at 48 kHz. The codec enables full bandwidth, from 20 Hz to 20 kHz, encoding of speech, music and general audio content at rates from 32 kbit/s up to 128 kbit/s. The codec operates on 20ms frames and has an algorithmic delay of 40 ms.
The codec provides excellent quality for speech, music and other types of audio. Some of the applications for which this coder is suitable are:
The encoding and decoding algorithm can change the bit rate at any 20 ms frame boundary. The encoder receives the audio sampled at 48 kHz. Other sampling rates can be supported by re-sampling the input signal to the codec's sampling rate, i.e. 48 kHz; however, this functionality is not part of the standard.
The encoding is performed on equally sized frames. For each frame, the encoder decides between two encoding modes, a transient mode and a stationary mode. The decision is based on statistics derived from the input signal. The stationary mode uses a long MLT that leads to a spectrum of 960 coefficients, while the transient encoding mode uses a short MLT (higher time resolution transform) that results in 4 spectra (4 x 240 = 960 coefficients). The encoding of the spectrum is done in two steps. First, the spectral envelope is computed, quantized and Huffman encoded. The envelope is computed on a non-uniform frequency subdivision. From the coded spectral envelope, a weighted spectral envelope is derived and used for bit-allocation. This process is repeated at the decoder, so only the spectral envelope needs to be transmitted. The output of the bit-allocation is used to quantize the spectra. In addition, for stationary frames the encoder estimates the noise level. The decoder applies the reverse operation upon reception of the bit stream. The non-coded coefficients (i.e. those with no bits allocated) are replaced by entries of a noise codebook that is built from the decoded coefficients.
4. Payload format Capabilities
This payload format has a number of capabilities, and this section discusses them in some detail.
4.1. Multi-rate Encoding and Rate Adaptation
G.719 supports multi-rate encoding, which allows the encoding rate to vary on a per-frame basis. This enables support for bit-rate adaptation and congestion control. The possibility to aggregate multiple audio frames into a single RTP payload is another dimension of adaptation. The RTP and payload format overhead can thus be reduced by aggregation, at the cost of increased delay and reduced packet-loss robustness.
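The trade-off between overhead and delay can be made concrete with a small calculation. The sketch below is illustrative only and rests on stated assumptions: a 12-octet fixed RTP header without CSRC entries or extensions, a single two-octet basic-mode ToC entry (all frames of the same length, mono), and no accounting for lower-layer headers. The function name `aggregation_tradeoff` is hypothetical.

```python
RTP_HEADER_OCTETS = 12  # assumption: fixed RTP header, no CSRC list, no extension
TOC_ENTRY_OCTETS = 2    # assumption: one basic-mode ToC entry covers all frames
FRAME_MS = 20           # G.719 frame duration


def aggregation_tradeoff(frames_per_packet, frame_octets):
    """Return (overhead_fraction, extra_delay_ms) for a mono payload.

    overhead_fraction is the share of RTP packet octets that is not codec
    data; extra_delay_ms is the additional packetization delay compared
    with sending one frame per packet.
    """
    overhead = RTP_HEADER_OCTETS + TOC_ENTRY_OCTETS
    total = overhead + frames_per_packet * frame_octets
    return overhead / total, (frames_per_packet - 1) * FRAME_MS


for n in (1, 2, 4):
    fraction, delay = aggregation_tradeoff(n, 80)  # 80 octets = 32 kbit/s
    print(f"{n} frame(s): overhead {fraction:.1%}, extra delay {delay} ms")
```

Under these assumptions the overhead share falls as more frames are aggregated, while every added frame costs another 20 ms of delay and exposes more audio to a single packet loss.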
4.2. Support for Multi-Channel Sessions
The RTP payload format defined in this document supports multi-channel audio content (e.g. stereophonic or surround audio sessions). Although the G.719 codec itself does not support encoding of multi-channel audio content into a single bit stream, it can be used to separately encode and decode each of the individual channels. To transport (or store) the separately encoded multi-channel content, the audio frames for all channels that are framed and encoded for the same 20 ms period are logically collected in a "frame-block".
At the session setup, out-of-band signaling must be used to indicate the number of channels in the payload type. The order of the audio frames within the frame-block depends on the number of the channels and follows the definition in Section 4.1 of the RTP/AVP Profile (Schulzrinne, H. and S. Casner, “RTP Profile for Audio and Video Conferences with Minimal Control,” July 2003.) [RFC3551]. When using SDP for signaling, the number of channels is specified in the rtpmap attribute.
4.3. Robustness against Packet Loss
The payload format supports several means, including forward error correction (FEC) and frame interleaving, to increase robustness against packet loss.
4.3.1. Use of Forward Error Correction (FEC)
Generic forward error correction within RTP is defined, for example, in RFC 5109 [RFC5109] (Li, A., “RTP Payload Format for Generic Forward Error Correction,” December 2007.). Audio redundancy coding is defined in RFC 2198 [RFC2198] (Perkins, C., Kouvelas, I., Hodson, O., Hardman, V., Handley, M., Bolot, J., Vega-Garcia, A., and S. Fosse-Parisis, “RTP Payload for Redundant Audio Data,” September 1997.). Either scheme can be used to add redundant information to the RTP packet stream and make it more resilient to packet losses, at the expense of a higher bit rate. See either RFC for a discussion of the implications of the higher bit rate for network congestion.
In addition to these media-unaware mechanisms, this memo specifies an optional G.719 specific form of audio redundancy coding, which may be beneficial in terms of packetization overhead. Conceptually, previously transmitted transport frames are aggregated together with new ones. A sliding window can be used to group the frames to be sent in each payload. However, irregular or non-consecutive patterns are also possible by inserting NO_DATA frames between primary and redundant transmissions. Figure 1 (An example of redundant transmission) below shows an example.
--+--------+--------+--------+--------+--------+--------+--------+--
  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
--+--------+--------+--------+--------+--------+--------+--------+--

  <---- p(n-1) ---->
           <----- p(n) ----->
                    <---- p(n+1) ---->
                             <---- p(n+2) ---->
                                      <---- p(n+3) ---->
                                               <---- p(n+4) ---->

Figure 1: An example of redundant transmission
Here, each frame is retransmitted once in the following RTP payload packet. f(n-2)...f(n+4) denote a sequence of audio frames, and p(n-1)...p(n+4) a sequence of payload packets.
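The sliding-window grouping described above can be sketched in a few lines. This is an illustration rather than part of the payload format: the function name `redundant_payloads` and its `window` parameter are hypothetical, and frames are represented only by their indices.

```python
def redundant_payloads(frame_indices, window=2):
    """Group frames into payloads using a sliding window.

    Each payload carries the newest frame plus up to (window - 1)
    previously transmitted frames, oldest first. With window=2 every
    frame is sent twice: once as primary data and once as redundancy
    in the following payload.
    """
    payloads = []
    for pos in range(len(frame_indices)):
        start = max(0, pos - window + 1)
        payloads.append(list(frame_indices[start:pos + 1]))
    return payloads


for payload in redundant_payloads([0, 1, 2, 3]):
    print(payload)
```

A larger window adds more redundancy per frame, at the cost of a proportionally higher bit rate and a longer wait at the receiver for the last copy.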
The mechanism described does not strictly require signaling at the session setup. However, signaling has been defined to allow the sender to voluntarily bound the buffering and delay requirements. If nothing is signaled, the use of this mechanism is allowed and unbounded. For a certain timestamp, the receiver may receive multiple copies of a frame containing encoded audio data, even at different encoding rates. The cost of this scheme is bandwidth and the receiver delay necessary to allow the redundant copy to arrive.
This redundancy scheme provides a functionality similar to the one described in RFC 2198, but it works only if both original frames and redundant representations are G.719 frames. When the use of other media coding schemes is desirable, one has to resort to RFC 2198.
The sender is responsible for selecting an appropriate amount of redundancy based on feedback about the channel conditions, e.g., in the RTP Control Protocol (RTCP) [RFC3550] (Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” July 2003.) receiver reports. The sender is also responsible for avoiding congestion, which may be exacerbated by redundancy (see Section 9 (Congestion Control) for more details).
4.3.2. Use of Frame Interleaving
To decrease protocol overhead, the payload design allows several audio transport frames to be encapsulated into a single RTP packet. One of the drawbacks of such an approach is that in case of packet loss several consecutive frames are lost. Consecutive frame loss normally renders error concealment less efficient and usually causes clearly audible and annoying distortions in the reconstructed audio. Interleaving of transport frames can improve the audio quality in such cases by distributing the consecutive losses into a number of isolated frame losses, which are easier to conceal. However, interleaving and bundling several frames per payload also increases end-to-end delay and sets higher buffering requirements. Therefore, interleaving is not appropriate for all use cases or devices. Streaming applications should most likely be able to exploit interleaving to improve audio quality in lossy transmission conditions.
Note that this payload design supports the use of frame interleaving as an option. The usage of this feature needs to be negotiated in the session setup.
The interleaving supported by this format is rather flexible. For example, a continuous pattern can be defined, as depicted in Figure 2 (An example of interleaving pattern that has constant delay).
--+--------+--------+--------+--------+--------+--------+--------+--
  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
--+--------+--------+--------+--------+--------+--------+--------+--

           [ p(n)   ]
  [ p(n+1) ]                 [ p(n+1) ]
                    [ p(n+2) ]                 [ p(n+2) ]
                                      [ p(n+3) ]
                                                        [ p(n+4) ]

Figure 2: An example of interleaving pattern that has constant delay
In Figure 2 (An example of interleaving pattern that has constant delay) the consecutive frames, denoted f(n-2) to f(n+4), are aggregated into packets p(n) to p(n+4), each packet carrying two frames. This approach provides an interleaving pattern that allows for constant delay in both the interleaving and de-interleaving processes. The de-interleaving buffer needs to have room for at least three frames, including the one that is ready to be consumed. The storage space for three frames is needed, for example, when f(n) is the next frame to be decoded: since frame f(n) was received in packet p(n+2), which also carried frame f(n+3), both these frames are stored in the buffer. Furthermore, frame f(n+1) received in the previous packet, p(n+1), is also in the de-interleaving buffer. Note also that in this example the buffer occupancy varies: when frame f(n+1) is the next one to be decoded, there are only two frames, f(n+1) and f(n+3), in the buffer.
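The buffer occupancy described in this example can be checked with a short simulation. The sketch assumes that packet p(n+j) carries frames f(n+2j-4) and f(n+2j-1), which is consistent with the text (p(n+2) carries f(n) and f(n+3); p(n+1) carries f(n+1)), and that two frames are played out between consecutive packet arrivals. The function names are hypothetical.

```python
def packet_frames(j):
    """Frame offsets (relative to n) carried by packet p(n+j).

    Assumed pattern: p(n+j) carries f(n+2j-4) and f(n+2j-1).
    """
    return (2 * j - 4, 2 * j - 1)


def occupancy_at_arrival(num_packets):
    """Return the de-interleaving buffer occupancy right after each packet."""
    buffered = set()
    occupancy = []
    for j in range(num_packets):
        buffered.update(packet_frames(j))
        occupancy.append(len(buffered))
        # Two frames are consumed before the next packet arrives; a frame
        # that was never received (startup only) is simply skipped.
        next_frame = 2 * j - 4
        buffered.discard(next_frame)
        buffered.discard(next_frame + 1)
    return occupancy


print(occupancy_at_arrival(6))
```

After the first packet the occupancy at each arrival settles at three frames, matching the minimum buffer size stated above.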
5. Payload format
The main purpose of the payload design for G.719 is to exploit the full potential of the codec with as little overhead as possible. The design includes both a basic and an interleaved mode because the codec is suitable for conversational and other low-delay applications as well as for streaming, where more delay is acceptable.
The main structural difference between the basic and interleaved modes is the extension of the table of contents entries with frame displacement fields in the interleaved mode. The basic mode supports aggregation of multiple consecutive frames in a payload. The interleaved mode supports aggregation of multiple frames that are non-consecutive in time. In both modes it is possible to have frames encoded with different frame types in the same payload.
The payload format also supports the usage of G.719 for carrying multi-channel content using one discrete encoder per channel, all using the same bit-rate. In this case a complete frame-block with data from all channels is included in the RTP payload. The data is the concatenation of all the encoded audio frames in the order specified for that number of included channels. Interleaving is also done on complete frame-blocks rather than on individual audio frames.
5.1. RTP Header Usage
The RTP timestamp corresponds to the sampling instant of the first sample encoded for the first frame-block in the packet. The timestamp clock frequency SHALL be 48000 Hz. The timestamp is also used to recover the correct decoding order of the frame-blocks.
The RTP header marker bit (M) SHALL be set to 1 whenever the first frame-block carried in the packet is the first frame-block in a talkspurt (see definition of the talkspurt in section 4.1 of [RFC3551] (Schulzrinne, H. and S. Casner, “RTP Profile for Audio and Video Conferences with Minimal Control,” July 2003.)). For all other packets the marker bit SHALL be set to zero (M=0).
The assignment of an RTP payload type for the format defined in this memo is outside the scope of this document. The RTP profiles currently in use mandate binding the payload type dynamically for this payload format. This is necessary because the payload type expresses the configuration of the payload itself, i.e. basic or interleaved mode and the number of channels carried.
The remaining RTP header fields are used as specified in RFC 3550 [RFC3550] (Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” July 2003.).
5.2. Payload Structure
The payload consists of one or more table of contents (ToC) entries followed by the audio data corresponding to the ToC entries. The following sections describe both the basic mode and the interleaved mode. Each ToC entry MUST be padded to a byte boundary to ensure octet alignment. The rules regarding maximum payload size given in Section 3.2 of [I‑D.ietf‑tsvwg‑udp‑guidelines] (Eggert, L. and G. Fairhurst, “Unicast UDP Usage Guidelines for Application Designers,” October 2008.) SHOULD be followed.
5.2.1. Basic ToC element
All the different formats and modes in this draft use a common basic ToC which may be extended in the different options described below.
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|F|    L    |R|R|
+-+-+-+-+-+-+-+-+

Figure 3: Basic TOC element
- F (1 bit):
- If set to 1, indicates that this ToC entry is followed by another ToC entry; if set to 0, indicates that this ToC entry is the last one in the ToC.
- L (5 bits):
- A field that gives the frame length of each individual frame within the frame-block.
L       length(bytes)
============================
0       0      NO_DATA
1-7     N/A    (reserved)
8-22    80+10*(L-8)
23-27   240+20*(L-23)
28-31   N/A    (reserved)
Figure 4: How to map L values to frame lengths
L=0 (NO_DATA) is used to indicate an empty frame. This is useful if frames are missing, e.g., at re-packetization, or to insert gaps when sending redundant frames together with primary frames in the same payload. The value ranges [1..7] and [28..31] inclusive are reserved for future use in this draft version; if these values occur in a ToC, the entire packet SHOULD be treated as invalid and discarded.
A few examples are given below where the frame size and the corresponding codec bitrate is computed based on the value L.
L     Bytes   Codec Bitrate(kbps)
===================================
8     80      32
9     90      36
10    100     40
12    120     48
16    160     64
22    220     88
23    240     96
25    280     112
27    320     128
Figure 5: Examples of L values and corresponding frame lengths
This encoding yields a granularity of 4 kbps between 32 and 88 kbps and a granularity of 8 kbps between 88 and 128 kbps, with a defined range of 32-128 kbps for the codec data.
- R (2 bits):
- Reserved bits. SHALL be set to 0 on sending and SHALL be ignored on reception.
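The mapping from the L field to a frame length, and from there to a codec bit rate, can be written down directly. The following is a minimal sketch of the rules in Figure 4, assuming 20 ms frames as stated in Section 3; the function names `frame_length` and `codec_bitrate_kbps` are hypothetical.

```python
def frame_length(l_value):
    """Return the frame length in bytes for a 5-bit L value (Figure 4)."""
    if l_value == 0:
        return 0  # NO_DATA
    if 8 <= l_value <= 22:
        return 80 + 10 * (l_value - 8)
    if 23 <= l_value <= 27:
        return 240 + 20 * (l_value - 23)
    # 1..7 and 28..31 are reserved; the packet should be treated as invalid.
    raise ValueError(f"reserved or out-of-range L value: {l_value}")


def codec_bitrate_kbps(l_value):
    """Codec bit rate in kbit/s: one frame of frame_length() bytes per 20 ms."""
    return frame_length(l_value) * 8 // 20


for l_value in (8, 12, 22, 23, 27):
    print(l_value, frame_length(l_value), codec_bitrate_kbps(l_value))
```

The printed rows can be compared against Figure 5; a step of 10 bytes per frame corresponds to 4 kbit/s and a step of 20 bytes to 8 kbit/s.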
5.3. Basic mode
The basic ToC element (Figure 3) is followed by a one-octet field for the number of frame-blocks (#frames) to form the ToC entry. The frame-blocks field tells how many frame-blocks of the same length the ToC entry relates to.
 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+
|    #frames    |
+-+-+-+-+-+-+-+-+

Figure 6: Number of frame-blocks field
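A receiver could walk the basic-mode ToC along the following lines. This is a sketch under the field layout given above (F in the most significant bit, then five bits of L, then two reserved bits, followed by the #frames octet); the function name `parse_basic_toc` and its return shape are hypothetical.

```python
def parse_basic_toc(payload):
    """Parse basic-mode ToC entries from the start of an RTP payload.

    Returns (entries, offset): entries is a list of (L, num_frame_blocks)
    tuples and offset is the index of the first audio data octet.
    """
    entries = []
    offset = 0
    while True:
        if offset + 2 > len(payload):
            raise ValueError("payload ends inside the ToC")
        first, num_blocks = payload[offset], payload[offset + 1]
        more_follows = bool(first & 0x80)  # F: bit 0 (most significant)
        l_value = (first >> 2) & 0x1F      # L: bits 1..5
        # R (bits 6..7) is reserved and ignored on reception.
        if 1 <= l_value <= 7 or l_value >= 28:
            raise ValueError("reserved L value; packet should be discarded")
        entries.append((l_value, num_blocks))
        offset += 2
        if not more_follows:
            return entries, offset


# ToC octets of a payload with two 80-byte frames and one 120-byte frame.
toc = bytes([0xA0, 0x02, 0x30, 0x01])
print(parse_basic_toc(toc))
```

The octets used here correspond to the bit pattern of the first payload example in Section 6.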
5.4. Interleaved mode
The basic ToC is followed by a one octet field for the number of frame-blocks (#frames) and then the DIS fields to form a ToC entry in interleaved mode. The frame-blocks field tells how many frame-blocks of the same length the ToC relates to. The DIS fields, one for each frame-block indicated by the #frames field, express the interleaving distance between audio frames carried in the payload. If necessary to achieve octet alignment, a 4-bit padding is added.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    #frames    |  DIS1 |  ...  |  DISi |  ...  |  DISn |  Padd |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7: Number of frame-block + interleave fields
- DIS1...DISn (4 bits):
- A list of n (n=#frames) displacement fields indicating the displacement of the i:th (i=1..n) audio frame-block relative to the preceding frame-block in the payload (in units of 20 ms long audio frame-blocks). The four-bit unsigned integer displacement values may be between 0 and 15, indicating the number of audio frame-blocks in decoding order between the (i-1):th and the i:th frame in the payload. Note that for the first ToC entry of the payload the value of DIS1 is meaningless. It SHALL be set to zero by a sender, and SHALL be ignored by a receiver. This frame-block's location in the decoding order is uniquely defined by the RTP timestamp. Note that for subsequent ToC entries DIS1 indicates the number of frames between the last frame of the previous group and the first frame of this group.
- Padd (4 bits):
- To ensure octet alignment, four padding bits SHALL be included at the end of the ToC entry in case there is an odd number of frame-blocks in the group referenced by this ToC entry. These bits SHALL be set to zero and SHALL be ignored by the receiver. If a group containing an even number of frames is referenced by this ToC entry, these padding bits SHALL NOT be included in the payload.
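The interleaved-mode ToC entry, including the DIS fields and the optional padding nibble, could be parsed roughly as follows. The sketch assumes the layout described above; `parse_interleaved_entry` and `block_positions` are hypothetical names, and `block_positions` only covers the first ToC entry of a payload, where DIS1 is ignored and the first frame-block is located by the RTP timestamp.

```python
def parse_interleaved_entry(payload, offset=0):
    """Parse one interleaved-mode ToC entry starting at `offset`.

    Returns (more_follows, L, dis_list, next_offset).
    """
    if offset + 2 > len(payload):
        raise ValueError("payload ends inside the ToC")
    first, num_blocks = payload[offset], payload[offset + 1]
    more_follows = bool(first & 0x80)   # F: most significant bit
    l_value = (first >> 2) & 0x1F       # L: next five bits
    offset += 2
    dis_octets = (num_blocks + 1) // 2  # two 4-bit fields per octet, padded
    if offset + dis_octets > len(payload):
        raise ValueError("payload ends inside the DIS fields")
    dis = []
    for i in range(num_blocks):
        octet = payload[offset + i // 2]
        dis.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
    return more_follows, l_value, dis, offset + dis_octets


def block_positions(dis, first_position=0):
    """Decoding-order positions (in 20 ms units) for the first ToC entry.

    DIS counts the frame-blocks lying between two consecutive blocks of
    the payload, so each position is the previous one plus DIS + 1.
    """
    positions = [first_position]
    for displacement in dis[1:]:
        positions.append(positions[-1] + displacement + 1)
    return positions


more, l_value, dis, nxt = parse_interleaved_entry(bytes([0x20, 0x04, 0x04, 0x44]))
print(more, l_value, dis, nxt, block_positions(dis, first_position=13))
```

Multiplying a position by 960 (20 ms at the 48000 Hz timestamp clock) gives the corresponding offset in RTP timestamp ticks.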
5.5. Audio Data
The audio data part follows the table of contents. All the octets comprising an audio frame SHALL be appended to the payload as a unit. For each frame-block the audio frames are concatenated in the order indicated by the table in Section 4.1 of [RFC3551] (Schulzrinne, H. and S. Casner, “RTP Profile for Audio and Video Conferences with Minimal Control,” July 2003.) for the number of channels configured for the payload type in use. So the first channel (left most) indicated comes first, followed by the next channel. The audio frame-blocks are packetized in increasing timestamp order within each group of frame-blocks (per ToC entry), i.e. oldest frame-block first. The groups of frame-blocks are packetized in the same order as their corresponding ToC entries.
The audio frames are specified in ITU recommendation [ITU‑T‑G719] (ITU-T, “Specification : ITU-T G.719 extension for 20 kHz fullband audio,” April 2008.).
The G.719 bit stream is split into a sequence of octets and transmitted in order from the left most (most significant–MSB) bit to the right most (least significant –LSB) bit.
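Given an already parsed ToC, the audio data part can be sliced into frame-blocks as sketched below. The helper name `split_audio_data` and the representation of ToC entries as (frame length in bytes, number of frame-blocks) tuples are assumptions made for illustration.

```python
def split_audio_data(audio, entries, channels=1):
    """Split the audio data part of a payload into frame-blocks.

    `entries` lists (frame_length_bytes, num_frame_blocks) tuples in ToC
    order. Returns a list of frame-blocks; each frame-block is a list of
    `channels` frames (bytes objects) in channel order.
    """
    blocks = []
    pos = 0
    for frame_len, num_blocks in entries:
        for _block in range(num_blocks):
            block = []
            for _channel in range(channels):
                block.append(audio[pos:pos + frame_len])
                pos += frame_len
            blocks.append(block)
    if pos != len(audio):
        # Size mismatch between ToC and audio data: do not decode.
        raise ValueError("audio data size does not match the ToC")
    return blocks


# Two stereo frame-blocks with 80-byte frames: L1, R1, L2, R2.
audio = b"".join(bytes([tag]) * 80 for tag in (1, 2, 3, 4))
blocks = split_audio_data(audio, [(80, 2)], channels=2)
print(len(blocks), [len(frame) for frame in blocks[0]])
```

The size check at the end rejects payloads whose length disagrees with the ToC, so that no frame is decoded from misparsed data.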
5.6. Implementation Considerations
An application implementing this payload format MUST understand all the payload parameters specified in this specification. Any mapping of the parameters to a signaling protocol MUST support all parameters. So an implementation of this payload format in an application using SDP is required to understand all the payload parameters in their SDP-mapped form. This requirement ensures that an implementation can always decide whether it is capable of communicating when the communicating entities support this version of the specification.
Basic mode SHALL be implemented and the interleaved mode SHOULD be implemented. The implementation burden of both is rather small, and supporting both ensures interoperability. However, interleaving is not mandated as it has limited applicability for conversational applications that require tight delay boundaries.
5.6.1. Receiving Redundant Frames
The reception of redundant audio frames, i.e. more than one audio frame from the same source for the same time slot, MUST be supported by the implementation. In the case that the receiver gets multiple audio frames in different bit-rates for the same time slot it is RECOMMENDED that the receiver keeps the one with the highest bit-rate.
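One way to apply this recommendation is to keep, for each RTP timestamp, the longest frame seen so far: since every frame covers 20 ms, a longer frame means a higher bit rate. The sketch below is hypothetical (the name `select_frames` and the input shape are not defined by this document) and ignores RTP timestamp wrap-around.

```python
def select_frames(received):
    """Pick one frame per timestamp from (timestamp, frame_bytes) pairs.

    When several copies exist for the same timestamp, the longest one
    (highest bit rate) is kept. Empty NO_DATA frames are skipped.
    Returns a dict ordered by increasing timestamp.
    """
    best = {}
    for timestamp, frame in received:
        if not frame:
            continue  # NO_DATA carries nothing to decode
        if timestamp not in best or len(frame) > len(best[timestamp]):
            best[timestamp] = frame
    return dict(sorted(best.items()))


received = [
    (960, b"a" * 80),   # redundant copy at 32 kbit/s
    (0, b"b" * 80),
    (960, b"c" * 120),  # primary copy at 48 kbit/s
    (0, b""),           # NO_DATA filler
]
for timestamp, frame in select_frames(received).items():
    print(timestamp, len(frame))
```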
5.6.2. Interleaving
The use of interleaving requires further considerations. As presented in the example in Section 4.3.2 (Use of Frame Interleaving), a given interleaving pattern requires a certain amount of de-interleaving buffer space. This buffer space, expressed as a number of transport frame slots, is indicated by the "interleaving" media type parameter. The number of frame slots needed can be converted into actual memory requirements by considering the 320 bytes per frame used by the highest bit-rate of G.719.
The information about the frame buffer size is not always sufficient to determine when it is appropriate to start consuming frames from the interleaving buffer. Additional information is needed when the interleaving pattern changes. The "int-delay" media type parameter is defined to convey this information. It allows a sender to indicate the minimal media time that needs to be present in the buffer before the decoder can start consuming frames from the buffer. Because the sender has full control over the interleaving pattern, it can calculate this value. In certain cases (for example, if joining a multicast session with interleaving mid-session), a receiver may initially receive only part of the packets in the interleaving pattern. This initial partial reception (in frame sequence order) of frames can yield too few frames for acceptable quality from the audio decoding. This problem also arises when using encryption for access control, and the receiver does not have the previous key. Although the G.719 is robust and thus tolerant to a high random frame erasure rate, it would have difficulties handling consecutive frame losses at startup. Thus, some special implementation considerations are described.
In order to handle this type of startup efficiently, decoding can start provided that:
After receiving a number of packets, in the worst case as many packets as the interleaving pattern covers, the previously described effects disappear and normal decoding is resumed. Similar issues arise when a receiver leaves a session or has lost access to the stream. If the receiver leaves the session, this would be a minor issue since playout is normally stopped. The sender can avoid this type of problem in many sessions by starting and ending interleaving patterns correctly when risks of losses occur. One such example is a key change done for access control to encrypted streams. If only some keys are provided to clients and there is a risk of them receiving content for which they do not have the key, it is recommended that interleaving patterns do not overlap key changes.
5.6.3. Decoding Validation
If the receiver finds a mismatch between the size of a received payload and the size indicated by the ToC of the payload, the receiver SHOULD discard the packet. This is recommended because decoding a frame parsed from a payload based on erroneous ToC data could severely degrade the audio quality.
6. Payload Examples
This section gives a few examples that illustrate the payload format.
6.1. 3 mono frames with 2 different bitrates
The first example is a payload consisting of 3 mono frames where the first 2 frames correspond to a bitrate of 32 kbps (80 bytes/frame) and the last to 48 kbps (120 bytes/frame).
The first 32 bits are ToC fields.
Bit 0 is '1' as another ToC field follows.
Bits 1..5 are 01000 = 80 bytes/frame
Bits 8..15 are 00000010 = 2 frame-blocks with 80 bytes/frame
Bit 16 is '0', no more ToC follows
Bits 17..21 are 01100 = 120 bytes/frame
Bits 24..31 are 00000001 = 1 frame-block with 120 bytes/frame

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|0 1 0 0 0|0 0|0 0 0 0 0 0 1 0|0|0 1 1 0 0|0 0|0 0 0 0 0 0 0 1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|d(0)                    frame 1                                |
.                                                               .
|                                                         d(639)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|d(0)                    frame 2                                |
.                                                               .
|                                                         d(639)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|d(0)                    frame 3                                |
.                                                               .
|                                                         d(959)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
6.2. 2 stereo frame-blocks of the same bitrate
The second example is a payload consisting of 2 stereo frame-blocks corresponding to a bitrate of 32 kbps (80 bytes/frame) per channel. The receiver calculates the number of audio frames in the payload by multiplying the value of the channels parameter (2) by the #frames field value (2), which gives 4 audio frames.
The first 16 bits are the ToC field.
Bit 0 is '0' as no ToC field follows.
Bits 1..5 are 01000 = 80 bytes/frame
Bits 8..15 are 00000010 = 2 frame-blocks with 80 bytes/frame

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|0 1 0 0 0|0 0|0 0 0 0 0 0 1 0|  d(0)  frame 1 left ch.       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
|                         d(639)|  d(0)  frame 1 right ch.      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
|                         d(639)|  d(0)  frame 2 left ch.       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
.                                                               .
|                         d(639)|  d(0)  frame 2 right ch.      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         d(639)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
6.3. 4 mono frames interleaved
The third example is a payload consisting of 4 mono frames, corresponding to a bitrate of 32 kbps (80 bytes/frame), in interleaved mode. The example uses an interleaving pattern that gives constant delay when aggregating 4 frames. The packet illustrated is packet n; the frame-block content of the preceding and following packets is listed to illustrate the pattern.
Packet n-3: 1, 6, 11, 16
Packet n-2: 5, 10, 15, 20
Packet n-1: 9, 14, 19, 24
Packet n: 13, 18, 23, 28
Packet n+1: 17, 22, 27, 32
Packet n+2: 21, 26, 31, 36

The first 32 bits are the ToC field.
Bit 0 is '0' as there is no ToC field following.
Bits 1..5 are 01000 = 80 bytes/frame
Bits 8..15 are 00000100 = 4 frame-blocks with 80 bytes/frame
Bits 16..19 are 0000 = DIS1 (0)
Bits 20..23 are 0100 = DIS2 (4)
Bits 24..27 are 0100 = DIS3 (4)
Bits 28..31 are 0100 = DIS4 (4)

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|0 1 0 0 0|0 0|0 0 0 0 0 1 0 0|0 0 0 0|0 1 0 0|0 1 0 0|0 1 0 0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| d(0)                   frame 13                               |
.                                                               .
|                                                         d(639)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| d(0)                   frame 18                               |
.                                                               .
|                                                         d(639)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| d(0)                   frame 23                               |
.                                                               .
|                                                         d(639)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| d(0)                   frame 28                               |
.                                                               .
|                                                         d(639)|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
7. Payload Format Parameters
This RTP payload format is identified using the media type audio/g719 which is registered in accordance with [RFC4855] (Casner, S., “Media Type Registration of RTP Payload Formats,” February 2007.) and using the template of [RFC4288] (Freed, N. and J. Klensin, “Media Type Specifications and Registration Procedures,” December 2005.).
7.1. Media Type Definition
The media type for the G.719 codec is allocated from the IETF tree since G.719 has the potential to become a widely used audio codec in general VoIP, teleconferencing and streaming applications. This media type registration covers real-time transfer via RTP.
Note, any unspecified parameter MUST be ignored by the receiver to ensure that additional parameters can be added in any future revision of this specification.
Type name: audio
Subtype name: g719
Required parameters: none
Optional parameters:
- interleaving:
- Indicates that interleaved mode SHALL be used for the payload. The parameter specifies the number of frame-block slots available in a de-interleaving buffer (including the frame that is ready to be consumed). Its value is equal to one plus the maximum number of frames that can precede any frame in transmission order and follow the frame in RTP timestamp order. The value MUST be greater than zero. If this parameter is not present, interleaved mode SHALL NOT be used.
- int-delay:
- The minimal media time delay in milliseconds that is needed to avoid underrun in the de-interleaving buffer before starting decoding, i.e., the difference in RTP timestamp ticks between the earliest and latest audio frame present in the de-interleaving buffer expressed in milliseconds. The value is a stream property and provided per source. The allowed values are 0 to the largest value expressible by an unsigned 16-bit integer (65535). Note that, in practice, the largest value that can be used is equal to the declared size of the interleaving buffer of the receiver. If the value is larger than the receiver buffer declared by or for the receiver, it defaults to the size of the receiver buffer. For sources for which this value has not been provided, the value defaults to the size of the receiver buffer. The format is a comma-separated list of SSRC ":" delay in ms pairs, which in ABNF (Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” January 2008.) [RFC5234] is expressed as:
int-delay = "int-delay:" source-delay *("," source-delay)
source-delay = SSRC ":" delay-value
SSRC = 1*8HEXDIG ; The 32-bit SSRC encoded in hex format
delay-value = 1*5DIGIT ; The delay value in milliseconds
Example: int-delay=ABCD1234:1000,4321DCB:640
NOTE: No white space is allowed in the parameter before the end of all the value pairs.
- max-red:
- The maximum duration in milliseconds that elapses between the primary (first) transmission of a frame and any redundant transmission that the sender will use. This parameter allows a receiver to have a bounded delay when redundancy is used. Allowed values are between 0 (no redundancy will be used) and 65535. If the parameter is omitted, no limitation on the use of redundancy is present.
- channels:
- The number of audio channels. The possible values (1-6) and their respective channel order are specified in Section 4.1 of [RFC3551] (Schulzrinne, H. and S. Casner, “RTP Profile for Audio and Video Conferences with Minimal Control,” July 2003.). If omitted, it has the default value of 1.
- CBR:
- Constant Bit Rate (CBR). Indicates the exact codec bit rate in bits per second (not including the overhead from packetization, RTP header, or lower layers) that the codec MUST use. CBR is to be used when a dynamic rate cannot be supported (e.g., in a gateway to H.320). CBR is mostly used for gateways to circuit-switched networks. Therefore, the CBR rate is the rate not including any FEC as specified in Section 4.3.1 (Use of Forward Error Correction (FEC)). If FEC is to be used, the b= parameter MUST be used to allow the extra bit rate needed to send the redundant information. It is RECOMMENDED that this parameter only be used when necessary to establish a working communication. The usage of this parameter has implications for congestion control that need to be considered; see Section 9 (Congestion Control).
- ptime:
- see [RFC4566] (Handley, M., Jacobson, V., and C. Perkins, “SDP: Session Description Protocol,” July 2006.).
- maxptime:
- see [RFC4566] (Handley, M., Jacobson, V., and C. Perkins, “SDP: Session Description Protocol,” July 2006.).
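As an illustration of the value constraints listed above, the following is a minimal Python sketch of how a receiver might validate the optional parameters. The function names (parse_int_delay, parse_g719_params) and the dictionary-based input are assumptions made for this example and are not defined by this specification. The sketch parses only the value part of int-delay (the comma-separated SSRC ":" delay pairs), ignores unknown parameters as required above, and does not apply the defaulting of int-delay to the receiver buffer size.

```python
import re

# One source-delay pair: 1*8HEXDIG ":" 1*5DIGIT, with no white space.
_SOURCE_DELAY = re.compile(r"[0-9A-Fa-f]{1,8}:[0-9]{1,5}")


def parse_int_delay(value):
    """Return {ssrc: delay_ms} for the value part of an int-delay parameter."""
    delays = {}
    for pair in value.split(","):
        if not _SOURCE_DELAY.fullmatch(pair):
            raise ValueError(f"malformed source-delay: {pair!r}")
        ssrc_hex, delay = pair.split(":")
        delay_ms = int(delay)
        if delay_ms > 65535:
            raise ValueError(f"delay out of range: {delay_ms}")
        delays[int(ssrc_hex, 16)] = delay_ms
    return delays


def _bounded_int(name, text, low, high):
    """Parse an integer parameter and check it against an inclusive range."""
    number = int(text)
    if not low <= number <= high:
        raise ValueError(f"{name} out of range: {number}")
    return number


def parse_g719_params(params):
    """Validate the known optional parameters; unknown ones are ignored."""
    result = {"channels": 1}  # default when "channels" is omitted
    if "interleaving" in params:
        size = int(params["interleaving"])
        if size <= 0:
            raise ValueError("interleaving must be greater than zero")
        result["interleaving"] = size
    if "int-delay" in params:
        result["int-delay"] = parse_int_delay(params["int-delay"])
    if "max-red" in params:
        result["max-red"] = _bounded_int("max-red", params["max-red"], 0, 65535)
    if "channels" in params:
        result["channels"] = _bounded_int("channels", params["channels"], 1, 6)
    return result
```

With the example value given above, parse_int_delay("ABCD1234:1000,4321DCB:640") yields a delay of 1000 ms for SSRC 0xABCD1234 and 640 ms for SSRC 0x4321DCB.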
Encoding considerations:
This media type is framed and binary, see section 4.8 in RFC4288 (Freed, N. and J. Klensin, “Media Type Specifications and Registration Procedures,” December 2005.) [RFC4288].
Security considerations:
See Section 10 (Security Considerations) of RFC XXXX.
Interoperability considerations:
Support of the interleaved mode is not mandatory and needs to be negotiated. See Section 7.2 (Mapping to SDP) for how to do that for SDP-based protocols.
Published specification:
RFC XXXX
Applications that use this media type:
Real-time audio applications like voice over IP and teleconference, and multi-media streaming.
Additional information: none
Person & email address to contact for further information:
Payload format: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
Intended usage: COMMON
Restrictions on usage:
This media type depends on RTP framing, and hence is only defined for transfer via RTP [RFC3550] (Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” July 2003.). Transport within other framing protocols is not defined at this time.
Author:
Ingemar Johansson <ingemar.s.johansson@ericsson.com>
Magnus Westerlund <magnus.westerlund@ericsson.com>
Change controller:
IETF Audio/Video Transport working group delegated from the IESG.
Additional Information:
File storage of G.719 encoded audio in ISO base media file format is specified in Annex A of [ITU‑T‑G719] (ITU-T, “Specification : ITU-T G.719 extension for 20 kHz fullband audio,” April 2008.). Thus media file formats such as MP4 (audio/mp4 or video/mp4) [RFC4337] (Y Lim and D. Singer, “MIME Type Registration for MPEG-4,” March 2006.) and 3GP (audio/3GPP and video/3GPP) [RFC3839] (Castagno, R. and D. Singer, “MIME Type Registrations for 3rd Generation Partnership Project (3GPP) Multimedia files,” July 2004.) can contain G.719 encoded audio.
The information carried in the media type specification has a specific mapping to fields in the Session Description Protocol (SDP) [RFC4566] (Handley, M., Jacobson, V., and C. Perkins, “SDP: Session Description Protocol,” July 2006.), which is commonly used to describe RTP sessions. When SDP is used to specify sessions employing the G.719 codec, the mapping is as follows:
The following considerations apply when using SDP Offer-Answer procedures to negotiate the use of G.719 payload in RTP:
In declarative usage, like SDP in RTSP [RFC2326] (Schulzrinne, H., Rao, A., and R. Lanphier, “Real Time Streaming Protocol (RTSP),” April 1998.) or SAP [RFC2974] (Handley, M., Perkins, C., and E. Whelan, “Session Announcement Protocol,” October 2000.), the parameters SHALL be interpreted as follows:
One media type (audio/g719) has been defined and needs registration in the media types registry; see Section 7.1 (Media Type Definition).
The general congestion control considerations for transporting RTP data apply; see RTP [RFC3550] (Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” July 2003.) and any applicable RTP profile like AVP [RFC3551] (Schulzrinne, H. and S. Casner, “RTP Profile for Audio and Video Conferences with Minimal Control,” July 2003.). However, the multi-rate capability of G.719 audio coding provides a mechanism that may help to control congestion, since the bandwidth demand can be adjusted (within the limits of the codec) by selecting a different encoding bit-rate.
The number of frames encapsulated in each RTP payload highly influences the overall bandwidth of the RTP stream due to header overhead constraints. Packetizing more frames in each RTP payload can reduce the number of packets sent and hence the header overhead, at the expense of increased delay and reduced error robustness. If forward error correction (FEC) is used, the amount of FEC-induced redundancy needs to be regulated such that the use of FEC itself does not cause a congestion problem.
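The trade-off between frames per packet and header overhead can be made concrete with a small calculation. The Python sketch below is illustrative only: the function name is hypothetical, the 20 ms frame duration and the 40-octet header size (IPv4 20 + UDP 8 + RTP 12, no CSRC entries) are assumptions stated here rather than values taken from this section, and the payload header and table-of-contents octets are deliberately left out.

```python
def stream_bitrate(codec_bps, frames_per_packet, frame_ms=20, header_bytes=40):
    """Approximate stream bit rate in bits/s, including per-packet headers.

    Assumptions (not taken from this section): 20 ms frames and 40 octets of
    IPv4/UDP/RTP headers per packet; payload header octets are ignored.
    """
    packets_per_second = 1000 / (frame_ms * frames_per_packet)
    overhead_bps = packets_per_second * header_bytes * 8
    return codec_bps + overhead_bps


# Compare the header overhead for 1, 2, and 4 frames per packet at 32 kbit/s.
for frames in (1, 2, 4):
    total = stream_bitrate(32000, frames)
    print(frames, "frame(s) per packet:", total, "bit/s")
```

Under these assumptions, a 32 kbit/s encoding costs 48 kbit/s on the wire with one frame per packet, 40 kbit/s with two, and 36 kbit/s with four, while each additional frame adds one frame duration of packetization delay.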
The CBR signalling parameter allows a receiver to lock down an RTP payload type to use a single encoding rate. As this prevents the codec rate from being lowered when congestion is experienced, the sender is constrained to either change the packetization or abort the transmission. Since these responses to congestion are severely limited, implementations SHOULD NOT use the CBR parameter unless they are interacting with a device that cannot support variable bit rate (e.g., a gateway to H.320 systems). When using CBR mode, a receiver MUST monitor the packet loss rate to ensure congestion is not caused, following the guidelines in Section 2 of RFC 3551.
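One possible way for a receiver to observe the packet loss rate is to compare the span of RTP sequence numbers seen in a measurement interval with the number of distinct packets received. The Python sketch below is a simplified illustration under stated assumptions, not a procedure defined by this specification: the function name is hypothetical, the input is the list of 16-bit sequence numbers in arrival order, and losses before the first or after the last received packet of the interval go undetected.

```python
def loss_fraction(seqs):
    """Fraction of packets lost within an interval.

    `seqs` holds the 16-bit RTP sequence numbers received, in arrival order.
    Sequence numbers are extended past 65535 so that wrap-around is handled.
    """
    if not seqs:
        return 0.0
    extended = []
    highest = seqs[0]
    for seq in seqs:
        # Signed 16-bit distance from the highest extended number seen so far.
        delta = (seq - highest) & 0xFFFF
        if delta >= 0x8000:
            delta -= 0x10000
        current = highest + delta
        extended.append(current)
        highest = max(highest, current)
    expected = max(extended) - min(extended) + 1
    received = len(set(extended))  # duplicates count once
    return (expected - received) / expected
```

For example, receiving sequence numbers 65534, 65535, 0, and 2 spans five expected packets, of which four arrived, giving a loss fraction of 0.2.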
RTP packets using the payload format defined in this specification are subject to the general security considerations discussed in RTP [RFC3550] (Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” July 2003.) and any applicable profile such as AVP [RFC3551] (Schulzrinne, H. and S. Casner, “RTP Profile for Audio and Video Conferences with Minimal Control,” July 2003.) or SAVP [RFC3711] (Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. Norrman, “The Secure Real-time Transport Protocol (SRTP),” March 2004.). As this format transports encoded audio, the main security issues include confidentiality, integrity protection, and data origin authentication of the audio itself. The payload format itself does not have any built-in security mechanisms. Any suitable external mechanisms, such as SRTP [RFC3711] (Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. Norrman, “The Secure Real-time Transport Protocol (SRTP),” March 2004.), MAY be used.
This payload format and the G.719 decoder do not exhibit any significant non-uniformity in the receiver-side computational complexity for packet processing, and thus are unlikely to pose a denial-of-service threat due to the receipt of pathological data. The payload format or the codec data does not contain any type of active content such as scripts.
In order to ensure confidentiality of the encoded audio, all audio data bits MUST be encrypted. There is less need to encrypt the payload header or the table of contents since they only carry information about the frame type. This information could also be useful to a third party, for example, for quality monitoring.
The use of interleaving in conjunction with encryption can have a negative impact on confidentiality for a short period of time. Consider the following packets (in braces) containing frame numbers as indicated: {10, 14, 18}, {13, 17, 21}, {16, 20, 24} (a popular continuous diagonal interleaving pattern). The originator wishes to deny some participants the ability to hear material starting at time 16. Simply changing the key on the packet with the timestamp at or after 16, and denying that new key to those participants, does not achieve this; frames 17, 18, and 21 have been supplied in prior packets under the prior key, and error concealment may make the audio intelligible at least as far as frame 18 or 19, and possibly further.
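The effect described in this example can be checked mechanically. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that a packet's timestamp corresponds to the earliest frame it carries, so the key change takes effect at the first packet whose earliest frame is at or after the cutoff.

```python
def frames_sent_under_old_key(packets, cutoff):
    """Frames at or after `cutoff` already sent before the key change.

    `packets` lists the frame numbers carried by each packet, in
    transmission order. The key is assumed to change at the first packet
    whose earliest frame is at or after `cutoff`.
    """
    leaked = []
    for packet in packets:
        if min(packet) >= cutoff:
            break  # this packet and all later ones use the new key
        leaked.extend(frame for frame in packet if frame >= cutoff)
    return sorted(leaked)


# The interleaving pattern from the example above, with the cutoff at 16.
pattern = [[10, 14, 18], [13, 17, 21], [16, 20, 24]]
print(frames_sent_under_old_key(pattern, 16))  # prints [17, 18, 21]
```

A sender that needs a clean cutoff would have to account for these frames, for instance by changing the key earlier or by not interleaving across the boundary.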
To authenticate the sender of the audio-stream, an external mechanism MUST be used. It is RECOMMENDED that such a mechanism protects both the complete RTP header and the payload (audio and data bits). Data tampering by a man-in-the-middle attacker could replace audio content and also result in erroneous depacketization/decoding that could lower the audio quality.
The authors would like to thank Roni Even and Anisse Taleb for their help with this draft. We would also like to thank the people who have provided feedback: Colin Perkins, Mark Baker, and Stephen Botzko.
[I-D.ietf-tsvwg-udp-guidelines] | Eggert, L. and G. Fairhurst, “Unicast UDP Usage Guidelines for Application Designers,” draft-ietf-tsvwg-udp-guidelines-11 (work in progress), October 2008. |
[ITU-T-G719] | ITU-T, “Specification : ITU-T G.719 extension for 20 kHz fullband audio,” April 2008. |
[RFC2119] | Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels,” BCP 14, RFC 2119, March 1997. |
[RFC3264] | Rosenberg, J. and H. Schulzrinne, “An Offer/Answer Model with Session Description Protocol (SDP),” RFC 3264, June 2002. |
[RFC3550] | Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” STD 64, RFC 3550, July 2003. |
[RFC3551] | Schulzrinne, H. and S. Casner, “RTP Profile for Audio and Video Conferences with Minimal Control,” STD 65, RFC 3551, July 2003. |
[RFC4566] | Handley, M., Jacobson, V., and C. Perkins, “SDP: Session Description Protocol,” RFC 4566, July 2006. |
[RFC5234] | Crocker, D. and P. Overell, “Augmented BNF for Syntax Specifications: ABNF,” STD 68, RFC 5234, January 2008. |
Magnus Westerlund | |
Ericsson AB | |
Torshamnsgatan 21-23 | |
SE-164 83 Stockholm | |
SWEDEN | |
Phone: | +46 8 7190000 |
Email: | magnus.westerlund@ericsson.com |
Ingemar Johansson | |
Ericsson AB | |
Laboratoriegrand 11 | |
SE-971 28 Lulea | |
SWEDEN | |
Phone: | +46 73 0783289 |
Email: | ingemar.s.johansson@ericsson.com |
Copyright © The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an “AS IS” basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.