AVTCORE | J. Lennox |
Internet-Draft | Vidyo |
Updates: 3550 (if approved) | M. Westerlund |
Intended status: Standards Track | Ericsson |
Expires: January 08, 2013 | July 09, 2012 |
Real-Time Transport Protocol (RTP) Considerations for Endpoints Sending Multiple Media Streams
draft-lennox-avtcore-rtp-multi-stream-00
This document expands and clarifies the behavior of the Real-Time Transport Protocol (RTP) endpoints when they are sending multiple media streams in a single RTP session. In particular, issues involving Real-Time Transport Control Protocol (RTCP) messages are described.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 08, 2013.
Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
At the time The Real-Time Tranport Protocol (RTP) [RFC3550] was originally written, and for quite some time after, endpoints in RTP sessions typically only transmitted a single media stream per RTP session, where separate RTP sessions were typically used for each distinct media type.
Recently, however, a number of scenarios have emerged (discussed further in Section 3) in which endpoints wish to send multiple RTP media streams, distinguished by distinct RTP synchronization source (SSRC) identifiers, in a single RTP session. Although RTP's initial design did consider such scenarios, the specification was not consistently written with such use cases in mind. The specifications are thus somewhat unclear.
The purpose of this document is to expand and clarify [RFC3550]'s language for these use cases. The authors believe this does not result in any major normative changes to the RTP specification, however this document defines how the RTP specification shall be interpreted. In these cases, this document updates RFC3550.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119] and indicate requirement levels for compliant implementations.
This section discusses several use cases that have motivated the development of endpoints that send multiple streams in a single RTP session.
The most straightforward motivation for an endpoint to send multiple media streams in a session is the scenario where an endpoint has multiple capture devices of the same media type and characteristics. For example, telepresence endpoints, of the type described by the CLUE Telepresence Framework [I-D.ietf-clue-framework] is designed, often have multiple cameras or microphones covering various areas of a room.
Recent work has been done in RTP [I-D.westerlund-avtcore-multi-media-rtp-session] and SDP [I-D.ietf-mmusic-sdp-bundle-negotiation] to update RTP's historical assumption that media streams of different media types would always be sent on different RTP sessions. In this work, a single endpoint's audio and video media streams (for example) are instead sent in a single RTP session.
There are several RTP topologies which can involve a central box which itself generates multiple media streams in a session.
One example is a mixer providing centralized compositing for a multi-capturer scenario like the one described in Section 3.1. In this case, the centralized node is behaving much like a multi-capturer endpoint, generating several similar and related sources.
More complicated is the Source Projecting Mixer, which is a central box that receives media streams from several endpoints, and then selectively forwards modified versions of some of the streams toward the other endpoints it is connected to. Toward one destination, a separate media source appears in the session for every other source connected to the mixer, "projected" from the original streams, but at any given time many of them may appear to be inactive (and thus receivers, not senders, in RTP). This box is an RTP mixer, not an RTP translator, in that it terminates RTCP reporting about the mixed streams, and it can re-write SSRCs, timestamps, and sequence numbers, as well as the contents of the RTP payloads, and can turn sources on and off at will without appearing to be generating packet loss. Each projected stream will typically preserve its original RTCP source description (SDES) information.
While an endpoint MUST (of course) stay within its share of the available session bandwidth, as determined by signalling and congestion control, this need not be applied independently or uniformly to each media stream. In particular, session bandwidth MAY be reallocated among an endpoint's media streams, for example by varying the bandwidth use of a variable-rate codec, or changing the codec used by the media stream, up to the constraints of the session's negotiated (or declared) codecs. This includes enabling or disabling media streams as more or less bandwidth becomes available.
The Real-Time Transport Control Protocol (RTCP) is defined in Section 6 of [RFC3550], but it is largely documented in terms of "participants". For multi-media-stream endpoints, it is generally most useful to interpret the specification such that each media stream is a separate "participant".
For each of an endpoint's media media streams, whether or not it is currently being sent, SR/RR and SDES packets MUST be sent at least once per RTCP report interval. (For discussion of the content of SR or RR packets' reception statistic reports, see Section 5.1.)
When a new media stream is added to a unicast session, the sentence in [RFC3550]'s Section 6.2 applies: "For unicast sessions ... the delay before sending the initial compound RTCP packet MAY be zero." Thus, endpoints MAY send an initial RTCP packet for the media stream immediately upon adding to the session.
Similarly, [RFC3550] Section 6.1 gives the following advice to RTP translators and mixers:
Feedback [RFC4585] or Extended Report (XR) [RFC3611] packet. A compound packet is the combination of two or more such RTCP packets where the first packet must be an SR or an RR packet, and which contains a SDES packet containing an CNAME item. Thus the above results in compound RTCP packets that contain multiple SR or RR packets from different sources as well as any of the other packet types. There are no restrictions on the order the packets may occur within the compound packet, except the regular compound rule, i.e. starting with an SR or RR.
Note: To avoid confusion, an RTCP packet is an individual item, such as an Sender Report (SR), Receiver Report (RR), Source Description (SDES), Goodbye (BYE), Application Defined (APP),
This advice applies to multi-media-stream endpoints as well, with the same restrictions and considerations. (Note, however, that the last sentence does not apply to AVPF [RFC4585] or SAVPF [RFC5124] feedback packets if Reduced-Size RTCP [RFC5506] is in use.)
Open Issue: Any clarifications on how one handle the scheduling of RTCP transmissions when having multiple sources? Alternatives include delaying one source to the next source's transmission, or to group multiple sources to use only one scheduling.
As required by [RFC3550], an endpoint MUST send reception reports about every active media stream it is receiving, from at least one local source.
However, a naive application of the RTP specification's rules could be quite inefficient. In particular, if a session has N media sources (active and inactive), and had S senders in each reporting interval, there would either be N*S report blocks per reporting interval, or (per the round-robinning recommendations of [RFC3550] Section 6.1) reception sources would be unnecessarily round-robinned. In a session where most media sources become senders reasonably frequently, this results in quadratically many reception report blocks in the conference, or reporting delays proportional to the number of session members.
Since traffic is received by endpoints, however, rather than by media sources, there is not actually any need for this quadratic expansion. All that is needed is for each endpoint to report all the remote sources it is receiving.
Thus, an endpoint SHOULD NOT send reception reports from one of its own media sources about another one of its own ("self-reports"). Similarly, an endpoint with multiple media sources SHOULD NOT send reception reports about a remote media source from more than one of its local sources ("cross-reports"). Instead, it SHOULD pick one of its local media sources as the "reporting" source for each remote media source, and use it to send reception reports for that remote source; all its other media sources SHOULD NOT send any reception reports for that remote media source.
An endpoint MAY choose different local media sources as the reporting source for different remote media sources (for example, it could choose to send reports about remote audio sources from its local audio source, and reports about remote video sources from its local video source), or it MAY choose a single local source for all its reports. If the reporting source leaves the session (sends BYE), another reporting source MUST be chosen. This "reporting" source SHOULD also be the source for any AVPF feedback messages about its remote sources, as well.
The RTCP traffic generated by receivers following the rules in Section 5.1 might appear, to observers unaware of the recommendations of this specification or knowledge about which end-points are associated with which SSRCs, to be generated by receivers who are experiencing a network disconnection.
This could be a potentially critical problem when one uses RTCP for congestion control, as a sender might think that it is sending so much traffic that it is causing complete congestion collapse. At the same time, however, a congestion control solution is likely not interested in performing unecessary processing based on multiple reporting sources having identical statistics. A congestion control algorithm is likely more interested in frequent reporting from one specific source than multiple sources at the same end-point based on common statistics. That would reduce the uncertainty that sources are from the same end-point, and likely improve the interarrival time of the reporting, compared to multiple SSRCs which, by the RTCP algorithm, are deliberately desynchronized. However, this would clearly require clarifications on how the RTCP timer rules are to be treated.
However, such an interpretation of the session statistics would require a fairly sophisticated RTCP analysis. Any receiver of RTCP statistics which is just interested in information about itself needs to be prepared that any given reception report might not contain information about a specific media source, because reception reports in large conferences can be round-robined.
Thus, it is unclear to what extent this restriction would actually cause trouble in practice.
If there are indeed scenarios in which the rules of Section 5.1 do cause troubles, an alternative solution would be to explicitly signal, in RTCP, which groups of media sources originate from a single endpoint. Thus, within a group of sources, receivers could know that there would not be self-reports, and only a single SSRC would be providing cross-reports. In such a mode, the signaling protocol would need to negotiate, or declare, that the mode was in use.
The next question would be to determine how to indicate the groups of sources for this purpose. The sources' CNAMEs would probably not be sufficient, as some of the use cases described in Section 3, notably the source-projecting mixer, result in a single endpoint generating sources with multiple CNAME values. Thus, a new SDES item would be needed for these purposes.
TBD: If this solution is indeed taken, define the specifics of this SDES item, and the signaling needed to indicate its use.
In the secure RTP protocol (SRTP) [RFC3711], the cryptographic context of a compound SRTCP packet is the SSRC of the sender of the first RTCP (sub-)packet. This could matter in some cases, especially for keying mechanisms such as Mikey [RFC3830] which use per-SSRC keying.
Other than that, the standard security considerations of RTP apply; sending multiple media streams from a single endpoint does not appear to have different security consequences than sending the same number of streams.
At this stage this document contains a number of open issues. The below list tries to summarize the issues:
This document makes no requests of IANA.
Note to the RFC Editor: please remove this section before publication.
(Note: This section may change if the alternative proposal of Section 5.3 is adopted.)
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |
[RFC3550] | Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003. |
[RFC3711] | Baugher, M., McGrew, D., Naslund, M., Carrara, E. and K. Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC 3711, March 2004. |
[RFC4585] | Ott, J., Wenger, S., Sato, N., Burmeister, C. and J. Rey, "Extended RTP Profile for Real-time Transport Control Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585, July 2006. |
[RFC5124] | Ott, J. and E. Carrara, "Extended Secure RTP Profile for Real-time Transport Control Protocol (RTCP)-Based Feedback (RTP/SAVPF)", RFC 5124, February 2008. |
[RFC5506] | Johansson, I. and M. Westerlund, "Support for Reduced-Size Real-Time Transport Control Protocol (RTCP): Opportunities and Consequences", RFC 5506, April 2009. |
[I-D.ietf-mmusic-sdp-bundle-negotiation] | Holmberg, C and H Alvestrand, "Multiplexing Negotiation Using Session Description Protocol (SDP) Port Numbers", Internet-Draft draft-ietf-mmusic-sdp-bundle-negotiation-00, February 2012. |
[I-D.ietf-clue-framework] | Romanow, A, Duckworth, M, Pepperell, A and B Baldino, "Framework for Telepresence Multi-Streams", Internet-Draft draft-ietf-clue-framework-02, January 2012. |
[RFC3830] | Arkko, J., Carrara, E., Lindholm, F., Naslund, M. and K. Norrman, "MIKEY: Multimedia Internet KEYing", RFC 3830, August 2004. |
[RFC3611] | Friedman, T., Caceres, R. and A. Clark, "RTP Control Protocol Extended Reports (RTCP XR)", RFC 3611, November 2003. |
[I-D.westerlund-avtcore-multi-media-rtp-session] | Westerlund, M., Perkins, C and J. Lennox, "Multiple Media Types in an RTP Session", Internet-Draft draft-westerlund-avtcore-multi-media-rtp-session-00, July 2012. |