MMUSIC A. B. Roach
Internet-Draft Mozilla
Intended status: Informational April 11, 2013
Expires: October 13, 2013

Thoughts on syntax for representing multiple media streams

Abstract

This document briefly explores the ramifications of combining multiple media streams into one SDP m= section versus expressing each in its own m= section.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on October 13, 2013.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

As part of the ongoing RTCWEB and CLUE work, it has become clear that the current mechanisms in SDP are insufficient for describing complex sessions with multiple streams. Two competing schools of thought have emerged. One holds that the m= lines should apply to RTP sessions, regardless of how many media streams they contain. Another holds that m= lines should apply to media streams exclusively, and that an additional mechanism should be applied to combine multiple streams into a single RTP session, if necessary.

2. Alternatives

2.1. Alternative 1: Multiple streams per m= section

One approach to specifying multiple streams in a single RTP session is to put information for several streams into a single m= section and, by doing so, implicitly combine them into a single session.

To maintain some level of backwards compatibility with SDP, this approach might choose to have one m= section for audio and a second for video (with additional m= sections for other media types if they are used in the future), combining those sections with a=group:BUNDLE [I-D.ietf-mmusic-sdp-bundle-negotiation]; we will call this "Alternative 1a". An alternate approach would be the definition of a new media type which effectively allows transmission of any kind of media, thereby avoiding the need to bundle multiple sections together at all. A syntax for such an approach is proposed by [I-D.holmberg-mmusic-sdp-mmt-negotiation]. We will call this "Alternative 1b".

In both of the cases described above, certain SDP attributes might be targeted at only one of the streams in an RTP session. These attributes can be matched up with individual streams using the "a=ssrc" extension defined in [RFC5576].

For "Alternative 1a", we have the additional challenge of specifying attributes that apply to the entire RTP session, such as a=rtcp-fb and ICE candidate parameters. One approach would be inclusion of such parameters only in the first m= section within a bundle, with the implication that they apply to the entire session.

2.1.1. Alternative 1a: One section per RTP session per type

v=0
o=- 2890844526 2890844526 IN IP4 host.example.com
s=
c=IN IP4 host.example.com
t=0 0
a=group:BUNDLE c1 c2
m=audio 10000 RTP/AVP 0 8 97
a=mid:c1
a=candidate:0 1 UDP 2113601791 192.0.2.240 51091 typ host
a=candidate:1 1 UDP 1694194431 198.51.100.32 51091 typ srflx raddr 192.0.2.240 rport 51091
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
a=ssrc:11111 label:speaker-audio
a=ssrc:22222 label:floor-mic
m=video 10000 RTP/AVP 31 32
a=mid:c2
a=rtpmap:31 H261/90000
a=rtpmap:32 MPV/90000
a=ssrc:33333 label:speaker-video
a=ssrc:44444 label:slides
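
As an illustration of how a receiver might match per-stream attributes to streams under Alternative 1a, the following is a minimal sketch in Python. The function name ssrc_attributes and the result layout are hypothetical; the sketch assumes the "a=ssrc:<ssrc-id> <attribute>:<value>" form from [RFC5576] and at most one m= section per media type, as in the example above.

```python
def ssrc_attributes(sdp):
    """Map each m= section's media type to {ssrc: {attribute: value}}.

    Hypothetical helper: assumes one m= section per media type, as in
    Alternative 1a, so the media type is a usable key.
    """
    result = {}
    current = None  # per-SSRC attributes of the m= section being read
    for line in sdp.strip().splitlines():
        line = line.strip()
        if line.startswith("m="):
            media = line[2:].split()[0]
            current = result.setdefault(media, {})
        elif line.startswith("a=ssrc:") and current is not None:
            # a=ssrc:<ssrc-id> <attribute>[:<value>]
            ssrc_id, _, rest = line[len("a=ssrc:"):].partition(" ")
            name, _, value = rest.partition(":")
            current.setdefault(int(ssrc_id), {})[name] = value
    return result


# Abbreviated copy of the Alternative 1a example.
EXAMPLE_1A = """\
m=audio 10000 RTP/AVP 0 8 97
a=mid:c1
a=ssrc:11111 label:speaker-audio
a=ssrc:22222 label:floor-mic
m=video 10000 RTP/AVP 31 32
a=mid:c2
a=ssrc:33333 label:speaker-video
a=ssrc:44444 label:slides
"""

streams = ssrc_attributes(EXAMPLE_1A)
```

Here streams["audio"] holds the two audio sources and streams["video"] the two video sources, each keyed by SSRC. Any attribute that is not wrapped in a=ssrc remains ambiguous as to which stream it describes.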

2.1.2. Alternative 1b: One section per RTP session

v=0
o=- 2890844526 2890844526 IN IP4 host.example.com
s=
c=IN IP4 host.example.com
t=0 0
a=group:MMT foo bar zoe
m=anymedia 10000 RTP/AVP 0 8 97 31 32
a=candidate:0 1 UDP 2113601791 192.0.2.240 51091 typ host
a=candidate:1 1 UDP 1694194431 198.51.100.32 51091 typ srflx raddr 192.0.2.240 rport 51091
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
a=rtpmap:31 H261/90000
a=rtpmap:32 MPV/90000
a=mmtype:0 audio
a=mmtype:8 audio
a=mmtype:97 audio
a=mmtype:31 video
a=mmtype:32 video
a=ssrc:11111 label:speaker-audio
a=ssrc:22222 label:floor-mic
a=ssrc:33333 label:speaker-video
a=ssrc:44444 label:slides

2.2. Alternative 2: Single stream per m= section

An alternate proposal is to constrain each m= section to describe a single media stream. Like alternative 1a, above, the BUNDLE extension is used to combine several m= sections into a single RTP session. Any attributes that are applicable to a single media stream can be correlated by putting them in the corresponding m= section. Any attributes that apply to the transport parameters (e.g., rtcp-fb, ICE parameters) are conveyed in the first m= section within the bundle (alternate schemes are possible, but this seems the simplest and most straightforward).

v=0
o=- 2890844526 2890844526 IN IP4 host.example.com
s=
c=IN IP4 host.example.com
t=0 0
a=group:BUNDLE c1 c2 c3 c4
m=audio 10000 RTP/AVP 0 8 97
a=mid:c1
a=label:speaker-audio
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
a=candidate:0 1 UDP 2113601791 192.0.2.240 51091 typ host
a=candidate:1 1 UDP 1694194431 198.51.100.32 51091 typ srflx raddr 192.0.2.240 rport 51091
m=audio 10000 RTP/AVP 0 8 97
a=mid:c2
a=label:floor-mic
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
m=video 10000 RTP/AVP 31 32
a=mid:c3
a=label:speaker-video
a=rtpmap:31 H261/90000
a=rtpmap:32 MPV/90000
m=video 10000 RTP/AVP 31 32
a=mid:c4
a=label:slides
a=rtpmap:31 H261/90000
a=rtpmap:32 MPV/90000
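
The "first m= section carries the transport attributes" convention could be applied by a receiver roughly as follows. This is a sketch only: the helper names are hypothetical, "first" is assumed to mean the first mid listed in a=group:BUNDLE, and the set of attributes treated as transport-level (a=candidate, a=rtcp-fb) is an assumption drawn from the examples in the text.

```python
def parse_sections(sdp):
    """Split SDP into session-level lines and a list of m= sections."""
    session, sections = [], []
    for line in sdp.strip().splitlines():
        line = line.strip()
        if line.startswith("m="):
            sections.append([line])
        elif sections:
            sections[-1].append(line)
        else:
            session.append(line)
    return session, sections


def mid_of(section):
    """Return the a=mid value of an m= section, or None if absent."""
    for line in section:
        if line.startswith("a=mid:"):
            return line[len("a=mid:"):]
    return None


def bundle_transport(sdp, prefixes=("a=candidate:", "a=rtcp-fb:")):
    """Map every bundled mid to the transport lines of the first bundled section."""
    session, sections = parse_sections(sdp)
    bundle = []
    for line in session:
        if line.startswith("a=group:BUNDLE "):
            bundle = line.split()[1:]
            break
    by_mid = {mid_of(s): s for s in sections}
    if not bundle or bundle[0] not in by_mid:
        return {}
    # Assumption: the first mid in the group carries the shared attributes.
    shared = [l for l in by_mid[bundle[0]] if l.startswith(prefixes)]
    return {mid: shared for mid in bundle if mid in by_mid}


# Abbreviated two-stream variant of the Alternative 2 example.
EXAMPLE_2 = """\
v=0
a=group:BUNDLE c1 c2
m=audio 10000 RTP/AVP 0
a=mid:c1
a=candidate:0 1 UDP 2113601791 192.0.2.240 51091 typ host
m=audio 10000 RTP/AVP 0
a=mid:c2
"""

transport = bundle_transport(EXAMPLE_2)
```

With this convention, the section identified by c2 inherits the candidate line that appears only under c1.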

2.3. Pros and Cons

2.3.1. Codec Selection

Currently, in SDP and the various documents that rely on it (such as [RFC3264]), there are certain assumptions made about the ordinality of streams to m= sections. Consider, for example, wanting to convey two audio streams with a low-bandwidth voice codec preferred for one, but a high-quality codec preferred for the other. RFC 3264 has rules indicating that codecs are conveyed in the order of their preference. With alternative 2, it is trivial to provide different ordering (or even a different set) of codecs to achieve such a goal. Alternatives 1a and 1b lack the ability to do so without additional extensions.
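
To make the codec-ordering point concrete, the following sketch reads the preference order out of each m= section. The function name codec_preferences and the two-section SDP fragment are illustrative assumptions: the fragment reuses the codecs from the examples above, listing the lower-bitrate iLBC first for one stream and PCMU first for the other.

```python
def codec_preferences(sdp):
    """Return {mid: [codec names in order of preference]} per m= section."""
    sections = []
    for line in sdp.strip().splitlines():
        line = line.strip()
        if line.startswith("m="):
            # m=<media> <port> <proto> <fmt> ...
            # Formats are listed in order of preference (RFC 3264).
            sections.append({"mid": None, "formats": line.split()[3:], "names": {}})
        elif sections and line.startswith("a=mid:"):
            sections[-1]["mid"] = line[len("a=mid:"):]
        elif sections and line.startswith("a=rtpmap:"):
            payload, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            sections[-1]["names"][payload] = encoding.split("/")[0]
    return {
        s["mid"]: [s["names"].get(fmt, fmt) for fmt in s["formats"]]
        for s in sections
    }


# Hypothetical Alternative 2 fragment: same codecs, different ordering.
TWO_AUDIO = """\
m=audio 10000 RTP/AVP 97 0 8
a=mid:c1
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
m=audio 10000 RTP/AVP 0 8 97
a=mid:c2
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:97 iLBC/8000
"""

prefs = codec_preferences(TWO_AUDIO)
```

Because each stream has its own m= line, each carries its own format list. With a single m= line shared by several streams, there is only one list to order.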

This set of facts supports alternative 2 in preference to alternatives 1a and 1b.

2.3.2. Port Number Handling

When multiple sections are used to represent a single session, we need to make a decision regarding the port number conveyed in the m= line itself. One option is to use the same port number in all related m= sections. According to Cullen Jennings, this interacts very poorly with existing implementations that use SDP. The other alternative is to indicate bogus port numbers in all (or all but one) of the m= lines. According to Hadriel Kaplan, this usage will lead to certain media intermediaries destroying the session when they determine that a signaled port is going unused.

Alternative 1b avoids this problem altogether by having only one m= per IP/port combination, thereby completely sidestepping the question of what to put in subsequent m= lines.

This set of facts supports alternative 1b in preference to alternatives 1a and 2.

2.3.3. Attribute handling

Attributes that appear inside m= sections can be generally broken down into three categories: those intended to apply to a single media stream (e.g., framerate); those intended to apply to an RTP session (e.g., rtcp-fb); and those that are explicitly bound to the m= line itself (e.g., rtpmap). By and large, these attributes have been defined with an assumption that each RTP session has one stream and vice-versa.

By specifying a model that breaks this one-to-one correspondence, we have created the need to be able to designate a specific media stream within an RTP session (for alternatives 1a and 1b), or the need to be able to talk about session-level attributes (for alternatives 1a and 2).

Alternatives 1a and 1b can perform stream-level designation through the use of the ssrc attribute specified in [RFC5576]. Alternatives 1a and 2 can apply a convention that any RTP-session-level attributes are placed in the first m= section in a bundle (although other, more complicated approaches may also be possible).

Note, in particular, that alternative 1a inherits both problems: the need to designate attributes as applying to a single stream, and the need to talk about session-level attributes when multiple m= lines are bundled together.

This set of facts supports alternatives 1b and 2 in preference to alternative 1a.

2.3.4. What We're Unaware of Not Knowing

It is worth noting that the problem described in Section 2.3.1 was not discovered for quite a long time after the discussion of multiple media streams had begun. In the characterization of "known knowns," "known unknowns," and "unknown unknowns," this issue remained an unknown unknown for more than a little time.

Generally, addressing these unknown unknowns is likely to be easiest if we have the highest granularity of control. Alternative 2, by breaking each stream apart into its own instance of the control structure that has historically been used to work with media (the m= section), provides this high granularity where alternatives 1a and 1b do not.

It is the author's opinion that the probable existence of such unknown unknowns favors alternative 2 over 1a or 1b.

2.4. Red Herrings

During the course of discussing this topic, several points have been raised that, while relevant, do not bias the selection of one solution over another.

One issue that has been brought up is that SDP offer/answer requires signaling of the number of m= sections in the offer, to allow clear semantics for negotiation. Some proponents of solutions 1a and 1b have indicated a belief that allowing multiple streams per m= section avoids this restriction. This assertion has a number of problems. First, it assumes that implementations can perform reasonable operations on dynamically created media streams that begin and end without any signaling. It further assumes that the problems that led the offer/answer model to impose the m-line restrictions are no longer applicable (at least, not on a stream level). Finally, this assertion assumes that no control surfaces are necessary to talk about and/or manipulate the individual streams (alternately, if such control surfaces are introduced, then additional SDP round-trips to exchange information about those controls are necessary, making them semantically equivalent to a new offer/answer exchange, which eliminates any purported advantage).

It has also been observed that, in addition to being sometimes applicable to streams and sometimes applicable to sessions, attributes are also sometimes unidirectional and sometimes bidirectional. While an astute observation, this does not appear to have any bearing on the ultimate solution selected, as all three alternatives face exactly the same challenges in dealing with issues of directionality.

Finally, it should be noted that any decision to include multiple streams within a single m= section does little to simplify implementation. Even if native RTCWEB implementations generate the fewest m= sections necessary to convey their desired session state, the selection of alternatives 1a and 1b does not obviate the requirement that implementations must be able to receive SDP with several m=audio sections (for example). Inter-operation with legacy implementations, even through a gateway, will require that proper handling of such session descriptions is present in every RTCWEB implementation.

2.5. Summary

The following table summarizes the pros and cons conveyed in the preceding sections on a per-solution basis.

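+---------------+----+----+----+
| Issue         | 1a | 1b | 2  |
+---------------+----+----+----+
| Section 2.3.1 | -  | -  | +  |
| Section 2.3.2 | -  | +  | -  |
| Section 2.3.3 | -  | +  | +  |
| Section 2.3.4 | -  | -  | +  |
+---------------+----+----+----+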

Based on these criteria, it is the author's belief that Alternative 2 provides the most benefit, with Alternative 1b providing a close second place.

Alternative 1a has the remarkable property of combining all of the drawbacks of solutions 1b and 2, forming a kind of "sweet-spot" of ill-advisement, and thereby maximizing the amount of work required of the MMUSIC, RTCWEB, and CLUE working groups.

3. IANA Considerations

This document makes no requests of IANA.

4. Security Considerations

The author does not believe that the syntax under discussion has an impact on the security properties of those protocols that make use of SDP.

5. Normative References

[RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with Session Description Protocol (SDP)", RFC 3264, June 2002.
[RFC5576] Lennox, J., Ott, J. and T. Schierl, "Source-Specific Media Attributes in the Session Description Protocol (SDP)", RFC 5576, June 2009.
[I-D.ietf-mmusic-sdp-bundle-negotiation] Holmberg, C. and H. Alvestrand, "Multiplexing Negotiation Using Session Description Protocol (SDP) Port Numbers", Internet-Draft draft-ietf-mmusic-sdp-bundle-negotiation-01, August 2012.
[I-D.holmberg-mmusic-sdp-mmt-negotiation] Holmberg, C., Alvestrand, H. and J. Lennox, "Multiplexed Media Types (MMT) Using Session Description Protocol (SDP) Port Numbers", Internet-Draft draft-holmberg-mmusic-sdp-mmt-negotiation-00, October 2012.

Author's Address

Adam Roach
Mozilla
Dallas, TX
US

EMail: adam@nostrum.com
