| Network Working Group | B. Burman | 
| Internet-Draft | M. Westerlund | 
| Intended status: Informational | Ericsson | 
| Expires: August 04, 2013 | January 31, 2013 | 
Multi-Media Concepts and Relations
  draft-burman-rtcweb-mmusic-media-structure-00
There are currently significant efforts ongoing in the IETF regarding more advanced multi-media functionalities, such as the work related to RTCWEB and CLUE. This work includes use cases for both multi-party communication and multiple media streams from an individual end-point. The usage of scalable encoding or simulcast encoding, as well as different types of transport mechanisms, has created additional needs to correctly identify different types of resources and describe their relations to achieve intended functionalities.
The different usages have both commonalities and differences in needs and behavior. This document attempts to review some usages and identify commonalities and needs. It then continues to highlight important aspects that need to be considered in the definition of these usages.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 04, 2013.
Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
This document concerns itself with the conceptual structures that can be found in different logical levels of a multi-media communication, from transport aspects to high-level needs of the communication application. The intention is to provide considerations and guidance that can be used when discussing how to resolve issues in the RTCWEB and CLUE related standardization. Typical use cases for those WGs have commonalities that should likely be addressed similarly, and in a way that allows the solutions to be aligned.
The document starts by going deeper into the motivation for why this has become an important problem at this time. This is followed by studies of some use cases and the concepts they contain, and concludes with a discussion of observed commonalities and important aspects to consider.
A number of new needs and requirements have arisen lately from work such as WebRTC/RTCWEB [I-D.ietf-rtcweb-overview] and CLUE [I-D.ietf-clue-framework]. The applications considered in those WGs have surfaced new requirements on the usage of both RTP [RFC3550] and existing signalling solutions.
The main application aspects that have created new needs are:
The presence of multiple media resources or components creates a need to identify, handle and group those resources across multiple different instantiations or alternatives.
There are many different existing RTP usages. This section brings up some that we deem interesting in comparison to the other use cases.
This use case is intended to function as a base-line to contrast against the rest of the use cases.
The communication context is an audio-only bi-directional communication between two users, Alice and Bob. This communication uses a single multi-media session that can be established in a number of ways, but let's assume SIP/SDP [RFC3261][RFC3264]. This multi-media session contains two end-points, one for Alice and one for Bob. Each end-point has an audio capture device that is used to create a single audio media source at each end-point.
+-------+         +-------+
| Alice |<------->|  Bob  |
+-------+         +-------+
Figure 1: Point-to-point Audio
The session establishment (SIP/SDP) negotiates the intent to communicate over RTP using only the audio media type. Inherent in the application is an assumption of only a single media source in each direction. The boundaries for the encodings are represented using RTP Payload types in conjunction with the SDP bandwidth parameter (b=). Because the session establishment negotiates the use of RTP, an RTP session is created for the audio. The underlying transport flows, in this case one bi-directional UDP flow for RTP and another for RTCP, are configured by each end-point providing its IP address and port, which become source or destination depending on the direction in which a packet is sent.
The RTP session will have two RTP media streams, one in each direction. Each stream carries the encoding of the media source that the sending implementation has chosen within the boundaries established by the RTP payload types and other SDP parameters, e.g. codec and bit-rates. In the RTP context, the streams are identified by their SSRCs.
This use case is a multi-party use case with a central conference node performing media mixing. It also includes two media types, both audio and video. The high level topology of the communication session is the following:
+-------+         +------------+           +-------+
|       |<-Audio->|            |<--Audio-->|       |
| Alice |         |            |           |  Bob  |
|       |<-Video->|            |<--Video-->|       |
+-------+         |            |           +-------+
                  |   Mixer    |
+-------+         |            |           +-------+
|       |<-Audio->|            |<--Audio-->|       |
|Charlie|         |            |           | David |
|       |<-Video->|            |<--Video-->|       |
+-------+         +------------+           +-------+
 
Figure 2: Audio and Video Conference with Centralized Mixing
The communication session is a multi-party conference including the four users Alice, Bob, Charlie, and David. This communication session contains four end-points and one middlebox (the Mixer). The communication session is established using four different multi-media sessions; one each between the user's endpoints and the middlebox. Each of these multi-media sessions uses a session establishment method, like SIP/SDP.
Looking at a single multi-media session between a user, e.g. Alice, and the Mixer, there exist two media types, audio and video. Alice has two capture devices, one video camera giving her a video media source, and an audio capture device giving an audio media source. These two media sources are captured in the same room by the same end-point and thus have a strong timing relationship, requiring inter-media synchronization at playback to provide the correct fidelity. Thus Alice's endpoint has a synchronization context that both her media sources use.
These two media sources are encoded using encoding parameters within the boundaries that have been agreed between the end-point and the Mixer using the session establishment. As has been common practice, each media type will use its own RTP session between the end-point and the mixer. Thus a single audio stream using a single SSRC will flow from Alice to the Mixer in the Audio RTP session and a single video stream will flow in the Video RTP session. Using this division into separate RTP sessions, the bandwidth of both audio and video can be unambiguously and separately negotiated by the SDP bandwidth attributes exchanged between the end-points and the mixer. Each RTP session uses its own Transport Flows. The common synchronization context across Alice's two media streams is identified by binding both streams to the same CNAME, generated by Alice's endpoint.
The mixer does not have any physical capture devices; instead it creates conceptual media sources. It provides two media sources towards Alice: an audio source that is a mix of the audio from Bob, Charlie and David, and a conceptual video source that contains a selection of one of the other video sources received from Bob, Charlie, or David depending on who is speaking. The Mixer's audio and video sources are provided in an encoding using a codec that is supported by both Alice's endpoint and the mixer. Each of these streams is identified by a single SSRC in its respective RTP session.
The mixer will have its own synchronization context and it will inject the media from Bob, Charlie and David in a synchronized way into the mixer's synchronization context to maintain the inter-media synchronization of the original media sources.
The mixer establishes independent multimedia sessions with each of the participants' endpoints. The mixer will in most cases also have unique conceptual media sources for each of the endpoints. This is because audio mixes and video selections typically exclude media sources originating from the receiving end-point. For example, Bob's audio mix will be a mix of Alice, Charlie and David, and will not contain Bob's own audio.
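The per-receiver exclusion rule can be illustrated with a minimal sketch. It is not taken from any specification; the function name `mix_sources_for` and the participant list are hypothetical and only model which sources contribute to each receiver's audio mix.

```python
def mix_sources_for(receiver, participants):
    """Return the participants whose audio is mixed for `receiver`.

    The receiver's own audio is excluded from the mix it receives.
    """
    return [p for p in participants if p != receiver]


participants = ["Alice", "Bob", "Charlie", "David"]

# Bob hears everyone except himself.
print(mix_sources_for("Bob", participants))  # ['Alice', 'Charlie', 'David']
```

Because the result differs for every receiver, the mixer ends up with one conceptual audio source per participant rather than a single shared mix.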
This use case may need unique user identities across the whole communication session. An example of such functionality is a participant list that includes audio energy levels showing who is speaking within the audio mix. If that information is carried in RTP using the RTP header extension for mixer-to-client audio level indication [RFC6465], then contributing source identities in the form of CSRCs need to be bound to the other end-points' media sources or user identities. This is needed despite the fact that each RTP session towards a particular user's endpoint is terminated in the RTP mixer. It points out the need for identifiers that exist in multiple multi-media session contexts. In most cases this can easily be solved by the application having identities tailored specifically for its own needs, but some applications will benefit from having access to some commonly defined structure for media source identities.
This use case is similar to the centralized mixing conference above, with the difference that the mixer does not mix media streams by decoding, mixing and re-encoding them, but rather switches a selection of received media more or less unmodified towards receiving end-points. This difference may not be very apparent to the end-user, but the main motivations to eliminate the mixing operation and switch rather than mix are:
Without the mixing operation, the mixer has limited ability to create conceptual media sources that are customized for each receiver. The reasons for such customizations come from sender and receiver differences in available resources and preferences:
To enable elimination of the mixing operation, media sent to the mixer must sufficiently well meet the above constraints for all intended receivers. There are several ways to achieve this. One way is to, by some system-wide design, ensure that all senders and receivers are basically identical in all the above aspects. This may however prove unrealistic when variations in conditions and end-points are too large. Another way is to let a sender provide a (small) set of alternative representations for each sent media source, enough to sufficiently well cover the expected range of variation. If those media source representations, encodings, are independent from one another, they constitute a Simulcast of the media source. If an encoding is instead dependent on and thus requires reception of one or more other encodings, the representation of the media source jointly achieved by all dependent encodings is said to be Scalable. Simulcast and Scalable encoding can also be combined.
Both Simulcast and Scalable encodings result in a single media source generating multiple RTP media streams of the same media type. The division of bandwidth between the Simulcast or Scalable streams for a single media source is application specific and will vary. The total bandwidth for a Simulcast or a Scalable source is the sum of all included RTP media streams. Since all streams in a Simulcast or Scalable source originate from the same capture device, they are closely related and should thus share synchronization context.
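The distinction between Simulcast and Scalable, and the bandwidth sum, can be sketched as follows. This is an illustrative model only: the `Encoding` class, the `depends_on` field, and the bit-rates are assumptions made for the example and are not defined by RTP, SDP, or this document.

```python
from dataclasses import dataclass, field


@dataclass
class Encoding:
    name: str
    bitrate_kbps: int                               # hypothetical example value
    depends_on: list = field(default_factory=list)  # names of required encodings


def classify(encodings):
    """Classify the set of encodings produced from one media source."""
    dependent = [e for e in encodings if e.depends_on]
    independent = [e for e in encodings if not e.depends_on]
    if not dependent:
        return "simulcast"           # all encodings are independent
    if len(independent) <= 1:
        return "scalable"            # layers built on one base encoding
    return "simulcast+scalable"      # several bases, some with dependent layers


def total_bandwidth_kbps(encodings):
    """Total bandwidth is the sum over all included RTP media streams."""
    return sum(e.bitrate_kbps for e in encodings)


simulcast = [Encoding("high", 1500), Encoding("low", 300)]
scalable = [Encoding("base", 300), Encoding("enh", 700, depends_on=["base"])]

print(classify(simulcast), total_bandwidth_kbps(simulcast))  # simulcast 1800
print(classify(scalable), total_bandwidth_kbps(scalable))    # scalable 1000
```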
The first and second customizations listed above, presenting multiple conference users simultaneously, aligned with the presentation needs in the receiver, can also be achieved without mixing operation by simply sending appropriate quality media from those users individually to each receiver. The total bandwidth of this user presentation aggregate is the sum of all included RTP media streams. Audio and video from a single user share synchronization context and can be synchronized. Streams that originate from different users do not have the same synchronization context, which is acceptable since they do not need to be synchronized, but just presented jointly.
An actual mixer device need not be either mixing-only or switching-only, but may implement both mixing and switching and may also choose dynamically what to do for a specific media and a specific receiving user on a case-by-case basis or based on some policy.
This section brings up two different instantiations of WebRTC [ref-webrtc10] that stress different aspects. But let's start with reviewing some important aspects of WebRTC and the MediaStream [ref-media-capture] API.
In WebRTC, an application gets access to a media source by calling getUserMedia(), which creates a MediaStream [ref-media-capture] (note the capitalization). A MediaStream consists of zero or more MediaStreamTracks, where each MediaStreamTrack is associated with a media source. These locally generated MediaStreams and their tracks are connected to local media sources, which can be media devices such as video cameras or microphones, but can also be files.
A WebRTC PeerConnection (PC) is an association between two endpoints that is capable of communicating media from one end to the other. The PC concept includes establishment procedures, including media negotiation. Thus a PC is an instantiation of a Multimedia Session. When one end-point adds a MediaStream to a PC, the other endpoint will by default receive an encoded representation of the MediaStream and the active MediaStreamTracks.
This is a use case of WebRTC which establishes a multi-party communication session by establishing an individual PC with each participant in the communication session.
+---+      +---+
| A |<---->| B |
+---+      +---+
  ^         ^   
   \       /    
    \     /     
     v   v      
     +---+      
     | C |      
     +---+      
 
Figure 3: WebRTC Mesh-based Multi-party
Users A, B and C want to have a joint communication session. This communication session is created using a Web-application without any central conference functionality. Instead, it uses a mesh of PeerConnections to connect each participant's endpoint with the other endpoints. In this example, three double-ended connections are required to connect the three participants, and each endpoint has two PCs.
This is an audio and video communication and each end-point has one video camera and one microphone as media sources. Each endpoint creates its own MediaStream with one video MediaStreamTrack and one audio MediaStreamTrack. The endpoints add their MediaStream to both of their PCs.
Let's now focus on a single PC; in this case the one established between A and B. During the establishment of this PC, the two endpoints agree to use only a single transport flow for all media types, thus a single RTP session is created between A and B. A's MediaStream has one audio media source that is encoded according to the boundaries established by the PeerConnection establishment signalling, which includes the RTP payload types and thus Codecs supported as well as bit-rate boundaries. The encoding of A's media source is then sent in an RTP stream identified by a unique SSRC. In this case, as there are two media sources at A, two encodings will be created which will be transmitted using two different RTP streams with their respective SSRC. Both these streams will reference the same synchronization context through a common CNAME identifier used by A. B will have the same configuration, thus resulting in at least four SSRC being used in the RTP session part of the A-B PC.
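The counts in this mesh example follow from simple arithmetic, sketched below. The function names are hypothetical, and the SSRC count assumes exactly one RTP stream per media source, so it is a lower bound (retransmission or other auxiliary streams would add more).

```python
def mesh_peer_connections(n_endpoints):
    """Number of PeerConnections in a full mesh of n endpoints."""
    return n_endpoints * (n_endpoints - 1) // 2


def peer_connections_per_endpoint(n_endpoints):
    """Each endpoint holds one PC towards every other endpoint."""
    return n_endpoints - 1


def min_ssrcs_in_session(sources_per_endpoint):
    """Lower bound on SSRCs in one RTP session: one per media source."""
    return sum(sources_per_endpoint)


print(mesh_peer_connections(3))          # 3
print(peer_connections_per_endpoint(3))  # 2
# A and B each send one audio and one video source over the A-B PC.
print(min_ssrcs_in_session([2, 2]))      # 4
```

The quadratic growth of `mesh_peer_connections` is one reason an endpoint may want to reuse a single encoding across several PCs.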
Depending on the configuration of the two PCs that A has, i.e. the A-B and the A-C ones, A could potentially reuse the encoding of a media source in both contexts, under certain conditions. First, a common codec and configuration needs to exist and the boundaries for these configurations must allow a common work point. In addition, the required bandwidth capacity needs to be available over the paths used by the different PCs. These conditions will not always both hold. Thus it is quite likely that the endpoint will sometimes instead be required to produce two different encodings of the same media source.
If an application needs to reference the media from a particular endpoint, it can use the MediaStream and MediaStreamTrack as they point back to the media sources at a particular endpoint. This is possible because the MediaStream has a scope that is not PeerConnection specific.
The programmer can however implement this differently while supporting the same use case. In this case the programmer creates two MediaStreams that each have MediaStreamTracks that share common media sources. This can be done either by calling getUserMedia() twice, or by cloning the MediaStream obtained by the only getUserMedia() call. In this example the result is two MediaStreams that are connected to different PCs. From an identity perspective, the two MediaStreams are different but share common media sources. This fact is currently not made explicit in the API.
This section concerns itself with endpoints that have more than one media source for a particular media type. A straightforward example would be a laptop with a built in video camera used to capture the user and a second video camera, for example attached by USB, that is used to capture something else the user wants to show. Both these cameras are typically present in the same sound field, so it will be common to have only a single audio media source.
A possible way of representing this is to have two MediaStreams, one with the built in camera and the audio, and a second one with the USB camera and the audio. Each MediaStream is intended to be played with audio and video synchronized, but the user (local or remote) or application is likely to switch between the two captures.
It becomes important for a receiving endpoint that it can determine that the audio in the two MediaStreams has the same synchronization context. Otherwise a receiver may play back the same media source twice, with some time overlap, at a switch between playing the two MediaStreams. Being able to determine that they are the same media source further allows for removing redundancy by having a single encoding, if appropriate, for both MediaStreamTracks.
WebRTC endpoints can relay a received MediaStream from one PC to another by the simple API level maneuver of adding the received MediaStream to the other PC. Realizing this in the implementation is more complex, and it can also cause some issues from a media perspective. If an application spanning multiple endpoints that relay media between each other makes a mistake, a media loop can be created. Media loops could become a significant issue. For example, an audio echo could be created, i.e. an endpoint receives its own media without detecting that it is its own media and plays it back with some delay. If a WebRTC endpoint produces a conceptual media source by mixing incoming MediaStreams and there is no loop detection, a feedback loop can be created.
RTP has loop detection to detect and handle such cases within a single RTP session. However, in the context of WebRTC, the RTP session is local to the PC and thus cannot rely on the RTP level loop detection. Instead, if this protection is needed on the WebRTC MediaStream level, it could for example be achieved by having media source identifiers that can be preserved between the different MediaStreams in the PCs.
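A minimal sketch of loop detection based on preserved media source identifiers is shown below. It assumes that every relayed stream carries an identifier of its original media source that survives relaying between PCs; no such identifier is currently defined, so the `RelayEndpoint` class and the identifier strings are purely hypothetical.

```python
class RelayEndpoint:
    """Endpoint that refuses to play back or re-mix its own media."""

    def __init__(self, own_source_ids):
        # Identifiers of the media sources captured at this endpoint.
        self.own_source_ids = set(own_source_ids)

    def accept(self, incoming_source_id):
        """Return False when the incoming media originated at this endpoint."""
        return incoming_source_id not in self.own_source_ids


a = RelayEndpoint(own_source_ids={"src-A-audio", "src-A-video"})

# Audio from B, relayed or not, is accepted.
print(a.accept("src-B-audio"))  # True
# A's own audio, relayed back via another endpoint, is detected as a loop.
print(a.accept("src-A-audio"))  # False
```

The check works regardless of which PC the media arrives on, which is exactly what RTP level loop detection, scoped to a single RTP session, cannot provide here.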
When relaying media, it is beneficial to know when one receives multiple encodings of the same source. For example, if one encoding arrives with a delay of 80 ms and another with 450 ms, being able to choose the one with 80 ms, rather than being forced to delay all media sources from the same synchronization context to match the most delayed one, improves performance.
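The selection described above can be sketched as follows, assuming the receiver can tell which arriving encodings belong to the same media source. The tuple layout and the identifiers are hypothetical and only serve the example.

```python
def lowest_delay_per_source(arrivals):
    """Pick, for each media source, the encoding with the lowest delay.

    `arrivals` is an iterable of (source_id, encoding_id, delay_ms) tuples.
    Returns a dict mapping source_id to (encoding_id, delay_ms).
    """
    best = {}
    for source_id, encoding_id, delay_ms in arrivals:
        if source_id not in best or delay_ms < best[source_id][1]:
            best[source_id] = (encoding_id, delay_ms)
    return best


arrivals = [
    ("cam-A", "enc-via-B", 450),
    ("cam-A", "enc-direct", 80),
]
print(lowest_delay_per_source(arrivals))  # {'cam-A': ('enc-direct', 80)}
```

Without a shared source identifier the two arrivals would look like unrelated streams, and the receiver could not safely drop the slower one.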
In this section we look at a use case applying simulcast from each user's endpoint to a central conference node to avoid the need for an individual encoding to each receiving endpoint. Instead, the central node chooses which of the available encodings is forwarded to a particular receiver, like in Section 3.1.3.
+-----------+      +------------+ Enc2 +---+ 
| A   +-Enc1|----->|            |----->| B | 
|     |     |      |            |      +---+ 
| Src-+-Enc2|----->|            | Enc1 +---+ 
+-----------+      |   Mixer    |----->| C | 
                   |            |      +---+ 
                   |            | Enc2 +---+ 
                   |            |----->| D | 
                   +------------+      +---+
Figure 4
In this Communication Session there are four users with endpoints and one middlebox (The Mixer). This is an audio and video communication session. The audio source is not simulcasted and the endpoint only needs to produce a single encoding. For the video source, each endpoint will produce multiple encodings (Enc1 and Enc2 in Figure 4) and transfer them simultaneously to the mixer. The mixer picks the most appropriate encoding for the path from the mixer to each receiving client.
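One possible selection policy for the mixer is sketched below. The document only says that the mixer picks the most appropriate encoding for each path; using available path bandwidth as the sole criterion, as well as all bit-rate values, are assumptions made for this example.

```python
def pick_encoding(encodings, path_kbps):
    """Pick the highest bit-rate encoding that fits the path.

    `encodings` maps encoding name to bit-rate in kbit/s. If nothing fits,
    fall back to the lowest bit-rate encoding.
    """
    fitting = {name: rate for name, rate in encodings.items() if rate <= path_kbps}
    if fitting:
        return max(fitting, key=fitting.get)
    return min(encodings, key=encodings.get)


# Hypothetical bit-rates for the two simulcast encodings sent by A.
encodings = {"Enc1": 1500, "Enc2": 300}
# Hypothetical available bandwidth from the mixer to each receiver.
paths = {"B": 500, "C": 2000, "D": 400}

for receiver, kbps in paths.items():
    print(receiver, pick_encoding(encodings, kbps))
# B Enc2
# C Enc1
# D Enc2
```

With these assumed values the result matches the forwarding shown in Figure 4.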
Currently there exists no specified way in WebRTC to realise the above, although use-cases and requirements discuss simulcast functionality. The authors believe there exist two possible solution alternatives in the WebRTC context:
Both of these solutions share a common requirement: the need to separate the received RTP streams not only based on media source, but also on the encoding. However, on an API level the solutions appear different. For Multiple Encodings within the context of a PC, the receiver will need new access methods for accessing and manipulating the different encodings. Using multiple PCs instead requires that one can easily determine the shared (simulcasted) media source despite receiving it in multiple MediaStreams on different PCs. If the same MediaStream is added to both PCs, the ids of the MediaStream and MediaStreamTracks will be the same, while they will be different if different MediaStreams (but representing the same sources) are added to the two PCs.
The CLUE framework [I-D.ietf-clue-framework] and use case [I-D.ietf-clue-telepresence-use-cases] documents make use of most, if not all, media concepts that were already discussed in previous sections, and add a few more.
A communicating CLUE Endpoint can, compared to other types of Endpoints, be characterized by using multiple media resources:
To make the multitude of resources more manageable, CLUE introduces some additional structures. For example, related media sources in a multimedia session are grouped into Scenes, which can generally be represented in different ways, described by alternative Scene Entries. CLUE explicitly separates the concept of a media source from the encoded representations of it and a single media source can be used to create multiple Encodings. It is also possible in CLUE to account for constraints in resource handling, like limitations in possible Encoding combinations due to physical device implementation.
The number of media resources typically differs between Endpoints. Specifically, the number of available media resources of a certain type used for sending at the sender side typically does not match the number of corresponding media resources used for receiving at the receiver side. Some selection process must thus be applied either at the sender or the receiver to select a subset of resources to be used. Hence, each resource that needs to be part of that selection process must have some identification and characterization that can be understood by the selecting party. In the CLUE model, the sender (Provider) announces available resources and the receiver (Consumer) chooses what to receive. This choice is made independently in the two directions of a bi-directional communication.
The definition of a single CLUE Endpoint in the framework [I-D.ietf-clue-framework] says it can consist of several physical devices with source and sink media streams. This means that each logical node of such distributed Endpoint can have a separate transport interface, and thus that media sources originating from the same Endpoint can have different transport addresses.
This section discusses some conclusions the authors make based on the use cases. First we will discuss commonalities between use cases. Secondly we will provide a summary of issues we see affect WebRTC. Lastly we consider aspects that need to be considered in the SDP evolution that is ongoing.
The above use cases illustrate a couple of concepts that are not well defined, nor have they fully specified standard mechanisms or behaviors. This section contains a discussion of such concepts, which the authors believe are useful in more than one context and thus should be defined to provide a common function when needed by multi-media communication applications.
In several of the above use cases there exists a need for a separation between the media source, the particular encoding and its transport stream. In vanilla RTP there exists a one-to-one mapping between these; one media source is encoded in one particular way and transported as one RTP stream using a single SSRC in a particular RTP session.
The reason for not keeping a strict one-to-one mapping, allowing the media source to be identified separately from the RTP media stream (SSRC), varies depending on the application's needs and the desired functionalities:
The second argument above can be generalized into a common need in applications that utilize multiple multimedia sessions, such as multiple PeerConnections or multiple SIP/SDP-established RTP sessions, to form a larger communication session between multiple endpoints. These applications commonly need to track media sources that occur in more than one multimedia session.
Looking at both CLUE and WebRTC, they appear to contain their own variants of the concept that was denoted a media source above. In CLUE it is called Media Capture. In WebRTC each MediaStreamTrack is identifiable; however, several MediaStreamTracks can share the actual source, and there is currently no way for the application to realize this. The identification of sources is being discussed, and there is a proposal [ref-leithead] that introduces the concept 'Track Source'. Thus, in this document we see the media source as the generalized commonality between these two concepts. Giving each media source a unique identifier in the communication session/context that is reused in all the PeerConnections or SIP/SDP-established RTP sessions would enable loop detection, allow alternative encodings to be correctly associated, and provide a common name across the endpoints for application logic to reference the actual media source rather than a particular encoding or transport stream.
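The separation of media source, encoding, and transport stream can be modelled with a small data structure. This is only a sketch: the `RtpStream` fields, in particular `source_id`, represent the kind of session-wide media source identifier discussed here and do not exist in RTP, SDP, or the WebRTC API today.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RtpStream:
    session: str    # label of the PeerConnection or RTP session
    ssrc: int       # identifies the stream only within that session
    source_id: str  # hypothetical identifier shared across all sessions
    encoding: str   # label of the encoding carried by the stream


def streams_by_source(streams):
    """Group RTP streams from any number of sessions by media source."""
    grouped = defaultdict(list)
    for stream in streams:
        grouped[stream.source_id].append(stream)
    return dict(grouped)


streams = [
    RtpStream("PC A-B", 0x1111, "A-camera", "720p"),
    RtpStream("PC A-C", 0x2222, "A-camera", "360p"),
    RtpStream("PC A-B", 0x3333, "A-microphone", "audio"),
]

for source_id, group in streams_by_source(streams).items():
    print(source_id, [(s.session, s.encoding) for s in group])
# A-camera [('PC A-B', '720p'), ('PC A-C', '360p')]
# A-microphone [('PC A-B', 'audio')]
```

The SSRC values differ per session, so only the shared `source_id` lets application logic see that the two video streams are alternative encodings of one camera.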
It is arguable whether the application should really know a long-term persistent source identification, such as one based on hardware identities, for example due to fingerprinting issues. It would likely be better to use an anonymous identification that is still unique in a sufficiently wide context, for example within the communication application instance.
An Encoding is a particular encoded representation of a particular media source. In the context of RTP and Signalling, a particular encoding must fit the established parameters, such as RTP payload types, media bandwidths, and other more or less codec-specific media constraints such as resolution, frame-rate, fidelity, audio bandwidth, etc.
In the context of an application, it appears that there are primarily two considerations around the use of multiple encodings.
The first is how many encodings there are and what their defining parameters are. This may need to be negotiated, something the existing signalling solutions, like SDP, currently lack support for. For example, in SDP there exists no way to express that you would like to receive three different encodings of a particular video source. In addition, if you for example prefer these three encodings to be 720p/25 Hz, 360p/25 Hz and 180p/12.5 Hz, then even if you could define RTP payload types with these constraints, they must be linked to RTP streams carrying the encodings of the particular source. Also, for some RTP payload types it is difficult to express encoding characteristics with the desired granularity. The number of RTP payload types that can be used for a particular potential encoding can also be a constraint, especially as a single RTP payload type could well be used for all three target resolutions and frame rates in the example. Using multiple encodings might even be desirable for multi-party conferences that switch video, rather than composite and re-encode it. It might be that SDP is not the most suitable place to negotiate this. From an application perspective, utilizing clients that have standardized APIs or protocols to control them, there exists a need for the application to express what it prefers in number of encodings as well as what their primary target parameters are.
Secondly, some applications may need explicit indication of what encoding a particular stream represents. In some cases this can be deduced based on information such as RTP payload types and parameters received in the media stream, but such implicit information will not always be detailed enough and it may also be time-consuming to extract. For example, in SDP there are currently limitations for binding the relevant information about a particular encoding to the corresponding RTP stream, unless only a single RTP stream is defined per media description (m= line).
The CLUE framework explicitly discusses encodings as constraints that are applied when transforming a media source (capture) into what CLUE calls a capture encoding. This includes both explicit identification and a set of boundary parameters such as maximum width, height, frame rate and bandwidth. In WebRTC nothing related has yet been defined, and we note this as an issue that needs to be resolved, since the authors expect that support for multiple encodings will be required to enable simulcast and scalability.
The shortcomings around synchronization contexts appear rather limited. In RTP, each RTP media stream is associated with a particular synchronization context through the CNAME source description (SDES) item. The main concerns here are likely twofold.
The first concern is to avoid unnecessary creation of new contexts, and rather correctly associate with the contexts that actually exist. For example, WebRTC MediaStreams are defined so that all MediaStreamTracks within a particular MediaStream shall be synchronized. An easy method for meeting this would be to assign a new CNAME for each MediaStream. However, that would ignore the fact that several media sources from the same synchronization context may appear in different combinations across several MediaStreams. Thus all these MediaStreams should share synchronization context to avoid playback glitches, like playing back different instantiations of a single media source out of sync because the media source was shared between two different MediaStreams.
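The difference between assigning a CNAME per MediaStream and assigning it per synchronization context can be sketched as follows. The stream names, source names, and CNAME strings are hypothetical; the example reuses the earlier scenario of two cameras sharing one microphone.

```python
def cnames_per_stream(media_streams):
    """Naive policy: every MediaStream gets its own CNAME."""
    return {
        (stream_id, source): f"cname-{stream_id}"
        for stream_id, sources in media_streams.items()
        for source in sources
    }


def cnames_per_context(media_streams, context_of):
    """Preferred policy: the CNAME follows the source's synchronization context."""
    return {
        (stream_id, source): f"cname-{context_of[source]}"
        for stream_id, sources in media_streams.items()
        for source in sources
    }


def cnames_used_by(source, assignment):
    """Set of CNAMEs under which one media source appears."""
    return {cname for (_, src), cname in assignment.items() if src == source}


media_streams = {
    "ms1": ["builtin-cam", "mic"],
    "ms2": ["usb-cam", "mic"],
}
# All three sources are captured in the same room.
context_of = {"builtin-cam": "room", "usb-cam": "room", "mic": "room"}

print(len(cnames_used_by("mic", cnames_per_stream(media_streams))))               # 2
print(len(cnames_used_by("mic", cnames_per_context(media_streams, context_of))))  # 1
```

Under the naive policy the same microphone appears in two unrelated synchronization contexts, which is the situation that can lead to the playback glitches described above.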
The second concern is that synchronization context identification in RTP, i.e. CNAME, is overloaded as an endpoint identifier. As an example, consider an endpoint that has two synchronization contexts: one for audio and video in the room and another for an audio and video presentation stream, like the output of a DVD player. Assuming that an endpoint has only a single synchronization context and CNAME may be incorrect and could create issues that application designers, as well as RTP and signalling extension specifications, need to watch out for.
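The distinction can be illustrated with a small Python sketch in which a receiver keeps synchronization grouping and endpoint identity as separate keys. The `ReceivedStream` type and the `endpoint_id` field are assumptions made for this example; in particular, the endpoint identifier is assumed to come from signalling, since RTP itself does not provide one separate from CNAME.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class ReceivedStream(NamedTuple):
    ssrc: int
    cname: str        # synchronization context, learned from RTCP SDES
    endpoint_id: str  # assumption: learned from signalling, not from RTP


def sync_groups(streams: Iterable[ReceivedStream]) -> Dict[str, List[int]]:
    """Group the SSRCs that share a synchronization context (same CNAME)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for s in streams:
        groups[s.cname].append(s.ssrc)
    return dict(groups)


def contexts_per_endpoint(streams: Iterable[ReceivedStream]) -> Dict[str, Set[str]]:
    """Collect the CNAMEs used by each endpoint; an endpoint may have several."""
    contexts: Dict[str, Set[str]] = defaultdict(set)
    for s in streams:
        contexts[s.endpoint_id].add(s.cname)
    return dict(contexts)
```

For the room plus DVD player example, `contexts_per_endpoint` would report two CNAMEs for the same endpoint, which is exactly the case that breaks code treating CNAME as an endpoint identifier.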
CLUE has so far discussed synchronization quite little, but clearly intends to enable lip synchronization between captures that have that relation. The second issue is, however, quite likely to be encountered in CLUE due to the explicit inclusion of the Scene concept, where different Scenes are not required to share the same synchronization context; the concept is rather intended for situations where Scenes cannot share a synchronization context.
When an endpoint consists of multiple nodes, the added complexity is often local to that endpoint, which is appropriate. However, a few properties of distributed endpoints need to be tolerated by all entities in a multimedia communication session. The main item is to not assume that a single endpoint will use only a single network address. This is a dangerous assumption even for non-distributed endpoints due to multi-homing and the common deployment of NATs, especially large scale NATs, which in the worst case use multiple addresses for a single endpoint's transport flows.
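One way to avoid the single-address assumption is to key per-endpoint state on an identifier other than the transport address, and to treat observed addresses as a set that may grow. The following Python is a sketch under that assumption; `EndpointTable` and the signalled `endpoint_id` are hypothetical names, and the addresses in the usage note are documentation-only examples.

```python
from typing import Dict, Set, Tuple

Address = Tuple[str, int]  # (IP address, UDP port)


class EndpointTable:
    """Hypothetical table keyed by a signalled endpoint id, never by address."""

    def __init__(self) -> None:
        self._addresses: Dict[str, Set[Address]] = {}

    def observe(self, endpoint_id: str, source: Address) -> None:
        """Record that a flow from this endpoint arrived from `source`."""
        self._addresses.setdefault(endpoint_id, set()).add(source)

    def addresses(self, endpoint_id: str) -> Set[Address]:
        """Return a copy of every source address seen for this endpoint."""
        return set(self._addresses.get(endpoint_id, set()))
```

A distributed endpoint that sends audio from `("192.0.2.10", 5004)` and video from `("198.51.100.7", 5006)` then appears as one entry with two addresses, rather than as two unrelated participants.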
Distributed endpoints are brought up in the CLUE context. They are not specifically discussed in the WebRTC context; instead, the desire for transport level aggregation makes such endpoints problematic. However, WebRTC does allow for fallback to media type specific transport flows and can thus support distributed endpoints without issues.
In the process of identifying commonalities and differences between the different use cases, we have identified what appear to us to be issues in the current specification of WebRTC that need to be reviewed.
These issues need to be discussed and an appropriate way to resolve them must be chosen.
The joint MMUSIC / RTCWeb WGs interim meeting in February 2013 will discuss a number of SDP related issues around the handling of multiple sources; the aggregation of multiple media types over the same RTP session as well as RTP sharing its transport flow not only with ICE/STUN but also with the WebRTC data channel using SCTP/DTLS/UDP. These issues will potentially result in a significant impact on SDP. It may also impact other ongoing work as well as existing usages and applications, making these discussions difficult.
The above use cases and discussion point to the existence of a number of commonalities between WebRTC and CLUE, and to the conclusion that a solution should preferably be usable by both. It is a very open question how much functionality CLUE requires from SDP, as the CLUE WG plans to develop a protocol with a different usage model. The appropriate division in functionality between SDP and this protocol is currently unknown.
Based on this document, it is possible to express some protocol requirements when negotiating multimedia sessions and their media configurations. Note that this is written as requirements to consider, given that one believes this functionality is needed in SDP.
The Requirements:
The above are general requirements, and in some cases the appropriate point to address the requirement may not even be SDP. For example, media source identification could primarily be put in an RTCP Source Description (SDES) item, and only be included in the signalling as well when the application so requires.
The discussion in this document has an impact on the high level decision regarding how to relate RTP media streams to SDP media descriptions. However, as it currently presents concepts rather than giving concrete proposals on how to enable these concepts as extensions to SDP or other protocols, it is difficult to determine the actual impact that a high level solution will have. Nevertheless, the authors are convinced that neither of the directions will prevent the definition of suitable concepts in SDP.
This document makes no request of IANA.
Note to RFC Editor: this section may be removed on publication as an RFC.
The realization of the proposed concepts and the resolution of the identified issues will have security considerations. However, at this stage it is unclear whether there are any considerations beyond the already common ones of preserving privacy and confidentiality and ensuring integrity to prevent denial of service or quality degradation.