Network Working Group                                          B. Burman
Internet-Draft                                             M. Westerlund
Intended status: Informational                                  Ericsson
Expires: August 4, 2013                                 January 31, 2013

                   Multi-Media Concepts and Relations
             draft-burman-rtcweb-mmusic-media-structure-00
Abstract
There are currently significant efforts ongoing in IETF regarding
more advanced multi-media functionalities, such as the work related
to RTCWEB and CLUE. This work includes use cases for both multi-
party communication and multiple media streams from an individual
end-point. The usage of scalable or simulcast encoding, as well as
different types of transport mechanisms, has created additional needs
to correctly identify different types of resources and describe their
relations in order to achieve the intended functionality.
The different usages have both commonalities and differences in needs
and behavior. This document attempts to review some usages and
identify commonalities and needs. It then continues to highlight
important aspects that need to be considered in the definition of
these usages.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 4, 2013.
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Motivation
3. Use Cases
   3.1. Existing RTP Usages
        3.1.1. Basic VoIP call
        3.1.2. Audio and Video Conference
        3.1.3. Audio and Video Switched Conference
   3.2. WebRTC
        3.2.1. Mesh-based Multi-party
        3.2.2. Multi-source Endpoints
        3.2.3. Media Relaying
        3.2.4. Usage of Simulcast
   3.3. CLUE Telepresence
        3.3.1. Telepresence Functionality
        3.3.2. Distributed Endpoint
4. Discussion
   4.1. Commonalities in Use Cases
        4.1.1. Media Source
        4.1.2. Encodings
        4.1.3. Synchronization contexts
        4.1.4. Distributed Endpoints
   4.2. Identified WebRTC issues
   4.3. Relevant to SDP evolution
5. IANA Considerations
6. Security Considerations
7. Informative References
Authors' Addresses
1. Introduction
This document concerns itself with the conceptual structures that can
be found in different logical levels of a multi-media communication,
from transport aspects to high-level needs of the communication
application. The intention is to provide considerations and guidance
that can be used when discussing how to resolve issues in the RTCWEB
and CLUE related standardization. Typical use cases for those WGs
have commonalities that should likely be addressed similarly, in a
way that allows aligning them.
The document starts by going deeper into the motivation for why this
has become an important problem at this time. This is followed by a
study of some use cases and the concepts they contain, and it
concludes with a discussion of observed commonalities and important
aspects to consider.
2. Motivation
A number of new needs and requirements have arisen lately from work
such as WebRTC/RTCWEB [I-D.ietf-rtcweb-overview] and CLUE
[I-D.ietf-clue-framework]. The applications considered in those WGs
have surfaced new requirements on the usage of both RTP [RFC3550] and
existing signalling solutions.
The main application aspects that have created new needs are:
o Multiple Media Streams from an end-point. The fact that an end-
point may have multiple media capture devices, such as cameras or
microphone mixes.
o Group communications involving multiple end-points. This is
realized using both mesh-based connections and centralized
conference nodes, creating a need for dealing with multiple
endpoints and/or multiple streams with different origins from a
transport peer.
o Media Stream Adaptation, both to adjust network resource
consumption and to handle varying end-point capabilities in group
communication.
o Transport mechanisms including both higher levels of aggregation
[I-D.ietf-mmusic-sdp-bundle-negotiation]
[I-D.ietf-avtcore-multi-media-rtp-session] and the use of
application-level transport repair mechanisms such as forward
error correction (FEC) and/or retransmission.
The presence of multiple media resources or components creates a need
to identify, handle and group those resources across multiple
different instantiations or alternatives.
3. Use Cases
3.1. Existing RTP Usages
There are many different existing RTP usages. This section brings up
some that we deem interesting in comparison to the other use cases.
3.1.1. Basic VoIP call
This use case is intended to function as a base-line to contrast
against the rest of the use cases.
The communication context is an audio-only bi-directional
communication between two users, Alice and Bob. This communication
uses a single multi-media session that can be established in a number
of ways, but let's assume SIP/SDP [RFC3261][RFC3264]. This multi-
media session contains two end-points, one for Alice and one for Bob.
Each end-point has an audio capture device that is used to create a
single audio media source at each end-point.
+-------+ +-------+
| Alice |<------->| Bob |
+-------+ +-------+
Figure 1: Point-to-point Audio
The session establishment (SIP/SDP) negotiates the intent to
communicate over RTP using only the audio media type. Inherent in
the application is an assumption of only a single media source in
each direction. The boundaries for the encodings are represented
using RTP Payload types in conjunction with the SDP bandwidth
parameter (b=). The session establishment is also used to negotiate
that RTP will be used, resulting in an RTP session being created for
the audio. The underlying transport flows, in this case one
bi-directional UDP flow for RTP and another for RTCP, are configured
by each end-point providing its IP address and port, which become
source or destination depending on the direction in which a packet
is sent.
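For concreteness, a minimal SDP offer for such an audio-only session
could look as follows. This is a purely illustrative sketch, not
taken from any of the referenced specifications; the address, port,
payload types, and codecs are example values. The RTP payload types
and the b= line express the encoding boundaries discussed above:
   v=0
   o=alice 2890844526 2890844526 IN IP4 192.0.2.1
   s=-
   c=IN IP4 192.0.2.1
   t=0 0
   m=audio 49170 RTP/AVP 96 0
   b=AS:64
   a=rtpmap:96 opus/48000/2
   a=rtpmap:0 PCMU/8000
Bob's answer would select among the offered payload types and
provide his own address and port.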
The RTP session will have two RTP media streams, one in each
direction, which carries the encoding of the media source the sending
implementation has chosen based on the boundaries established by the
RTP payload types and other SDP parameters, e.g. codec and
bit-rates. The streams are identified in the RTP context by their
SSRCs.
3.1.2. Audio and Video Conference
This use case is a multi-party use case with a central conference
node performing media mixing. It also includes two media types, both
audio and video. The high level topology of the communication
session is the following:
+-------+ +------------+ +-------+
| |<-Audio->| |<--Audio-->| |
| Alice | | | | Bob |
| |<-Video->| |<--Video-->| |
+-------+ | | +-------+
| Mixer |
+-------+ | | +-------+
| |<-Audio->| |<--Audio-->| |
|Charlie| | | | David |
| |<-Video->| |<--Video-->| |
+-------+ +------------+ +-------+
Figure 2: Audio and Video Conference with Centralized Mixing
The communication session is a multi-party conference including the
four users Alice, Bob, Charlie, and David. This communication
session contains four end-points and one middlebox (the Mixer). The
communication session is established using four different multi-media
sessions; one between each user's endpoint and the middlebox. Each of
these multi-media sessions uses a session establishment method, like
SIP/SDP.
Looking at a single multi-media session between a user, e.g. Alice,
and the Mixer, there exist two media types, audio and video. Alice
has two capture devices, one video camera giving her a video media
source, and an audio capture device giving an audio media source.
These two media sources are captured in the same room by the same
end-point and thus have a strong timing relationship, requiring
inter-media synchronization at playback to provide the correct
fidelity. Thus Alice's endpoint has a synchronization context that
both her media sources use.
These two media sources are encoded using encoding parameters within
the boundaries that have been agreed between the end-point and the
Mixer using the session establishment. As has been common practice,
each media type will use its own RTP session between the end-point
and the mixer. Thus a single audio stream using a single SSRC will
flow from Alice to the Mixer in the Audio RTP session and a single
video stream will flow in the Video RTP session. Using this division
in separate RTP sessions, the bandwidth of both audio and video can
be unambiguously and separately negotiated by the SDP bandwidth
attributes exchanged between the end-points and the mixer. Each RTP
session is using its own Transport Flows. The common synchronization
context across Alice's two media streams is identified by binding
both streams to the same CNAME, generated by Alice's endpoint.
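As one illustration of how this binding can be expressed, consider
the source-level SDP attribute defined in RFC 5576; the SSRC values
and the CNAME string below are example values, and the same CNAME is
also carried in RTCP SDES packets:
   m=audio 49170 RTP/AVP 96
   a=ssrc:314159 cname:alice@host.example.com
   m=video 49172 RTP/AVP 97
   a=ssrc:271828 cname:alice@host.example.com
Both SSRCs carrying the same cname value indicates to the Mixer that
Alice's audio and video share a synchronization context.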
The mixer does not have any physical capture devices; instead, it
creates conceptual media sources. It provides two media sources
towards Alice: an audio source that is a mix of the audio from Bob,
Charlie and David, and a conceptual video source that contains a
selection of one of the video sources received from Bob, Charlie, or
David, depending on who is speaking. The Mixer's
audio and video sources are provided in an encoding using a codec
that is supported by both Alice's endpoint and the mixer. These
streams are identified by a single SSRC in the respective RTP
session.
The mixer will have its own synchronization context and it will
inject the media from Bob, Charlie and David in a synchronized way
into the mixer's synchronization context to maintain the inter-media
synchronization of the original media sources.
The mixer establishes independent multimedia sessions with each of
the participant's endpoints. The mixer will in most cases also have
unique conceptual media sources for each of the endpoints. This is
because audio mixes and video selections typically exclude media
sources originating from the receiving end-point. For example, Bob's
audio
mix will be a mix of Alice, Charlie and David, and will not contain
Bob's own audio.
This use case may need unique user identities across the whole
communication session. An example functionality of this is a
participant list which includes audio energy levels showing who is
speaking within the audio mix. If that information is carried in RTP
using the RTP header extension for Mixer-to-Client audio level
indication [RFC6465], then contributing source identities in the form
of CSRCs need to be bound to the other end-points' media sources or
user identities, despite the fact that each RTP session towards a
particular user's endpoint is terminated in the RTP mixer. This
points out the
need for identifiers that exist in multiple multi-media session
contexts. In most cases this can easily be solved by the application
having identities tailored specifically for its own needs, but some
applications will benefit from having access to some commonly defined
structure for media source identities.
3.1.3. Audio and Video Switched Conference
This use case is similar to the one above (Section 3.1.2), with the
difference that the mixer does not mix media streams by decoding,
mixing and re-encoding them, but rather switches a selection of
received media more or less unmodified towards receiving end-points.
This difference may not be very apparent to the end-user, but the
main motivations to eliminate the mixing operation and switch rather
than mix are:
o Lower processing requirements in the mixer.
o Lower complexity in the mixer.
o Higher media quality at the receiver given a certain media
bitrate.
o Lower end-to-end media delay.
Without the mixing operation, the mixer has limited ability to create
conceptual media sources that are customized for each receiver. The
need for such customization comes from sender and receiver
differences in available resources and preferences:
o Presenting multiple conference users simultaneously, like in a
video mosaic.
o Alignment of sent media quality to receivers presentation needs.
o Alignment of codec type and configuration between sender and
receiver.
o Alignment of encoded bitrate to the available end-to-end link
bandwidth.
To enable elimination of the mixing operation, media sent to the
mixer must meet the above constraints sufficiently well for all
intended receivers. There are several ways to achieve this. One way
is to, by some system-wide design, ensure that all senders and
receivers are basically identical in all the above aspects. This may
however prove unrealistic when variations in conditions and end-
points are too large. Another way is to let a sender provide a
(small) set of alternative representations for each sent media
source, enough to sufficiently well cover the expected range of
variation. If those media source representations, encodings, are
independent from one another, they constitute a Simulcast of the
media source. If an encoding is instead dependent on and thus
requires reception of one or more other encodings, the representation
of the media source jointly achieved by all dependent encodings is
said to be Scalable. Simulcast and Scalable encoding can also be
combined.
Both Simulcast and Scalable encodings result in a single media source
generating multiple RTP media streams of the same media type. The
division of bandwidth between the Simulcast or Scalable streams for a
single media source is application specific and will vary. The total
bandwidth for a Simulcast or Scalable source is the sum of the
bandwidths of all included RTP media streams. Since all streams in a
Simulcast or Scalable source originate from the same capture device,
they are closely related and should thus share a synchronization
context.
The first and second customizations listed above, presenting multiple
conference users simultaneously and aligned with the presentation
needs of the receiver, can also be achieved without a mixing
operation by simply sending appropriate-quality media from those
users individually to each receiver. The total bandwidth of this
user presentation aggregate is the sum of the bandwidths of all
included RTP media streams.
Audio and video from a single user share synchronization context and
can be synchronized. Streams that originate from different users do
not have the same synchronization context, which is acceptable since
they do not need to be synchronized, but just presented jointly.
An actual mixer device need not be either mixing-only or
switching-only, but may implement both mixing and switching, and may
also choose dynamically what to do for a specific media type and a
specific receiving user, on a case-by-case basis or based on some
policy.
3.2. WebRTC
This section brings up two different instantiations of WebRTC
[ref-webrtc10] that stress different aspects, but let us start by
reviewing some important aspects of WebRTC and the MediaStream
[ref-media-capture] API.
In WebRTC, an application gets access to a media source by calling
getUserMedia(), which creates a MediaStream [ref-media-capture] (note
the capitalization). A MediaStream consists of zero or more
MediaStreamTracks, where each MediaStreamTrack is associated with a
media source. These locally generated MediaStreams and their tracks
are connected to local media sources, which can be media devices such
as video cameras or microphones, but can also be files.
A WebRTC PeerConnection (PC) is an association between two endpoints
that is capable of communicating media from one end to the other.
The PC concept includes establishment procedures, including media
negotiation. Thus a PC is an instantiation of a Multimedia Session.
When one end-point adds a MediaStream to a PC, the other endpoint
will by default receive an encoded representation of the MediaStream
and its active MediaStreamTracks.
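As a hedged illustration of this flow, the following TypeScript
sketch uses the callback-based getUserMedia() and the stream-level
addStream() API as specified in the drafts current at the time of
writing; the constraints and configuration objects are example
values:
   // Request access to one audio and one video media source. The
   // resulting MediaStream holds one MediaStreamTrack per granted
   // media source.
   navigator.getUserMedia(
     { audio: true, video: true },
     (stream: MediaStream) => {
       // Adding the MediaStream to a PeerConnection makes the remote
       // endpoint receive an encoded representation of each active
       // MediaStreamTrack, within the negotiated boundaries.
       const pc = new RTCPeerConnection({ iceServers: [] });
       pc.addStream(stream);
     },
     (error) => console.error("getUserMedia failed:", error));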
3.2.1. Mesh-based Multi-party
This is a use case of WebRTC that establishes a multi-party
communication session by setting up an individual PC between each
pair of participants in the communication session.
+---+ +---+
| A |<---->| B |
+---+ +---+
^ ^
\ /
\ /
v v
+---+
| C |
+---+
Figure 3: WebRTC Mesh-based Multi-party
Users A, B and C want to have a joint communication session. This
communication session is created using a Web-application without any
central conference functionality. Instead, it uses a mesh of
PeerConnections to connect each participant's endpoint with the other
endpoints. In this example, three double-ended connections are
required to connect the three participants, and each endpoint has two
PCs.
This is an audio and video communication and each end-point has one
video camera and one microphone as media sources. Each endpoint
creates its own MediaStream with one video MediaStreamTrack and one
audio MediaStreamTrack. The endpoints add their MediaStream to both
of their PCs.
Let's now focus on a single PC; in this case the one established
between A and B. During the establishment of this PC, the two
endpoints agree to use only a single transport flow for all media
types, thus a single RTP session is created between A and B. A's
MediaStream has one audio media source that is encoded according to
the boundaries established by the PeerConnection establishment
signalling, which includes the RTP payload types and thus Codecs
supported as well as bit-rate boundaries. The encoding of A's media
source is then sent in an RTP stream identified by a unique SSRC. In
this case, as there are two media sources at A, two encodings will be
created which will be transmitted using two different RTP streams
with their respective SSRC. Both these streams will reference the
same synchronization context through a common CNAME identifier used
by A. B will have the same configuration, resulting in at least four
SSRCs being used in the RTP session that is part of the A-B PC.
Depending on the configuration of the two PCs that A has, i.e. the
A-B and the A-C ones, A could potentially reuse the encoding of a
media source in both contexts, under certain conditions. First, a
common codec and configuration needs to exist, and the boundaries for
these configurations must allow a common working point. In addition,
the required bandwidth capacity needs to be available over the paths
used by the different PCs. Neither of those conditions will always
hold, so it is quite likely that the endpoint will sometimes instead
be required to produce two different encodings of the same media
source.
If an application needs to reference the media from a particular
endpoint, it can use the MediaStream and MediaStreamTrack as they
point back to the media sources at a particular endpoint. This is
because the MediaStream has a scope that is not PeerConnection
specific.
The programmer can however implement this differently while
supporting the same use case. In this case the programmer creates
two MediaStreams that each have MediaStreamTracks that share common
media sources. This can be done either by calling getUserMedia()
twice, or by cloning the MediaStream obtained from a single
getUserMedia() call. In this example the result is two MediaStreams
that are connected to different PCs. From an identity perspective,
the two MediaStreams are different but share common media sources.
This fact is currently not made explicit in the API.
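A sketch of the two variants, under the same API assumptions as
above (localStream is the MediaStream obtained from getUserMedia(),
and clone() is as in the Media Capture drafts); an application would
use one variant or the other:
   // Variant 1: the same MediaStream object is added to both PCs.
   // The MediaStream and MediaStreamTrack ids seen by the peers are
   // then identical.
   pcToB.addStream(localStream);
   pcToC.addStream(localStream);
   // Variant 2: a second MediaStream sharing the same media sources
   // is created by cloning. Its id differs from localStream's id,
   // and nothing in the API reveals that the sources are shared.
   const clonedStream = localStream.clone();
   pcToB.addStream(localStream);
   pcToC.addStream(clonedStream);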
3.2.2. Multi-source Endpoints
This section concerns itself with endpoints that have more than one
media source for a particular media type. A straightforward example
would be a laptop with a built-in video camera used to capture the
user and a second video camera, for example attached by USB, that is
used to capture something else the user wants to show. Both these
cameras are typically present in the same sound field, so it will be
common to have only a single audio media source.
A possible way of representing this is to have two MediaStreams, one
with the built-in camera and the audio, and a second one with the USB
camera and the audio. Each MediaStream is intended to be played with
audio and video synchronized, but the user (local or remote) or
application is likely to switch between the two captures.
It is important that a receiving endpoint can determine that the
audio in the two MediaStreams has the same synchronization context.
Otherwise, a receiver may play back the same media source twice, with
some time overlap, when switching between the two MediaStreams.
Being able to determine that they are the same media source also
allows removing redundancy by using a single encoding, if
appropriate, for both MediaStreamTracks.
3.2.3. Media Relaying
WebRTC endpoints can relay a received MediaStream from one PC to
another by the simple API level maneuver of adding the received
MediaStream to the other PC. Realizing this in the implementation is
more complex, and it can also cause some issues from a media
perspective. If an application spanning multiple endpoints that
relay media between each other makes a mistake, a media loop can be
created. Media loops could become a significant issue. For example,
an audio echo could be created, i.e., an endpoint receives its own
media without detecting that it is its own, and plays it back with
some delay. If a WebRTC endpoint produces a conceptual media source
by mixing incoming MediaStreams and there is no loop detection, a
feedback loop can be created.
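The following sketch, using the stream-level event of the API drafts
of this period, shows how easily such relaying, and thus a potential
loop, is expressed; pcA and pcB are assumed to be two established
PeerConnections:
   // Relay whatever MediaStream arrives on pcA straight out on pcB.
   // Without a media source identifier that survives the relay, an
   // endpoint has no way to detect that a stream it later receives
   // is its own media coming back.
   pcA.onaddstream = (event) => {
     pcB.addStream(event.stream);
   };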
RTP has loop detection to detect and handle such cases within a
single RTP session. However, in the context of WebRTC, the RTP
session is local to the PC, so one cannot rely on the RTP-level loop
detection. Instead, if this protection is needed on the WebRTC
MediaStream level, it could for example be achieved by having media
source identifiers that can be preserved between the different
MediaStreams in the PCs.
When relaying media, it is beneficial to know whether one receives
multiple encodings of the same source. For example, if one encoding
arrives with a delay of 80 ms and another with 450 ms, performance is
improved by being able to choose the one with 80 ms delay rather than
being forced to delay all media sources from the same synchronization
context to match the most delayed source.
3.2.4. Usage of Simulcast
In this section we look at a use case applying simulcast from each
user's endpoint to a central conference node to avoid the need for an
individual encoding for each receiving endpoint. Instead, the
central node chooses which of the available encodings is forwarded to
a particular receiver, like in Section 3.1.3.
+-----------+ +------------+ Enc2 +---+
| A +-Enc1|----->| |----->| B |
| | | | | +---+
| Src-+-Enc2|----->| | Enc1 +---+
+-----------+ | Mixer |----->| C |
| | +---+
| | Enc2 +---+
| |----->| D |
+------------+ +---+
Figure 4
In this Communication Session there are four users with endpoints and
one middlebox (the Mixer). This is an audio and video communication
session. The audio source is not simulcasted and the endpoint only
needs to produce a single encoding. For the video source, each
endpoint will produce multiple encodings (Enc1 and Enc2 in Figure 4)
and transfer them simultaneously to the mixer. The mixer picks the
most appropriate encoding for the path from the mixer to each
receiving client.
Currently there exists no specified way in WebRTC to realize the
above, although use cases and requirements discuss simulcast
functionality. The authors believe there are two possible solution
alternatives in the WebRTC context:
Multiple Encodings within a PeerConnection: The endpoint that wants
to provide a simulcast creates one or more MediaStreams with the
media sources it wants to transmit over a particular PC. The
WebRTC API would provide functionality to enable multiple
encodings to be produced for a particular MediaStreamTrack, with
the possibility to configure the desired quality levels and/or
differences for each of the encodings.
Using Multiple PeerConnections: There exist capabilities to both
negotiate and control the codec, bit-rate, video resolution,
frame-rate, etc. of a particular MediaStreamTrack in the context of
one PeerConnection. Thus one method to provide multiple encodings
is to establish multiple PeerConnections between A and the Mixer,
where each PC is configured to provide the desired quality. Note
that this solution comes in two flavors from an application
perspective. One is that the same MediaStream object is added to
the two PeerConnections. The second is that two different
MediaStream objects, with the same number of MediaStreamTracks and
representing the same sources, are created (e.g. by cloning), one
of them added to the first PeerConnection and the second one to
the second PeerConnection.
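A sketch of the second alternative, under the same API assumptions
as before; how the per-PC quality bound would be expressed is exactly
the open negotiation issue discussed in this document, so it is
indicated only in comments:
   // Two PCs towards the mixer, each intended to carry one encoding
   // of the same media sources.
   pcHigh.addStream(localStream);         // negotiated towards, e.g., 720p
   pcLow.addStream(localStream.clone());  // negotiated towards, e.g., 180p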
Both of these solutions share a common requirement: the need to
separate the received RTP streams not only based on the media source,
but also on the encoding. However, on an API level the solutions
appear different. For Multiple Encodings within the context of a PC,
the receiver will need new methods for accessing and manipulating the
different encodings. Using multiple PCs instead requires that one
can easily determine the shared (simulcasted) media source despite
receiving it in multiple MediaStreams on different PCs. If the same
MediaStream is added to both PCs, the ids of the MediaStream and its
MediaStreamTracks will be the same, while they will differ if
different MediaStreams (representing the same sources) are added to
the two PCs.
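The following sketch illustrates the resulting identification
problem, with MediaStream.id as in the Media Capture drafts:
   // Works only if the very same MediaStream object was added to
   // both PCs; for cloned MediaStreams the ids differ even though
   // the underlying media sources are the same.
   function sameStreamId(fromPc1: MediaStream, fromPc2: MediaStream) {
     return fromPc1.id === fromPc2.id;
   }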
3.3. CLUE Telepresence
The CLUE framework [I-D.ietf-clue-framework] and use case
[I-D.ietf-clue-telepresence-use-cases] documents make use of most, if
not all, of the media concepts already discussed in previous
sections, and add a few more.
3.3.1. Telepresence Functionality
A communicating CLUE Endpoint can, compared to other types of
Endpoints, be characterized by using multiple media resources:
o Multiple capture devices, such as cameras or microphones,
generating the media for a media source.
o Multiple render devices, such as displays or speakers.
o Multiple Media Types, such as audio, video and presentation
streams.
o Multiple remote Endpoints, since conferencing is a typical use case.
o Multiple Encodings (encoded representations) of a media source.
o Multiple Media Streams representing multiple media sources.
To make the multitude of resources more manageable, CLUE introduces
some additional structures. For example, related media sources in a
multimedia session are grouped into Scenes, which can generally be
represented in different ways, described by alternative Scene
Entries. CLUE explicitly separates the concept of a media source
from its encoded representations, and a single media source can be
used to create multiple Encodings. It is also possible in CLUE to
account for constraints in resource handling, like limitations in
possible Encoding combinations due to physical device implementation.
The number of media resources typically differs between Endpoints.
Specifically, the number of available media resources of a certain
type used for sending at the sender side typically does not match the
number of corresponding media resources used for receiving at the
receiver side. Some selection process must thus be applied either at
the sender or the receiver to select a subset of resources to be
used. Hence, each resource that needs to be part of that selection
process must have some identification and characterization that can
be understood by the selecting party. In the CLUE model, the sender
(Provider) announces available resources and the receiver (Consumer)
chooses what to receive. This choice is made independently in the
two directions of a bi-directional communication.
3.3.2. Distributed Endpoint
The definition of a single CLUE Endpoint in the framework
[I-D.ietf-clue-framework] says it can consist of several physical
devices with source and sink media streams. This means that each
logical node of such a distributed Endpoint can have a separate
transport interface, and thus that media sources originating from the
same Endpoint can have different transport addresses.
4. Discussion
This section discusses some conclusions the authors make based on the
use cases. First, we discuss commonalities between use cases.
Second, we provide a summary of issues we see affecting WebRTC.
Lastly, we consider aspects that need to be taken into account in the
ongoing SDP evolution.
4.1. Commonalities in Use Cases
The above use cases illustrate a couple of concepts that are neither
well defined nor backed by fully specified standard mechanisms or
behaviors. This section contains a discussion of such concepts,
which the authors believe are useful in more than one context and
thus should be defined to provide a common function when needed by
multi-media communication applications.
4.1.1. Media Source
In several of the above use cases there exists a need to separate the
media source, the particular encoding, and its transport stream. In
vanilla RTP there exists a one-to-one mapping between
these; one media source is encoded in one particular way and
transported as one RTP stream using a single SSRC in a particular RTP
session.
The reason for not keeping a strict one-to-one mapping, allowing the
media source to be identified separately from the RTP media stream
(SSRC), varies depending on the application's needs and the desired
functionalities:
Simulcast: Simulcast is a functionality to provide multiple
simultaneous encodings of the same media source. As each encoding
is independent of the others, in contrast to scalable encoding, an
independent transport stream is needed for each encoding. The
receiver of a simulcast stream will need to be able to explicitly
identify each encoding upon reception, as well as which media
source it is an encoding of. This is especially important in a
context of multiple media sources being provided from the same
endpoint.
Mesh-based communication: When a communication application
implements multi-party communication through a mesh of transport
flows, there exists a need for tracking the original media source,
especially when relaying between nodes is possible. It is likely
that the encodings provided over the different transports are
different. If an application uses relaying between different
transports, an endpoint may, intentionally or not, receive
multiple encodings of the same media source over the same or
different transports. Some applications can handle the needed
identification, but some can benefit from a standardized method to
identify sources.
The second argument above can be generalized into a common need in
applications that utilize multiple multimedia sessions, such as
multiple PeerConnections or multiple SIP/SDP-established RTP
sessions, to form a larger communication session between multiple
endpoints. These applications commonly need to track media sources
that occur in more than one multimedia session.
Looking at both CLUE and WebRTC, they appear to contain their own
variants of the concept that was above denoted a media source. In
CLUE it is called Media Capture. In WebRTC each MediaStreamTrack is
identifiable, however, several MediaStreamTracks can share the actual
source, and there is no way for the application to realize this
currently. The identification of sources is being discussed, and
there is a proposal [ref-leithead] that introduces the concept 'Track
Source'. Thus, in this document we see the media source as the
generalized commonality between these two concepts. Giving each
media source a unique identifier in the communication
session/context, reused in all the PeerConnections or
SIP/SDP-established RTP sessions, would enable loop detection, allow
alternative encodings to be correctly associated, and provide a
common name across the endpoints for application logic to reference
the actual media source rather than a
particular encoding or transport stream.
It is arguable whether the application should really know a long-term
persistent source identification, such as one based on hardware
identities, for example due to fingerprinting issues. It would
likely be better to use an anonymous identification that is still
unique in a sufficiently wide context, for example within the
communication application instance.
4.1.2. Encodings
An Encoding is a particular encoded representation of a particular
media source. In the context of RTP and Signalling, a particular
encoding must fit the established parameters, such as RTP payload
types, media bandwidths, and other more or less codec-specific media
constraints such as resolution, frame-rate, fidelity, audio
bandwidth, etc.
In the context of an application, it appears that there are primarily
two considerations around the use of multiple encodings.
The first is how many encodings there are and what their defining
parameters are. This may need to be negotiated, something that
existing signalling solutions, like SDP, currently lack support for.
For example, in SDP there exists no way to express that you would
like to receive three different encodings of a particular video
source. In addition, if you for example prefer these three encodings
to be 720p/25 Hz, 360p/25 Hz and 180p/12.5 Hz, then even if you could
define RTP payload types with these constraints, they must still be
linked to the RTP streams carrying the encodings of the particular
source. Also, for some RTP payload types it is difficult to express
encoding characteristics with the desired granularity. The number of
RTP payload types that can be used for a particular potential
encoding can also be a constraint, especially as a single RTP payload
type could well be used for all three target resolutions and frame
rates in the example. Using multiple encodings might even be
desirable for multi-party conferences that switch video, rather than
composite and re-encode it. It might be that SDP is not the most
suitable place to negotiate this. From an application perspective,
utilizing clients that have standardized APIs or protocols to control
them, there exists a need for the application to express its
preferred number of encodings as well as their primary target
parameters.
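To make the linking problem concrete, the following illustrative,
non-normative SDP fragment uses the a=imageattr attribute of RFC 6236
to constrain three payload types to the three example resolutions;
even so, nothing in it ties each payload type to a specific RTP
stream (SSRC) of the particular source:
   m=video 49172 RTP/AVP 96 97 98
   a=rtpmap:96 VP8/90000
   a=imageattr:96 recv [x=1280,y=720]
   a=rtpmap:97 VP8/90000
   a=imageattr:97 recv [x=640,y=360]
   a=rtpmap:98 VP8/90000
   a=imageattr:98 recv [x=320,y=180]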
Secondly, some applications may need explicit indication of what
encoding a particular stream represents. In some cases this can be
deduced based on information such as RTP payload types and parameters
received in the media stream, but such implicit information will not
always be detailed enough and it may also be time-consuming to
extract. For example, SDP currently has limitations in binding the
relevant information about a particular encoding to the corresponding
RTP stream, unless only a single RTP stream is defined per media
description (m= line).
The CLUE framework explicitly discusses encodings as constraints that
are applied when transforming a media source (capture) into what CLUE
calls a capture encoding. This includes both explicit identification
and a set of boundary parameters such as maximum width, height, frame
rate, and bandwidth. In WebRTC nothing related has yet been defined,
and we note this as an issue that needs to be resolved, as the
authors expect that support for multiple encodings will be required
to enable simulcast and scalability.
4.1.3. Synchronization contexts
The shortcomings around synchronization contexts appear rather
limited. In RTP, each RTP media stream is associated with a
particular synchronization context through the CNAME session
description item. The main concerns here are likely twofold.
The first concern is to avoid unnecessary creation of new contexts,
and rather correctly associate with the contexts that actually exist.
For example, WebRTC MediaStreams are defined so that all
MediaStreamTracks within a particular MediaStream shall be
synchronized. An easy method for meeting this would be to assign a
new CNAME for each MediaStream. However, that would ignore the fact
that several media sources from the same synchronization context may
appear in different combinations across several MediaStreams. Thus
all these MediaStreams should share synchronization context to avoid
playback glitches, like playing back different instantiations of a
single media source out of sync because the media source was shared
between two different MediaStreams.
The second problem is that synchronization context identification in
RTP, i.e. CNAME, is overloaded as an endpoint identifier. As an
example, consider an endpoint that has two synchronization contexts;
one for audio and video in the room and another for an audio and
video presentation stream, like the output of a DVD player. Relying
on an endpoint having only a single synchronization context and CNAME
may be incorrect, and could create issues that application designers
as well as RTP and signalling extension specifications need to watch
out for.
CLUE so far says rather little about synchronization, but clearly
intends to enable lip synchronization between captures that have that
relation. The second issue is, however, quite likely to be
encountered in CLUE due to the explicit inclusion of the Scene
concept: different Scenes are not required to share the same
synchronization context, and the concept is rather intended for
situations where Scenes cannot share a synchronization context.
4.1.4. Distributed Endpoints
When an endpoint consists of multiple nodes, the added complexity is
often local to that endpoint, which is appropriate. However, a few
properties of distributed endpoints need to be tolerated by all
entities in a multimedia communication session. The main item is not
to assume that a single endpoint will only use a single network
address. This is a dangerous assumption even for non-distributed
endpoints, due to multi-homing and the common deployment of NATs,
especially large-scale NATs, which in the worst case use multiple
addresses for a single endpoint's transport flows.
Distributed endpoints are brought up in the CLUE context. They are
not specifically discussed in the WebRTC context; there, the desire
for transport-level aggregation makes such endpoints problematic.
However, WebRTC does allow for a fallback to media-type-specific
transport flows and can thus support distributed endpoints without
issues.
4.2. Identified WebRTC issues
In the process of identifying commonalities and differences between
the different use cases, we have found what appear to us to be issues
in the current specification of WebRTC that need to be reviewed:
1. If simulcast or scalability are to be supported at all, the
WebRTC API will need to find a method to deal more explicitly
with the existence of different encodings and how these are
configured, accessed and referenced. For simulcast, the authors
see a quite straightforward solution where each PeerConnection is
only allowed to contain a single encoding for a specific media
source and the desired quality level can be negotiated for the
full PeerConnection. When multiple encodings are desired,
multiple PeerConnections with differences in configuration are
established. That would only require that the underlying media
source can explicitly be indicated and tracked by the receiver.
2. The current API structure allows having multiple MediaStreams
with fully or partially overlapping media sources. Combined with
multiple PeerConnections and the likely possibility of relaying,
there appears to be a significant need to determine the underlying
media source, despite receiving different MediaStreams with
particular media sources encoded in different ways. It is proposed
that media sources be made uniquely identifiable across multiple
PeerConnections in the context of the communication application. It
is however likely that, while being unique in a sufficiently large
context, the identification should also be anonymous to avoid
fingerprinting issues, similar to the situation discussed in
Section 4.1.1.
3. Implementations of the MediaStream API must be careful in how
they name and deal with synchronization contexts, so that the
actual underlying synchronization context is preserved when
possible. It should be noted that this cannot be done when a
MediaStream is created that contains media sources from multiple
synchronization contexts. Such a case will instead require
resynchronization of the contributing sources, creation of a new
synchronization context, and insertion of the sources into that new
synchronization context.
These issues need to be discussed and an appropriate way to resolve
them must be chosen.
4.3. Relevant to SDP evolution
The joint MMUSIC / RTCWeb WG interim meeting in February 2013 will
discuss a number of SDP-related issues around the handling of
multiple sources: the aggregation of multiple media types over the
same RTP session, as well as RTP sharing its transport flow not only
with ICE/STUN but also with the WebRTC data channel using
SCTP/DTLS/UDP. These issues will potentially result in a significant
impact on SDP. They may also impact other ongoing work as well as
existing usages and applications, making these discussions difficult.
The above use cases and discussion point to the existence of a number
of commonalities between WebRTC and CLUE, and suggest that a solution
should preferably be usable by both. It is a very open question how
much functionality CLUE requires from SDP, as the CLUE WG plans to
develop a protocol with a different usage model. The appropriate
division of functionality between SDP and this protocol is currently
unknown.
Based on this document, it is possible to express some protocol
requirements for negotiating multimedia sessions and their media
configurations. Note that these are written as requirements to
consider, given that one believes this functionality is needed in
SDP.
The Requirements:
Encoding negotiation: For Simulcast and Scalability in applications,
it must be possible to negotiate the number and the boundary
conditions for the desired encodings created from a particular
media source.
Media Resource Identification: SDP-based applications that need
explicit information about media sources, multiple encodings and
their related RTP media streams could benefit from a common way of
providing this information. This need can result in multiple
different actual requirements. Some applications require a common,
explicit identification of media sources across multiple signalling
contexts. Some may require an explicit indication of which set of
encodings has the same media source, and thus which set of RTP media
streams (SSRCs) is related to a particular media source.
RTP media stream parameters: With a greater heterogeneity of the
possible encodings and their boundary conditions, situations may
arise where individual RTP media streams, or sets of them, will need
to have specific sets of parameters associated with them, different
from those of other (sets of) RTP media streams.
The above are general requirements and in some cases the appropriate
point to address the requirement may not even be SDP. For example,
media source identification could primarily be put in an RTCP Session
Description (SDES) item, and only when so required by the application
also be included in the signalling.
The discussion in this document has an impact on the high-level
decision regarding how to relate RTP media streams to SDP media
descriptions. However, as the document currently presents concepts
rather than concrete proposals for how to enable these concepts as
extensions to SDP or other protocols, it is difficult to determine
the actual impact that a high-level solution will have. Still, the
authors are convinced that neither of the directions will prevent the
definition of suitable concepts in SDP.
5. IANA Considerations
This document makes no request of IANA.
Note to RFC Editor: this section may be removed on publication as an
RFC.
6. Security Considerations
The realization of the proposed concepts and their resolution will
have security considerations. However, at this stage it is unclear
whether any of them go beyond the already common considerations of
preserving privacy and confidentiality and ensuring integrity, in
order to prevent denial of service or quality degradation.
7. Informative References
[I-D.ietf-avtcore-multi-media-rtp-session]
Westerlund, M., Perkins, C., and J. Lennox, "Multiple
Media Types in an RTP Session",
draft-ietf-avtcore-multi-media-rtp-session-01 (work in
progress), October 2012.
[I-D.ietf-clue-framework]
Duckworth, M., Pepperell, A., and S. Wenger, "Framework
for Telepresence Multi-Streams",
draft-ietf-clue-framework-08 (work in progress),
December 2012.
[I-D.ietf-clue-telepresence-use-cases]
Romanow, A., Botzko, S., Duckworth, M., Even, R., and I.
Communications, "Use Cases for Telepresence Multi-
streams", draft-ietf-clue-telepresence-use-cases-04 (work
in progress), August 2012.
[I-D.ietf-mmusic-sdp-bundle-negotiation]
Holmberg, C. and H. Alvestrand, "Multiplexing Negotiation
Using Session Description Protocol (SDP) Port Numbers",
draft-ietf-mmusic-sdp-bundle-negotiation-01 (work in
progress), August 2012.
[I-D.ietf-rtcweb-overview]
Alvestrand, H., "Overview: Real Time Protocols for Browser-
based Applications", draft-ietf-rtcweb-overview-05 (work
in progress), December 2012.
[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
A., Peterson, J., Sparks, R., Handley, M., and E.
Schooler, "SIP: Session Initiation Protocol", RFC 3261,
June 2002.
[RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
with Session Description Protocol (SDP)", RFC 3264,
June 2002.
[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
Jacobson, "RTP: A Transport Protocol for Real-Time
Applications", STD 64, RFC 3550, July 2003.
[RFC6465] Ivov, E., Marocco, E., and J. Lennox, "A Real-time
Transport Protocol (RTP) Header Extension for Mixer-to-
Client Audio Level Indication", RFC 6465, December 2011.
[ref-leithead]
Microsoft, "Proposal: Media Capture and Streams Settings
API v6", https://dvcs.w3.org/hg/dap/raw-file/tip/
media-stream-capture/proposals/
SettingsAPI_proposal_v6.html, December 2012.
[ref-media-capture]
"Media Capture and Streams",
http://dev.w3.org/2011/webrtc/editor/getusermedia.html,
December 2012.
[ref-webrtc10]
"WebRTC 1.0: Real-time Communication Between Browsers",
http://dev.w3.org/2011/webrtc/editor/webrtc.html,
January 2013.
Authors' Addresses
Bo Burman
Ericsson
Farogatan 6
SE-164 80 Kista
Sweden
Phone: +46 10 714 13 11
Email: bo.burman@ericsson.com
Magnus Westerlund
Ericsson
Farogatan 6
SE-164 80 Kista
Sweden
Phone: +46 10 714 82 87
Email: magnus.westerlund@ericsson.com