CLUE C. Groves, Ed.
Internet-Draft W. Yang
Intended status: Informational R. Even
Expires: February 13, 2014 Huawei
August 12, 2013
Describing Captures in CLUE and relation to multipoint conferencing
draft-groves-clue-multi-content-00
Abstract
In a multipoint Telepresence conference, there are more than two
sites participating. Additional complexity is required to enable
media streams from each participant to show up on the displays of the
other participants. Common policies to address the multipoint case
include "site-switch" and "segment-switch". This document
discusses these policies, as well as the "composed" policy, and how
they work in the multipoint case.
The current CLUE framework document contains the "composed" and
"switched" attributes to describe situations where a capture is a mix
or composition of streams or where the capture represents a dynamic
subset of streams. "Composed" and "switched" are capture level
attributes. In addition to these attributes the framework defines an
attribute "Scene-switch-policy" on a capture scene entry (CSE) level
which indicates how the captures are switched.
This draft discusses composition/switching in CLUE and makes a number
of proposals to better define and support these capabilities.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on February 13, 2014.
Groves, et al. Expires February 13, 2014 [Page 1]
Internet-Draft Abbreviated Title August 2013
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1. Role of an MCU in a multipoint conference . . . . . . . . 4
2.2. Relation to scene . . . . . . . . . . . . . . . . . . . . 6
2.3. Description of the contents of a switched/composed
capture . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4. Attribute interaction . . . . . . . . . . . . . . . . . . 7
2.5. Policy . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6. Media stream composition and encodings . . . . . . . . . 8
2.7. Relation of switched captures to simultaneous
transmission sets . . . . . . . . . . . . . . . . . . . . 9
2.8. Conveying spatial information for switched/composed
captures . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9. Consumer selection . . . . . . . . . . . . . . . . . . . 10
3. Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1. CLUE Syntax Updates . . . . . . . . . . . . . . . . . . . 11
3.1.1. Definitions . . . . . . . . . . . . . . . . . . . . . 12
3.1.2. Multiple Content Capture Details . . . . . . . . . . 12
3.1.3. MCC Attributes . . . . . . . . . . . . . . . . . . . 13
3.1.4. MaxCaptures Attribute . . . . . . . . . . . . 13
3.1.5. Composition policy . . . . . . . . . . . . . . . . . 14
3.1.6. Synchronisation . . . . . . . . . . . . . . . . . . . 14
3.1.7. MCC and encodings . . . . . . . . . . . . . . . . . . 15
3.1.8. MCCs and STSs . . . . . . . . . . . . . . . . . . . . 16
3.1.9. Consumer Behaviour . . . . . . . . . . . . . . . . . 16
3.1.10. MCU Behaviour . . . . . . . . . . . . . . . . . . . . 17
3.1.10.1. Single content captures and multiple contents
capture in the same Advertisement . . . . . . . 17
3.1.10.2. Several multiple content captures in the same
Advertisement . . . . . . . . . . . . . . . . . 18
3.2. Multipoint Conferencing Framework Updates . . . . . . . . 19
3.3. Existing Parameter Updates . . . . . . . . . . . . . . . 20
3.3.1. Composed . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2. Switched . . . . . . . . . . . . . . . . . . . . . . 21
3.3.3. Scene-switch-policy . . . . . . . . . . . . . . . . . 22
3.3.4. MCU behaviour . . . . . . . . . . . . . . . . . . . . 24
4. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 24
5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 24
6. Security Considerations . . . . . . . . . . . . . . . . . . . 25
7. References . . . . . . . . . . . . . . . . . . . . . . . . . 25
7.1. Normative References . . . . . . . . . . . . . . . . . . 25
7.2. Informative References . . . . . . . . . . . . . . . . . 25
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 26
1. Introduction
One major objective for Telepresence is to be able to preserve the
"Being there" user experience. However, in multi-site conferences it
is often (in fact usually) not possible to simultaneously provide
full size video, eye contact, common perception of gestures and gaze
by all participants. Several policies can be used for stream
distribution and display: all provide good results but they all make
different compromises.
The policies are described in [I-D.ietf-clue-telepresence-use-cases].
The CLUE telepresence requirements document has the following
requirement:
REQMT-14: The solution MUST support mechanisms to make possible for
either or both site switching or segment switching. [Edt:
This needs rewording. Deferred until layout discussion is
resolved.]
The policies described in the use case draft include the site-switch,
segment-switch and composed policies.
Site switching is described in the CLUE use cases document: "One common policy is
called site switching. Let's say the speaker is at site A and
everyone else is at a "remote" site. When the room at site A shown,
all the camera images from site A are forwarded to the remote sites.
Therefore at each receiving remote site, all the screens display
camera images from site A. This can be used to preserve full size
image display, and also provide full visual context of the displayed
far end, site A. In site switching, there is a fixed relation
between the cameras in each room and the displays in remote rooms.
The room or participants being shown is switched from time to time
based on who is speaking or by manual control, e.g., from site A to
site B."
These policies are mirrored in the framework document through a
number of attributes.
Currently in the CLUE framework document [I-D.ietf-clue-framework]
there are two media capture attributes: Composed and Switched.
Composed is defined as:
A field with a Boolean value which indicates whether or not
the Media Capture is a mix (audio) or composition (video) of
streams.
This attribute is useful for a media consumer to avoid
nesting a composed video capture into another composed
capture or rendering. This attribute is not intended to
describe the layout a media provider uses when composing
video streams.
Switched is defined as:
A field with a Boolean value which indicates whether or not
the Media Capture represents the (dynamic) most appropriate
subset of a 'whole'. What is 'most appropriate' is up to the
provider and could be the active speaker, a lecturer or a
VIP.
There is also a Capture Scene Entry (CSE) attribute "scene switch
policy" defined as:
A media provider uses this scene-switch-policy attribute to
indicate its support for different switching policies.
2. Issues
This section discusses a number of issues in the current framework
around the support of switched/composed captures and media streams
when considering multipoint conferencing. Some issues are more
required functions and some are related to the current description in
the framework document.
2.1. Role of an MCU in a multipoint conference
In a multipoint conference there is a central control point (MCU).
The MCU will have the CLUE advertisements from all the conference
participants and will prepare and send advertisements to all the
conference participants. The MCU will also have more information
about the conference, participants and media which it receives at
conference creation and via call signalling. This data is not stable
since each user who joins or leaves the conference causes a change in
conference state. An MCU supporting SIP may utilise the Conference
event package, XCON and CCMP to maintain and distribute conference
state.
[RFC4575] defines a conference event package. Using the event
framework notifications are sent about changes in the membership of
this conference and optionally about changes in the state of
additional conference components. The conference information is
composed of the conference description, host information, conference
state, and users that have endpoints, where each endpoint includes the
media description.
[RFC6501] extends the conference event package and tries to be
signalling protocol agnostic. RFC6501 adds new elements but also
provides values for some of the elements defined in RFC4575; for
example, it defines roles (such as "administrator", "moderator",
"user", "participant", "observer", and "none").
[RFC6503] Centralized Conferencing Manipulation Protocol (CCMP)
allows authenticated and authorized users to create, manipulate, and
delete conference objects. Operations on conferences include adding
and removing participants, changing their roles, as well as adding
and removing media streams and associated endpoints.
CCMP implements the client-server model within the XCON framework,
with the conferencing client and conference server acting as client
and server, respectively. CCMP uses HTTP as the protocol to transfer
requests and responses, which contain the domain-specific XML-encoded
data objects defined in [RFC6501] "Conference Information Data Model
for Centralized Conferencing (XCON)".
The XCON data model and CCMP provide a generic way to create and
control conferences. CCMP is not SIP specific, but a SIP endpoint
will subscribe to the conference event package to get information
about changes in the conference state.
Therefore, when an MCU implements the above protocols there will be
an interaction between any CLUE state and the state within the
conferencing framework. For example, if an endpoint leaves a
conference, the MCU may need to indicate via CLUE to the other
endpoints that the endpoint's captures are no longer available, and
it would also need to indicate via the conferencing framework that
the endpoint is no longer part of the conference.
The question is how these concepts relate, as the conferencing
framework does not have the concept of captures or scenes. Other
aspects overlap; for example:
The conference framework has "available media"; CLUE has
encodings to indicate codecs.
The conference framework has "users"; CLUE has no concept of
users, although it has capture attributes that relate to the
users in a capture.
It is noted that point-to-point calls may not implement the conferencing
framework. It is desirable that CLUE procedures be the same whether
an endpoint is communicating with a peer endpoint or an MCU.
2.2. Relation to scene
One of the early justifications for switching / composition was the
ability to switch between sites. When looking at the CLUE framework
there is no concept of "site" in the CLUE hierarchy. The closest
concept is an "endpoint" but this has no identity within the CLUE
syntax. The highest level is the "clueInfo" that includes
captureScenes and an endpoint may have multiple capture scenes.
If the switched and composed attributes are specified at a capture
level it is not clear what the correlation is between the capture and
the endpoint / scenes, particularly when the attributes are described
in the context of sites. A scene may be composed of multiple
captures. Where an MCU is involved in a conference with multiple
endpoints, multiple capture scenes are involved. It becomes
difficult to map all the scenes and capture information from the
source endpoints into one capture scene sent to an endpoint.
Discussion of switching, composition et al. needs to be described in
terms of the CLUE concepts.
When considering the SIP conferencing framework it can be seen that
there are complications with interworking with the scene concept.
There may be multiple media of the same type, e.g. room view and
presentation, but they are easily identified. This also needs to be
considered.
2.3. Description of the contents of a switched/composed capture
Whilst switching and composition may be represented by one capture
and one resulting media stream, there may be multiple original
source captures. Each of these source captures would have had its
own set of attributes. A media capture with the composed attribute
allows the description of the capture as a whole but not a
description of its constituent parts. In the case of an MCU taking
multiple input media captures and compositing them into one output
capture, the CLUE-related characteristics of these inputs are lost
in the current solution. Alternate methods such as the RFC6501
layout field may need to be investigated.
Consider the case where MCUs receive CLUE advertisements from various
endpoints. Having a single capture with a switched attribute makes
it difficult to fully express what the content is when it is from
multiple endpoints. It may be possible to specify lists of capture
attribute values when sending an advertisement from the MCU, i.e.
role=speaker,audience but it becomes difficult to relate multiple
attributes, i.e.
(role=speaker,language=English),(role=audience,language=french).
One capture could represent source captures from multiple locations.
A consumer may wish to examine the inputs to a switched capture,
i.e. choose which of the original endpoints it wants to see/hear.
In order to do this, the original capture information would need to
be conveyed in a manner that minimises overhead for the MCU.
Being able to link multiple source captures to one mixed
(switched/composed) capture in a CLUE advertisement allows a fuller
description of the content of the capture.
2.4. Attribute interaction
Today the "composed" and "switched" attributes appear at a media
capture level. If "switched" is specified for multiple captures in a
capture scene it's not clear from the framework what the switching
policy is. For example, if a CSE contains three VCs, each with
"switched", does the switch occur between these captures, or does
the switch occur internally to each capture?
The "scene-switch-policy" CSE attribute has been defined to indicate
the switch policy, but there doesn't appear to be a description of
whether this relates only to captures marked with "switched" and/or
"composed". If a CSE marked with "scene-switch-policy" contains
non-switched, non-composed captures, what does this mean?
What are the interactions between the two attributes? For example,
are "switched" and "composed" mutually exclusive? Is a switched
capture with a scene-switch-policy of "segment-switched" a
"composed" capture?
These issues need to be clarified in the framework.
2.5. Policy
The "Scene-switch-policy" attribute allows the indication of whether
switched captures are "site" or "segment" switched. However, there
is no indication of what the switch or composition "trigger" policy
is. Content could be provided based on round robin, loudest speaker,
etc. Where an advertising endpoint supports different algorithms it
would be advantageous for a consumer to know and select an
applicable policy.
2.6. Media stream composition and encodings
Whether single or multiple streams are used for switched captures is
not clear from the capture description. For example:
There are 3 endpoints (A, B, C), each with 3 video captures (VCa1,
VCa2, VCa3, etc.). An MCU wants to indicate to endpoint C that it
can offer a switched view of endpoints A and B.
It could send an Advertisement with CSE (VCa1,VCa2,VCa3,
VCb1,VCb2,VCb3),scene-switch-policy=site-switch.
Normally such a configuration (without the switch policy) would
relate to 6 different media streams. Switching introduces several
possibilities.
For site switching:
a) There could be one media stream with the contents of all 6
captures. The MCU always sends a composed image with the VCs
from the applicable endpoint.
b) There could be two media streams each containing the VCs from
one endpoint, the MCU chooses which stream to send.
c) There could be 6 media streams. The MCU chooses which 3
streams to send.
For segment switching this is further complicated because the MCU may
choose to send media related to endpoint A or B. There is no text
describing any limitation, so the MCU may send 1 VC or 5.
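The delivery options above can be sketched informally; the function
names and the "composed-" stream labels below are hypothetical
illustrations, not CLUE syntax:

```python
# Illustrative sketch of the site-switch options above; the function
# names and the "composed-" stream labels are hypothetical, not CLUE
# syntax.

CAPTURES = {
    "A": ["VCa1", "VCa2", "VCa3"],
    "B": ["VCb1", "VCb2", "VCb3"],
}

def streams_option_b(active_site):
    # Option b): one pre-composed media stream per endpoint; the MCU
    # chooses which single stream to send.
    return ["composed-" + active_site]

def streams_option_c(active_site):
    # Option c): six media streams exist; the MCU forwards only the
    # three streams belonging to the currently active site.
    return CAPTURES[active_site]

print(streams_option_b("B"))  # ['composed-B']
print(streams_option_c("A"))  # ['VCa1', 'VCa2', 'VCa3']
```

In each option the consumer receives site-consistent content; the
options differ only in where composition happens and in how many
streams are carried.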
Utilising CLUE "encodings" may be a way to describe how the switch
takes place in terms of the media provided, but such a description
is missing from the framework. One could assume that an individual
encoding may be assigned to multiple media captures (i.e. multiple
VCs to indicate they are encoded in the same stream), but again this
is problematic as the framework indicates that "An Individual
Encoding can be assigned to at most one Capture Encoding at any
given time." This could do with further clarification in the
framework.
2.7. Relation of switched captures to simultaneous transmission sets
Simultaneous Transmission Set is defined as "a set of Media Captures
that can be transmitted simultaneously from a Media Provider." It's
not clear how this definition would relate to switched or composed
streams. The captures may not be able to be sent at the same time
but may form a timeslot on a particular stream. They may be provided
together but not at precisely the same time.
Section 6.3 of the current version of the framework indicates
that:
"It is a syntax conformance requirement that the simultaneous
transmission sets must allow all the media captures in any
particular Capture Scene Entry to be used simultaneously."
If switching or composition is specified at the capture level only,
it is evident that simultaneity constraints do not come into play.
However, if multiple captures are used in a single media stream,
i.e. associated with the CSE, then these may be subject to a
simultaneous transmission set description.
It is also noted that there is a similar issue for encoding groups.
See Section 8 of [I-D.ietf-clue-framework]:
"It is a protocol conformance requirement that the Encoding
Groups must allow all the Captures in a particular Capture
Scene Entry to be used simultaneously."
If "switching" is used then there is no need to send the encodings at
the same time.
This needs to be clarified.
2.8. Conveying spatial information for switched/composed captures
CLUE currently allows the ability to signal spatial information
related to a media capture. It is unclear in the current draft how
this would work with switching/composition. Section 6.1 of
[I-D.ietf-clue-telepresence-use-cases] does say:
"For a switched capture that switches between different
sections within a larger area, the area of capture should use
coordinates for the larger potential area."
This describes a single capture, not the case where there are
multiple switched captures. It appears to focus on segment
switching rather than site switching and does not appear to cover
"composed" (if it is related).
An advertiser may or may not want to use common spatial attributes
for the captures associated with a switched capture. For example,
it may be beneficial for the Advertiser of a composed image to
indicate that different captures have different capture areas in a
virtual space.
This should be given consideration in the framework.
2.9. Consumer selection
Section 6.2.2 of version 9 of [I-D.ietf-clue-framework] indicates
that an Advertiser may provide multiple values for the "scene-
switch-policy" attribute and that the Consumer may choose and
return the value it requires.
In version 9 of the framework there was no mechanism in CLUE for a
Consumer to choose and return individual values from capture scene,
CSE or media capture attributes.
In version 10 of the framework the text was updated to indicate that
the consumer could choose values from a list. It is not clear that
this capability is needed as the procedure only relates to the
"scene-switch-policy". The switching policy may be better specified
by other means.
3. Proposal
As discussed above, there are a number of issues with regard to the
support of switched/composed captures/streams in CLUE, particularly
when considering MCUs. The authors believe that there is no single
action that can address all of the above issues. Several options
are discussed below; they are not mutually exclusive.
1) Introduce syntax to CLUE to better describe source captures
2) Introduce updates to the XCON conferencing framework (e.g.
Conference package, XCON etc.) to introduce CLUE concepts.
3) Update CLUE to better describe the current suite of
attributes with the understanding these provide limited
information with respect to source information.
3.1. CLUE Syntax Updates
The authors believe that there are a number of requirements for this:
- It should be possible to advertise the individual captures
that make up a single switched/composed media stream before
receiving the actual media stream.
- It should be possible to describe the relationship between
captures that make up a single switched/composed media
stream.
- It should be possible to describe this using CLUE semantics
rather than with terms such as "site" or "segment" which need
their own definition.
The authors also believe that whether media is composed, segment
switched or site switched, the common element is that the media
stream contains multiple captures from potentially multiple sources.
[I-D.ietf-clue-framework] does have the "scene-switch-policy"
attribute at the CSE level, but as described in Section 2 it is not
sufficient, for several reasons: it is not possible to assign an
encoding to a CSE, a CSE cannot reference captures from multiple
scenes, and there is a relationship with STSs that needs to be
considered.
In order to be able to fully express and support media streams with
multiple captures, the authors propose a new type of capture, the
"multiple content capture" (MCC). The MCC is essentially the same
as an audio or video capture in that it may have its own attributes;
the main difference is that it can also include other captures,
i.e. the MCC is composed of other captures. This composition may be
positional (i.e. segments/tiling) or temporal (switched), etc., and
is specified by a policy attribute. The MCC can be assigned an
encoding. For example:
MCC1(VC1,VC2,VC3),[POLICY]
This would indicate that MCC1 is composed of 3 video captures
according to the policy.
One further difference is that an MCC may reference individual
captures from multiple scenes. For example:
CS#1(VC1,VC2)
CS#2(VC3,VC4)
CS#3(MCC1(VC1,VC3))
This would indicate that scene #3 contains an MCC that is composed
of the individual captures VC1 and VC3. This allows the consumer to
associate any capture scene properties from the original scenes with
the multiple content capture.
The MCC can be utilised by both normal endpoints and MCUs. For
example, it would allow an endpoint to construct a mixed video
stream that is a virtual scene composed of presentation video and
individual captures.
This proposal does not consider any relation to the SIP conferencing
framework.
The sections below provide more detail on the proposal.
3.1.1. Definitions
Multiple content capture: Media capture for audio or video that
indicates the capture contains multiple audio or video captures.
Individual media captures may or may not be present in the resultant
capture encoding depending on time or space. Denoted as MCCn in the
example cases in this document.
3.1.2. Multiple Content Capture Details
The MCC indicates that multiple captures are contained in one media
capture by referencing the applicable individual media captures.
Only one capture type (i.e. audio, video, etc.) is allowed in each
MCC instance. The MCC contains references to the media captures as
well as attributes associated with the MCC itself. The MCC may
reference individual captures from other capture scenes. If an MCC
is used in a CSE that CSE may also reference captures from other
Capture Scenes.
Note: Different Capture Scenes are not spatially related.
Each instance of the MCC has its own captureID i.e. MCC1. This
allows all the individual captures contained in the MCC to be
referenced by a single ID.
The example below shows the use of a multiple content capture:
CaptureScene1 [VC1 {attributes},
VC2 {attributes},
VC3 {attributes},
MCC1(VC1,VC2,VC3){attributes}]
This indicates that MCC1 is a single capture that contains the
captures VC1, VC2 and VC3 according to any MCC1 attributes.
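As an informal illustration of this structure, an MCC can be
modelled as a capture that additionally references its contained
captures, with the constraint that only one capture type is allowed
per MCC instance checked explicitly. The class and field names
below are hypothetical, not taken from the CLUE data model:

```python
from dataclasses import dataclass, field

# Hypothetical modelling of an MCC (names are illustrative, not the
# CLUE data model): an MCC is itself a capture with its own ID and
# attributes, plus references to the captures it contains.

@dataclass
class Capture:
    capture_id: str              # e.g. "VC1"
    media_type: str              # "video" or "audio"
    attributes: dict = field(default_factory=dict)

@dataclass
class MCC(Capture):
    contents: list = field(default_factory=list)

    def single_media_type(self):
        # Only one capture type is allowed in each MCC instance.
        return all(c.media_type == self.media_type
                   for c in self.contents)

vc1 = Capture("VC1", "video")
vc2 = Capture("VC2", "video")
vc3 = Capture("VC3", "video")
mcc1 = MCC("MCC1", "video", contents=[vc1, vc2, vc3])
print(mcc1.single_media_type())  # True
```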
One or more MCCs may also be specified in a CSE. This allows an
Advertiser to indicate that several MCC captures are used to
represent a capture scene.
Note: Section 6.1/[I-D.ietf-clue-framework] indicates that "A Media
Capture is associated with exactly one Capture Scene". For MCCs
this could be further clarified to indicate that "A Media Capture is
defined in a capture scene and is given an advertisement-unique
identity. The identity may be referenced outside the Capture Scene
that defines it through a multiple content capture (MCC)."
3.1.3. MCC Attributes
Attributes may be associated with the MCC instance and the individual
captures that the MCC references. A provider should avoid providing
conflicting attribute values between the MCC and individual captures.
Where there is conflict the attributes of the MCC override any that
may be present in the individual captures.
There are two MCC-specific attributes, "MaxCaptures" and "Policy",
which give more information regarding when the individual captures
appear and what policy is used to determine this.
The spatially related attributes can be further used to determine
how the individual captures "appear" within a stream. For example, a
virtual scene could be constructed for the MCC capture with two video
captures with a "MaxCaptures" attribute of 2 and an "area of capture"
attribute provided with an overall area. Each of the individual
captures could then also include an "area of capture" attribute with
a sub-set of the overall area. The consumer would then know the
relative position of the content in the composed stream. For
example: The above capture scene may indicate that VC1 has an x-axis
capture area 1-5, VC2 6-10 and VC3 11-15. The MCC capture may
indicate an x-axis capture area 1-15.
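A minimal sketch of the consumer-side check implied by this example
follows; the helper function is hypothetical, not part of CLUE:

```python
# Hypothetical consumer-side check for the example above: each
# individual capture's x-axis area must fall within the overall
# area advertised on the MCC.

def areas_within(mcc_area, capture_areas):
    lo, hi = mcc_area
    return all(lo <= a and b <= hi
               for (a, b) in capture_areas.values())

individual = {"VC1": (1, 5), "VC2": (6, 10), "VC3": (11, 15)}
print(areas_within((1, 15), individual))  # True
```

With this information the consumer can place each capture's content
at its relative position within the composed stream.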
3.1.4. MaxCaptures Attribute
MaxCaptures:{integer}
This field is only associated with MCCs and indicates the maximum
number of individual captures that may appear in a capture encoding
at a time. It may be used to derive how the individual captures
within the MCC are composed with regards to space and time.
Individual content in the capture may be switched in time so that
only one of the individual captures/CSEs is shown (MaxCaptures:1).
The individual captures may be composed so that they are all shown in
the MCC (MaxCaptures:n).
For example:
MCC1(VC1,VC2,VC3),MaxCaptures:1
This would indicate that the Advertiser, in the capture encoding,
would switch (or compose, depending on policy) between VC1, VC2 and
VC3, as a maximum of one capture may be present at a time.
3.1.5. Composition policy
TBD - This attribute addresses which algorithm the endpoint/MCU
uses to determine what appears in the MCC capture, e.g. loudest
speaker, round robin.
3.1.6. Synchronisation
Note: The "scene-switch-policy" attribute has values indicating
"site-switch" or "segment-switch". The distinction between these is
that "site-switch" indicates that when there is mixed content, the
captures related to an endpoint appear together; "segment-switch"
indicates that captures from different endpoints could appear
together. An issue is that a Consumer has no concept of
"endpoints", only "capture scenes". Also, as highlighted above, a
Consumer has no method to return parameters for CSEs.
The use of MCCs enables the Advertiser to communicate to the Consumer
that captures originate from different capture scenes. In cases
where multiple MCCs represent a scene (i.e. multiple MCCs in a CSE)
an Advertiser may wish to indicate that captures from one capture
scene are present in the capture encodings of specified MCCs at the
same time. Having an attribute at capture level removes the need for
CSE level attributes which are problematic for consumers.
Sync-id: {integer}

This MCC attribute indicates how the individual captures in multiple
MCC captures are synchronised. To indicate that the capture
encodings associated with MCCs contain captures from the source at
the same time, the Advertiser should set the same Sync-id on each of
the concerned MCCs. It is the provider that determines what the
source for the captures is. For example when the provider is in an
MCU it may determine that each separate CLUE endpoint is a remote
source of media.
For example:
CaptureScene1[Description=AustralianConfRoom,
VC1(left),VC2(middle),VC3(right),
CSE1(VC1,VC2,VC3)]
CaptureScene2[Description=ChinaConfRoom,
VC4(left),VC5(middle),VC6(right),
CSE2(VC4,VC5,VC6)]
CaptureScene3[MCC1(VC1,VC4){Sync-id:1}{encodinggroup1},
MCC2(VC2,VC5){Sync-id:1}{encodinggroup2},
MCC3(VC3,VC6){encodinggroup3},
CSE3(MCC1,MCC2,MCC3)]
Figure 1: Synchronisation Example
The above advertisement indicates that MCC1, MCC2 and MCC3 make up
a capture scene. There would be three capture encodings. Because
MCC1 and MCC2 have the same Sync-id, encoding1 and encoding2 would
together contain content from only capture scene 1 or only capture
scene 2 at any particular point in time. Encoding3 would not be
synchronised with encoding1 or encoding2.
Without this attribute it is assumed that multiple MCCs may provide
different sources at any particular point in time.
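The Sync-id rule in Figure 1 can be sketched as a consistency check;
the data structures below are assumed for illustration only:

```python
# Hypothetical check of the Sync-id rule: MCCs sharing a Sync-id
# must carry captures from the same source scene at any point in
# time; MCCs without a Sync-id are unconstrained.

SCENE_OF = {"VC1": "CS1", "VC2": "CS1", "VC3": "CS1",
            "VC4": "CS2", "VC5": "CS2", "VC6": "CS2"}

def synchronised_ok(snapshot, sync_ids):
    # snapshot maps MCC id -> capture currently in its encoding;
    # sync_ids maps MCC id -> Sync-id (or None).
    groups = {}
    for mcc, capture in snapshot.items():
        sid = sync_ids.get(mcc)
        if sid is not None:
            groups.setdefault(sid, set()).add(SCENE_OF[capture])
    # Every Sync-id group must draw from exactly one scene.
    return all(len(scenes) == 1 for scenes in groups.values())

sync_ids = {"MCC1": 1, "MCC2": 1, "MCC3": None}
# MCC1/MCC2 both carry scene-1 captures; MCC3 is free to differ.
print(synchronised_ok({"MCC1": "VC1", "MCC2": "VC2", "MCC3": "VC6"},
                      sync_ids))  # True
# Mixing CS1 and CS2 within Sync-id group 1 violates the rule.
print(synchronised_ok({"MCC1": "VC1", "MCC2": "VC5", "MCC3": "VC3"},
                      sync_ids))  # False
```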
3.1.7. MCC and encodings
MCCs shall be assigned an encoding group and thus become a capture
encoding. The captures referenced by the MCC do not need to be
assigned to an encoding group. This means that all the individual
captures referenced by the MCC will appear in the capture encoding
according to any MCC attributes. This allows an Advertiser to
specify capture attributes associated with the individual captures
without the need to provide an individual capture encoding for each
of the inputs.
If an encoding group is assigned to an individual capture referenced
by the MCC it indicates that this capture may also have an individual
capture encoding.
For example:
CaptureScene1[VC1{encodinggroup1},
VC2,
MCC1(VC1,VC2){encodinggroup3}]
This would indicate that VC1 may be sent as its own capture encoding
from encoding group1 or that it may be sent as part of a capture
encoding from encoding group3 along with VC2.
Note: Section 8 of [I-D.ietf-clue-framework] indicates that every
capture is associated with an encoding group.  To utilise MCCs this
requirement has to be relaxed.
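The encoding rule above can be sketched as follows.  The function and
dictionary shapes are hypothetical, for illustration only:

```python
# Hypothetical sketch of the rule above: an MCC assigned an encoding
# group yields a capture encoding; an individual capture referenced by
# the MCC yields its own capture encoding only if it too has an
# encoding group.

def possible_capture_encodings(captures, mccs):
    """captures: {capture_id: encoding group or None}
    mccs: {mcc_id: {"refs": [capture ids], "group": encoding group}}
    Returns the set of (source, group) pairs that may be sent."""
    encodings = set()
    for mcc_id, mcc in mccs.items():
        encodings.add((mcc_id, mcc["group"]))   # MCC capture encoding
    for cap_id, group in captures.items():
        if group is not None:
            encodings.add((cap_id, group))      # individual encoding
    return encodings

# Mirror of the example: VC1 has its own group, VC2 does not.
encs = possible_capture_encodings(
    {"VC1": "encodinggroup1", "VC2": None},
    {"MCC1": {"refs": ["VC1", "VC2"], "group": "encodinggroup3"}},
)
```

In this sketch VC1 may be sent on its own or within MCC1, while VC2
only ever appears inside MCC1's capture encoding, as described above.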
3.1.8. MCCs and STSs
An MCC can be used in simultaneous transmission sets (STSs),
therefore providing a means to indicate whether several multiple
content captures can be provided at the same time.  Captures within
an MCC can be provided together but not necessarily at the same
time.  Therefore specifying an MCC in an STS does not indicate that
all the referenced individual captures may be present at the same
time.  The MaxCaptures attribute indicates the maximum number of
captures that may be present at any one time.
An MCC instance is limited to one media type, e.g. video, audio or
text.
Note: This avoids the problem where the framework says that all
captures (even switched ones) within a CSE have to be allowed in an
STS to be sent at the same time.
3.1.9. Consumer Behaviour
On receipt of an advertisement with an MCC the Consumer treats the
MCC as per other individual captures with the following differences:
- The Consumer would understand that the MCC is a capture that
includes the referenced individual captures and that these
individual captures would be delivered as part of the MCC's
capture encoding.
- The Consumer may utilise any of the attributes associated
with the referenced individual captures, and any attributes of
the capture scene where the individual capture was defined,
when choosing captures.
- The Consumer may or may not want to receive all the indicated
captures. Therefore it can choose to receive a sub-set of
captures indicated by the MCC.
For example if the Consumer receives:
MCC1(VC1,VC2,VC3){attributes}
A Consumer should choose all the captures within an MCC; however, if
the Consumer determines that it does not want VC3 it can return
MCC1(VC1,VC2).  If it wants all the individual captures then it
returns just a reference to the MCC (i.e. MCC1).
Note: The ability to return a subset of captures is for consistency
with the current framework, which says that a Consumer should choose
all the captures from a CSE but allows it to select a subset (if the
STS is provided).  The intent is to provide equivalent functionality
for an MCC.
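The Consumer's choice can be sketched as follows.  The function and
the textual output form are illustrative only, not a defined encoding:

```python
# Hypothetical sketch of the Consumer rule above: reference the whole
# MCC when all captures are wanted, otherwise list the wanted subset.

def configure_entry(mcc_id, advertised, wanted):
    """advertised/wanted: lists of capture ids.  Returns the string
    the Consumer would place in its Configure message."""
    if set(wanted) == set(advertised):
        return mcc_id                                 # e.g. "MCC1"
    kept = [c for c in advertised if c in wanted]     # keep order
    return "%s(%s)" % (mcc_id, ",".join(kept))        # e.g. "MCC1(VC1,VC2)"
```

For the example above, wanting everything yields "MCC1"; dropping VC3
yields "MCC1(VC1,VC2)".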
3.1.10. MCU Behaviour
The use of MCCs allows the MCU to easily construct outgoing
Advertisements. The following sections provide several examples.
3.1.10.1. Single content captures and multiple contents capture in the
same Advertisement
Four endpoints are involved in a CLUE session.  To formulate an
Advertisement to endpoint 4, the following Advertisements received
from endpoints 1 to 3 are used by the MCU.  Note: The IDs overlap in
the incoming advertisements.  The MCU is responsible for making them
unique in the outgoing advertisement.
Endpoint 1 CaptureScene1[Description=AustralianConfRoom,
VC1(role=audience)]
Endpoint 2 CaptureScene1[Description=ChinaConfRoom,
VC1(role=speaker),VC2(role=audience),
CSE1(VC1,VC2)]
Endpoint 3 CaptureScene1[Description=USAConfRoom,
VC1(role=audience)]
Figure 2: MCU case: Received advertisements
Note: Endpoint 2 above indicates that it sends two streams.
If the MCU wanted to provide a multiple content capture containing
the audience of the 3 endpoints and the speaker it could construct
the following advertisement:
CaptureScene1[Description=AustralianConfRoom,
VC1(role=audience)]
CaptureScene2[Description=ChinaConfRoom,
VC2(role=speaker),VC3(role=audience),
CSE1(VC2,VC3)]
CaptureScene3[Description=USAConfRoom,
VC4(role=audience)]
CaptureScene4[MCC1(VC1,VC2,VC3,VC4){encodinggroup1}]
Figure 3: MCU case: MCC with multiple audience and speaker
Alternatively if the MCU wanted to provide the speaker as one stream
and the audiences as another it could assign an encoding group to VC2
in Capture Scene 2 and provide a CSE in Capture Scene 4:
CaptureScene1[Description=AustralianConfRoom,
VC1(role=audience)]
CaptureScene2[Description=ChinaConfRoom,
VC2(role=speaker){encodinggroup2},
VC3(role=audience),
CSE1(VC2,VC3)]
CaptureScene3[Description=USAConfRoom,
VC4(role=audience)]
CaptureScene4[MCC1(VC1,VC3,VC4){encodinggroup1},
CSE2(MCC1,VC2)]
Figure 4: MCU case: MCC with audience and separate speaker
Therefore a Consumer could choose whether or not to have a separate
"role=speaker" stream and could choose which endpoints to see. If it
wanted the second stream but not the Australian conference room it
could indicate the following captures in the Configure message:
MCC1(VC3,VC4),VC2
Figure 5: MCU case: Consumer Response
3.1.10.2. Several multiple content captures in the same Advertisement
Multiple MCCs can be used where multiple streams are used to carry
media from multiple endpoints. For example:
A conference has three endpoints D,E and F, each end point has three
video captures covering the left, middle and right regions of each
conference room. The MCU receives the following advertisements from
D and E:
Endpoint D CaptureScene1[Description=AustralianConfRoom,
VC1(left){encodinggroup1},
VC2(middle){encodinggroup2},
VC3(right){encodinggroup3},
CSE1(VC1,VC2,VC3)]
Endpoint E CaptureScene1[Description=ChinaConfRoom,
VC1(left){encodinggroup1},
VC2(middle){encodinggroup2},
VC3(right){encodinggroup3},
CSE1(VC1,VC2,VC3)]
Figure 6: MCU case: Multiple captures from multiple endpoints
Note: The Advertisements use the same identities.  There is no
coordination between endpoints, so identity overlap between received
advertisements is likely.
The MCU wants to offer Endpoint F three capture encodings. Each
capture encoding would contain a capture from either Endpoint D or
Endpoint E depending on the policy. The MCU would send the
following:
CaptureScene1[Description=AustralianConfRoom,
VC1(left),VC2(middle),VC3(right),
CSE1(VC1,VC2,VC3)]
CaptureScene2[Description=ChinaConfRoom,
VC4(left),VC5(middle),VC6(right),
CSE2(VC4,VC5,VC6)]
CaptureScene3[MCC1(VC1,VC4){encodinggroup1},
MCC2(VC2,VC5){encodinggroup2},
MCC3(VC3,VC6){encodinggroup3},
CSE3(MCC1,MCC2,MCC3)]
Figure 7: MCU case: Multiple MCCs for multiple captures
Note: The identities from Endpoint E have been renumbered so that
they are unique in the outgoing advertisement.
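The renumbering step can be sketched as follows.  The function and the
simple "VCn" numbering scheme are hypothetical, for illustration only:

```python
# Hypothetical sketch of the renumbering described above: capture
# identities from each incoming Advertisement are mapped to fresh,
# unique identities in the outgoing Advertisement.

def renumber(advertisements):
    """advertisements: {endpoint: [capture ids]}, possibly overlapping.
    Returns {endpoint: {old id: new unique id}}."""
    mapping, n = {}, 1
    for endpoint in sorted(advertisements):
        mapping[endpoint] = {}
        for old in advertisements[endpoint]:
            mapping[endpoint][old] = "VC%d" % n   # fresh outgoing id
            n += 1
    return mapping

# Endpoints D and E both advertise VC1..VC3, as in Figure 6.
mapping = renumber({"D": ["VC1", "VC2", "VC3"],
                    "E": ["VC1", "VC2", "VC3"]})
```

Applied to Figure 6 this reproduces Figure 7: Endpoint D's captures
keep VC1-VC3 while Endpoint E's become VC4-VC6.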
3.2. Multipoint Conferencing Framework Updates
The CLUE protocol extends the endpoint description defined in the
signalling protocol (SDP for SIP) by providing more information about
the available media.  XCON also uses the information available from
the signalling protocol, but instead of using SDP it distributes the
participant information and controls the multipoint conference using
a data structure defined in XML, conveyed via the CCMP protocol over
HTTP (note that CCMP could also be carried over the CLUE channel if
required).  XCON provides a hierarchy that starts from conference
information, which includes users, who have endpoints, which have
media.
The role is part of the user structure, while the mixing mode is part
of the conference level information, specifying the mixing mode for
each of the media types available in the conference.
CLUE, on the other hand, has no such structure: it starts from what
is, in XCON terms, an endpoint that has media structured by scenes.
There is no user or conference level information, though the "role"
proposal tries to add user information (note that user information is
different from the role in the call or the conference).
The XCON structure is better suited to a multipoint conference, yet
it does not make sense to use such a data model for point-to-point
calls.  Therefore adopting only this option means that capture
attribute information would not be available for point-to-point
calls.
3.3. Existing Parameter Updates
As discussed in section 2, the existing CLUE attributes surrounding
switching and composition have a number of open issues.  This section
proposes changes to the text describing the attributes to better
describe their usage and interaction.  It is also assumed that when
these attributes are used there is no attempt to describe any
component source capture information.
3.3.1. Composed
The current CLUE framework describes the "Composed" attribute as:
A boolean field which indicates whether or not the Media
Capture is a mix (audio) or composition (video) of streams.
This attribute is useful for a media consumer to avoid
nesting a composed video capture into another composed
capture or rendering. This attribute is not intended to
describe the layout a media provider uses when composing
video streams.
It is proposed to update the description:
A boolean field which indicates whether or not the Media
Capture has been composed from a mix of audio sources or
several video sources. The sources may be local to the
provider (i.e. video capture device) or remote to the
provider (i.e. a media stream received by the provider from a
remote endpoint). This attribute is useful for a media
consumer to avoid nesting a composed video capture into
another composed capture or rendering.
This attribute does not imply anything with regards to the
attributes of the source audio or video except that the
composed capture will be contained in a capture encoding from
a single source. This attribute is not intended to describe
the layout a media provider uses when composing video
streams.
The "composed" attribute may be used in conjunction with a
"switched" attribute when one or more of the dynamic sources
is a composition.
3.3.2. Switched
The current CLUE framework describes the "Switched" attribute as:
A boolean field which indicates whether or not the Media
Capture represents the (dynamic) most appropriate subset of a
'whole'. What is 'most appropriate' is up to the provider
and could be the active speaker, a lecturer or a VIP.
It is proposed to update the description:
A boolean field which indicates whether the Media Capture
represents a dynamic representation of the capture scene that
contains the capture. It applies to both audio and video
captures.
A dynamic representation is one that, over time, provides alternate
capture sub-areas within the overall area of capture, in a single
capture encoding from one source.  Which capture area is contained
in the capture encoding at a particular time is dependent on
provider policy.  For example, a provider may encode the active
speaker or lecturer based on volume level.  It is not possible for
Consumers to associate attributes with a particular capture
sub-area, nor to indicate which sub-capture area they require.
3.3.3. Scene-switch-policy
The current CLUE framework describes the "Scene Switch Policy"
attribute as:
Scene-switch-policy: {site-switch, segment-switch}
A media provider uses this scene-switch-policy attribute to
indicate its support for different switching policies. In
the provider's Advertisement, this attribute can have
multiple values, which means the provider supports each of
the indicated policies.
The consumer, when it requests media captures from this
Capture Scene Entry, should also include this attribute but
with only the single value (from among the values indicated
by the provider) indicating the Consumer's choice for which
policy it wants the provider to use. The Consumer must
choose the same value for all the Media Captures in the
Capture Scene Entry. If the provider does not support any of
these policies, it should omit this attribute.
The "site-switch" policy means all captures are switched at
the same time to keep captures from the same endpoint site
together. Let's say the speaker is at site A and everyone
else is at a "remote" site.
When the room at site A is shown, all the camera images from
site A are forwarded to the remote sites. Therefore at each
receiving remote site, all the screens display camera images
from site A. This can be used to preserve full size image
display, and also provide full visual context of the
displayed far end, site A. In site switching, there is a
fixed relation between the cameras in each room and the
displays in remote rooms. The room or participants being
shown is switched from time to time based on who is speaking
or by manual control.
The "segment-switch" policy means different captures can
switch at different times, and can be coming from different
endpoints. Still using site A as where the speaker is, and
"remote" to refer to all the other sites, in segment
switching, rather than sending all the images from site A,
only the image containing the speaker at site A is shown.
The camera images of the current speaker and previous
speakers (if any) are forwarded to the other sites in the
conference.
Therefore the screens in each site are usually displaying
images from different remote sites - the current speaker at
site A and the previous ones. This strategy can be used to
preserve full size image display, and also capture the non-
verbal communication between the speakers. In segment
switching, the display depends on the activity in the remote
rooms - generally, but not necessarily based on audio /
speech detection.
Firstly it is proposed to rename this attribute to "Capture Source
Synchronisation" in order to remove any confusion with the "Switched"
attribute, and also to remove the association with a scene, as any
information regarding source scenes is lost (the CSE represents the
current scene).  No change in functionality is intended by the
renaming.  It is proposed to describe it as follows:
Capture Source Synchronisation: {source-synch,asynch}
By setting this attribute against a CSE it indicates that
each of the media captures specified within the CSE results
in a capture encoding that contains media related to
different remote sources. For example if CSE1 contains
VC1,VC2,VC3 then there will be three capture encodings sent
from the provider each displaying captures from different
remote sources. It is the provider that determines what the
source for the captures is. For example when the provider is
in an MCU it may determine that each separate CLUE endpoint
is a remote source of media. Likewise it is the provider
that determines how many remote sources are involved.
However it is assumed that each capture within the CSE will
contain the same number and set of sources.
"Source-synch" indicates that each capture encoding related
to the captures within the CSE contains media related to one
remote source at the same point in time.
"Asynch" indicates that each capture encoding may contain media
related to any remote source at any point in time.
If a provider supports both synchronisation methods it should
send separate CSEs containing separate captures, each CSE
with a separate capture source synchronisation label.
A provider when setting attributes against captures within a
Capture Source Synchronisation marked CSE should consider
that the media related to the remote sources may have its own
separate characteristics. For example: each source may have
its own capture area; therefore this needs to be taken into account
in the provider's Advertisement.
The "Switched" attribute may be used with a capture in a
"Capture Source Synchronisation" marked CSE.  This indicates
that one or more of the remote sources associated with the
capture has dynamic media that may change within its own time
frame, i.e. the media from a remote source may change without
impacting the other captures.
The "Composed" attribute may be used with captures in a
"Capture Source Synchronisation" marked CSE.  This indicates
that the capture encoding contains a composition of multiple
sources from one remote endpoint at a particular point in
time.
Furthermore, it is assumed that if the current set of parameters is
maintained, it is not possible to indicate the mechanism that
triggers the switching of sources (e.g. loudest source, round robin),
because the Consumer only chooses captures, not sources.  If this is
purely up to the provider then such information would be superfluous.
It is proposed to capture this:
The trigger (or policy) that decides when a source is present
is up to the provider. The ability to provide detailed
information about sources is for further study.
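The distinction between the two proposed synchronisation values can be
sketched as a simple validity check.  The function and data shapes are
hypothetical, purely to illustrate the constraint:

```python
# Hypothetical sketch of the two modes above: under "source-synch"
# every capture encoding in the CSE carries the same remote source at
# a given instant; under "asynch" they need not.

def valid_instant(policy, source_by_encoding):
    """source_by_encoding: {encoding id: remote source shown now}.
    Returns True if this instant satisfies the CSE's policy."""
    if policy == "source-synch":
        # All encodings must show one and the same remote source.
        return len(set(source_by_encoding.values())) <= 1
    if policy == "asynch":
        # Any combination of remote sources is acceptable.
        return True
    raise ValueError("unknown policy: %s" % policy)
```

For example, encodings showing EndpointA and EndpointB simultaneously
would be invalid under "source-synch" but valid under "asynch".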
3.3.4. MCU behaviour
When a CLUE endpoint is acting as an MCU, it implies the need for an
advertisement aggregation function: the endpoint receives CLUE
advertisements from multiple endpoints and uses this information, its
media processing capabilities and any policy information to form
advertisements to the other endpoints.
Contributor's note: TBD.  There needs to be a discussion here about
the fact that source information is lost and about how individual
attributes are affected; i.e. it may be possible to simply aggregate
language information, but it is not so simple when there is different
spatial information.  Capture encodings also need to be considered.
4. Acknowledgements
This template was derived from an initial version written by Pekka
Savola and contributed by him to the xml2rfc project.
5. IANA Considerations
It is not expected that the proposed changes present the need for any
IANA registrations.
6. Security Considerations
It is not expected that the proposed changes present any additional
security issues to the current framework.
7. References
7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
7.2. Informative References
[I-D.groves-clue-capture-attr]
Groves, C., Yang, W., and R. Even, "CLUE media capture
description", draft-groves-clue-capture-attr-01 (work in
progress), February 2013.
[I-D.ietf-clue-framework]
Duckworth, M., Pepperell, A., and S. Wenger, "Framework
for Telepresence Multi-Streams", draft-ietf-clue-
framework-11 (work in progress), July 2013.
[I-D.ietf-clue-telepresence-requirements]
Romanow, A., Botzko, S., and M. Barnes, "Requirements for
Telepresence Multi-Streams", draft-ietf-clue-telepresence-
requirements-04 (work in progress), July 2013.
[I-D.ietf-clue-telepresence-use-cases]
Romanow, A., Botzko, S., Duckworth, M., and R. Even, "Use
Cases for Telepresence Multi-streams", draft-ietf-clue-
telepresence-use-cases-05 (work in progress), April 2013.
[RFC2629] Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629,
June 1999.
[RFC4575] Rosenberg, J., Schulzrinne, H., and O. Levin, "A Session
Initiation Protocol (SIP) Event Package for Conference
State", RFC 4575, August 2006.
[RFC6501] Novo, O., Camarillo, G., Morgan, D., and J. Urpalainen,
"Conference Information Data Model for Centralized
Conferencing (XCON)", RFC 6501, March 2012.
[RFC6503] Barnes, M., Boulton, C., Romano, S., and H. Schulzrinne,
"Centralized Conferencing Manipulation Protocol", RFC
6503, March 2012.
Authors' Addresses
Christian Groves (editor)
Huawei
Melbourne
Australia
Email: Christian.Groves@nteczone.com
Weiwei Yang
Huawei
P.R.China
Email: tommy@huawei.com
Roni Even
Huawei
Tel Aviv
Israel
Email: roni.even@mail01.huawei.com