CLUE C. Groves, Ed.
Internet-Draft W. Yang
Intended status: Informational R. Even
Expires: February 12, 2014 Huawei
August 11, 2013

Describing Captures in CLUE and relation to multipoint conferencing
draft-groves-clue-multi-content-00

Abstract

In a multipoint Telepresence conference, there are more than two sites participating. Additional complexity is required to enable media streams from each participant to show up on the displays of the other participants. Common policies to address the multipoint case include "site-switch" and "segment-switch". This document discusses these policies, as well as the "composed" policy, and how they work in the multipoint case.

The current CLUE framework document contains the "composed" and "switched" attributes to describe situations where a capture is a mix or composition of streams, or where the capture represents a dynamic subset of streams. "Composed" and "switched" are capture level attributes. In addition to these attributes, the framework defines a "Scene-switch-policy" attribute at the capture scene entry (CSE) level which indicates how the captures are switched.

This draft discusses composition/switching in CLUE and makes a number of proposals to better define and support these capabilities.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on February 12, 2014.

Copyright Notice

Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.



1. Introduction

One major objective for Telepresence is to be able to preserve the "Being there" user experience. However, in multi-site conferences it is often (in fact usually) not possible to simultaneously provide full size video, eye contact, and a common perception of gestures and gaze to all participants. Several policies can be used for stream distribution and display: all provide good results, but they make different compromises.

The policies are described in [I-D.ietf-clue-telepresence-use-cases]. [I-D.ietf-clue-telepresence-requirements] has the following requirement:

REQMT-14:
The solution MUST support mechanisms to make possible for either or both site switching or segment switching. [Edt: This needs rewording. Deferred until layout discussion is resolved.]

The policies described in the use case draft include the site-switch, segment-switch and composed policies.

Site switching is described in the CLUE use case: "One common policy is called site switching. Let's say the speaker is at site A and everyone else is at a "remote" site. When the room at site A is shown, all the camera images from site A are forwarded to the remote sites. Therefore at each receiving remote site, all the screens display camera images from site A. This can be used to preserve full size image display, and also provide full visual context of the displayed far end, site A. In site switching, there is a fixed relation between the cameras in each room and the displays in remote rooms. The room or participants being shown is switched from time to time based on who is speaking or by manual control, e.g., from site A to site B."

These policies are mirrored in the framework document through a number of attributes.

Currently in the CLUE framework document [I-D.ietf-clue-framework] there are two media capture attributes relevant to this discussion: Composed and Switched.

Composed is defined as:

A field with a Boolean value which indicates whether or not the Media Capture is a mix (audio) or composition (video) of streams.
This attribute is useful for a media consumer to avoid nesting a composed video capture into another composed capture or rendering. This attribute is not intended to describe the layout a media provider uses when composing video streams.

Switched is defined as:

A field with a Boolean value which indicates whether or not the Media Capture represents the (dynamic) most appropriate subset of a 'whole'. What is 'most appropriate' is up to the provider and could be the active speaker, a lecturer or a VIP.

There is also a Capture Scene Entry (CSE) attribute, "scene-switch-policy", defined as:

A media provider uses this scene-switch-policy attribute to indicate its support for different switching policies.

2. Issues

This section discusses a number of issues in the current framework around the support of switched/composed captures and media streams when considering multipoint conferencing. Some issues concern required functions that are missing and some relate to the current description in the framework document.

2.1. Role of an MCU in a multipoint conference

In a multipoint conference there is a central control point (MCU). The MCU will have the CLUE advertisements from all the conference participants and will prepare and send advertisements to all the conference participants. The MCU will also have more information about the conference, participants and media, which it receives at conference creation and via call signalling. This data is not stable, since each user who joins or leaves the conference causes a change in conference state. An MCU supporting SIP may utilise the conference event package, XCON and CCMP to maintain and distribute conference state.

[RFC4575] defines a conference event package. Using the event framework, notifications are sent about changes in the membership of the conference and, optionally, about changes in the state of additional conference components. The conference information is composed of the conference description, host information, conference state, and users that have endpoints, where each endpoint includes the media description.
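
As a simplified sketch (element names as defined in [RFC4575]), the conference document hierarchy is:

    conference-info
     +-- conference-description
     +-- host-info
     +-- conference-state
     +-- users
          +-- user
               +-- endpoint
                    +-- media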

[RFC6501] extends the conference event package and tries to be signalling protocol agnostic. RFC6501 adds new elements but also provides values for some of the elements defined in RFC4575; for example, it defines roles (like "administrator", "moderator", "user", "participant", "observer", and "none").

[RFC6503] Centralized Conferencing Manipulation Protocol (CCMP) allows authenticated and authorized users to create, manipulate, and delete conference objects. Operations on conferences include adding and removing participants, changing their roles, as well as adding and removing media streams and associated endpoints.

CCMP implements the client-server model within the XCON framework, with the conferencing client and conference server acting as client and server, respectively. CCMP uses HTTP as the protocol to transfer requests and responses, which contain the domain-specific XML-encoded data objects defined in [RFC6501] "Conference Information Data Model for Centralized Conferencing (XCON)".

The XCON data model and CCMP provide a generic way to create and control conferences. CCMP is not SIP specific, but a SIP endpoint will subscribe to the conference event package to get information about changes in the conference state.

Therefore, when an MCU implements the above protocols there will be an interaction between any CLUE states and those within the conferencing framework. For example, if an endpoint leaves a conference, the MCU may need to indicate via CLUE to the other endpoints that the endpoint's captures are no longer available, and it would also need to indicate via the conferencing framework that the endpoint is no longer part of the conference.

The question is how these concepts relate, as the conferencing framework does not have the concept of captures or scenes. Other aspects overlap, for example:

-  The conference framework has "available media", CLUE has encodings to indicate the codec.
-  The conference framework has "users", CLUE has no concept of users although it has capture attributes that relate to the users in a capture.

It is noted that point-to-point calls may not implement the conferencing framework. It is desirable that CLUE procedures be the same whether an endpoint is communicating with a peer endpoint or an MCU.

2.2. Relation to scene

One of the early justifications for switching/composition was the ability to switch between sites. However, there is no concept of "site" in the CLUE hierarchy. The closest concept is an "endpoint", but this has no identity within the CLUE syntax. The highest level is the "clueInfo" that includes captureScenes, and an endpoint may have multiple capture scenes.

If the switched and composed attributes are specified at a capture level, it is not clear what the correlation is between the capture and the endpoint/scenes, particularly when the attributes are described in the context of sites. A scene may be composed of multiple captures. Where an MCU is involved in a conference with multiple endpoints, multiple capture scenes are involved. It becomes difficult to map all the scenes and capture information from the source endpoints into one capture scene sent to an endpoint. Discussion of switching, composition etc. needs to be described in terms of the CLUE concepts.

When considering the SIP conferencing framework it can be seen that there are complications in interworking with the scene concept. There may be multiple media of the same type, e.g. room view and presentation, but they are easily identified. This also needs to be considered.

2.3. Description of the contents of a switched/composed capture

Although switching and composition may be represented by one capture and one resulting media stream, there may be multiple original source captures. Each of these source captures would have had its own set of attributes. A media capture with the composed attribute allows the description of the capture as a whole but not a description of the constituent parts. In the case of an MCU taking multiple input media captures and compositing them into one output capture, the CLUE related characteristics of these inputs are lost in the current solution. Alternate methods such as the RFC6501 layout field etc. may need to be investigated.

Consider the case where an MCU receives CLUE advertisements from various endpoints. Having a single capture with a switched attribute makes it difficult to fully express what the content is when it comes from multiple endpoints. It may be possible to specify lists of capture attribute values when sending an advertisement from the MCU, i.e. role=speaker,audience, but it becomes difficult to relate multiple attributes, i.e. (role=speaker,language=English),(role=audience,language=French).
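
As an illustrative sketch (the attribute notation is only indicative), per-endpoint information in the incoming advertisements is flattened when the MCU advertises a single switched capture:

    Endpoint 1: VC1{role=speaker, language=English}
    Endpoint 2: VC1{role=audience, language=French}

    MCU:        VC1{switched, role=speaker,audience,
                    language=English,French}

The association between role=speaker and language=English cannot be recovered from the value lists.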

One capture could represent source captures from multiple locations. A consumer may wish to examine the inputs to a switched capture, i.e. choose which of the original endpoints it wants to see/hear. In order to do this, the original capture information would need to be conveyed in a manner that minimises overhead for the MCU.

Being able to link multiple source captures to one mixed (switched/composed) capture in a CLUE advertisement allows a fuller description of the content of the capture.

2.4. Attribute interaction

Today the "composed" and "switched" attributes appear at a media capture level. If "switched" is specified for multiple captures in a capture scene it's not clear from the framework what the switching policy is. For example: If a CSE contains three VCs each with "switched" does the switch occurs between these captures? Does the switch occur internal to each capture?

The "scene-switch-policy" CSE attribute has been defined to indicate switch policy but there doesn't appear to be a description of whether this only relates to captures marked with "switch" and/or "composed"? If a CSE marked with "scene-switch-policy" contains non-switched, non-composed captures what does this mean?

What are the interactions between the two attributes? E.g. are "switched" and "composed" mutually exclusive or not? Is a switched capture with a scene switch policy of "segment-switched" also a "composed" capture?

These issues need to be clarified in the framework.

2.5. Policy

The "Scene-switch-policy" attribute allows the indication of whether switched captures are "site" or "segment" switched. However there is no indication of what the switch or the composition "trigger" policy is. Content could be provided based on a round robin view, loudest speaker etc. Where an advertising endpoint supports different algorithms it would be advantageous for a consumer to know and select an applicable policy.

2.6. Media stream composition and encodings

Whether single or multiple streams are used for switched captures is not clear from the capture description. For example:

There are 3 endpoints (A, B, C), each with 3 video captures (VCa1, VCa2, VCa3, etc.). An MCU wants to indicate to endpoint C that it can offer a switched view of endpoints A and B.

It could send an Advertisement with CSE(VCa1,VCa2,VCa3,VCb1,VCb2,VCb3), scene-switch-policy=site-switch.

Normally such a configuration (without the switch policy) would relate to 6 different media streams. Switching introduces several possibilities.

For site switching:

a)
There could be one media stream with the contents of all 6 captures. The MCU always sends a composed image with the VCs from the applicable endpoint.
b)
There could be two media streams, each containing the VCs from one endpoint; the MCU chooses which stream to send.
c)
There could be 6 media streams. The MCU chooses which 3 streams to send.

For segment switching this is further complicated because the MCU may choose to send media related to endpoint A or B. There is no text describing any limitation, so the MCU may send 1 VC or 5.

Utilising CLUE "encodings" may be a way to describe how the switch is taking place in terms of media provided but such description is missing from the framework. One could assume that an individual encoding be assigned to multiple media captures (i.e. multiple VCs to indicate they are encoded in the same stream) but again this is problematic as the framework indicates that "An Individual Encoding can be assigned to at most one Capture Encoding at any given time."

This could do with further clarification in the framework.

2.7. Relation of switched captures to simultaneous transmission sets

A Simultaneous Transmission Set is defined as "a set of Media Captures that can be transmitted simultaneously from a Media Provider." It is not clear how this definition relates to switched or composed streams. The captures may not be able to be sent at the same time, but may form a timeslot on a particular stream. They may be provided together but not at precisely the same time.

Section 6.3 of the current version of the framework indicates that:

"It is a syntax conformance requirement that the simultaneous transmission sets must allow all the media captures in any particular Capture Scene Entry to be used simultaneously."

If switching or composition is specified at a capture level only, it is evident that simultaneity constraints do not come into play. However, if multiple captures are used in a single media stream, i.e. associated with the CSE, then these may be subject to a simultaneous transmission set description.
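
For example, consider this sketch of a CSE of three switched captures that the provider time-multiplexes onto a single stream:

    CaptureScene1[VC1{switched},VC2{switched},VC3{switched},
                  CSE1(VC1,VC2,VC3)]

The conformance requirement quoted above would demand an STS that allows VC1, VC2 and VC3 to be used simultaneously, even though at most one of them occupies the stream at any instant.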

It is also noted that there is a similar issue for encoding groups. See section 8 of [I-D.ietf-clue-framework]:

"It is a protocol conformance requirement that the Encoding Groups must allow all the Captures in a particular Capture Scene Entry to be used simultaneously."

If "switching" is used then there is no need to send the encodings at the same time.

This needs to be clarified.

2.8. Conveying spatial information for switched/composed captures

CLUE currently allows the ability to signal spatial information related to a media capture. It is unclear in the current draft how this would work with switching/composition. Section 6.1 of [I-D.ietf-clue-framework] does say:

"For a switched capture that switches between different sections within a larger area, the area of capture should use coordinates for the larger potential area."

This describes a single capture, not the case where there are multiple switched captures. It appears to focus on segment switching rather than site switching and does not appear to cover "composed" (if it is related).

An Advertiser may or may not want to use common spatial attributes for the captures associated with a switched capture. For example, it may be beneficial for the Advertiser of a composed image to indicate that different captures have different capture areas in a virtual space.

This should be given consideration in the framework.

2.9. Consumer selection

Section 6.2.2 of version 9 of [I-D.ietf-clue-framework] indicates that an Advertiser may provide multiple values for the "scene-switch-policy" attribute and that the Consumer may choose and return the value it requires.

In version 9 of the framework there was no mechanism in CLUE for a Consumer to choose and return individual values from capture scene, CSE or media capture attributes.

In version 10 of the framework the text was updated to indicate that the Consumer could choose values from a list. It is not clear that this capability is needed, as the procedure only relates to "scene-switch-policy". The switching policy may be better specified by other means.

3. Proposal

As has been discussed above there are a number of issues with regards to the support of switched/composed captures/streams in CLUE particularly when considering MCUs. The authors believe that there is no single action that can address the above issues. Several options are discussed below. The options are not mutually exclusive.

1)
Introduce syntax to CLUE to better describe source captures
2)
Introduce updates to the XCON conferencing framework (e.g. the conference event package, the XCON data model etc.) to introduce CLUE concepts.
3)
Update CLUE to better describe the current suite of attributes, with the understanding that these provide limited information with respect to source information.

3.1. CLUE Syntax Updates

The authors believe that there are a number of requirements for this:

-
It should be possible to advertise the individual captures that make up a single switched/composed media stream before receiving the actual media stream.
-
It should be possible to describe the relationship between captures that make up a single switched/composed media stream.
-
It should be possible to describe this using CLUE semantics rather than with terms such as "site" or "segment" which need their own definition.

The authors also believe that whether media is composed, segment switched or site switched, the common element is that the media stream contains multiple captures from potentially multiple sources.

[I-D.ietf-clue-framework] does have the "Scene-switch-policy" attribute at a CSE level but, as described in section 2, it is not sufficient for several reasons. E.g. it is not possible to assign an encoding to a CSE, a CSE cannot reference captures from multiple scenes, and there is a relationship with STSs that needs to be considered.

In order to be able to fully express and support media streams with multiple captures, the authors propose a new type of capture, the "multiple content capture" (MCC). The MCC is essentially the same as audio or video captures in that it may have its own attributes; the main difference is that it can also include other captures. It indicates that the MCC capture is composed of other captures. This composition may be positional (i.e. segments/tiling) or temporal (switched) etc., as specified by a policy attribute. The MCC can be assigned an encoding. For example:

MCC1(VC1,VC2,VC3),[POLICY]

This would indicate that MCC1 is composed of 3 video captures according to the policy.

One further difference is that an MCC may reference individual captures from multiple scenes. For example:

CS#1(VC1,VC2)
CS#2(VC3,VC4)
CS#3(MCC1(VC1,VC3))

This would indicate that scene #3 contains an MCC that is composed from the individual captures VC1 and VC3. This allows the consumer to associate any capture scene properties from the original scene with the multiple content capture.

The MCC can be utilised by both normal endpoints and MCUs. For example, it would allow an endpoint to construct a mixed video stream that is a virtual scene with a composition of presentation video and individual captures.

This proposal does not consider any relation to the SIP conferencing framework.

The sections below provide more detail on the proposal.

3.1.1. Definitions

Multiple content capture: A media capture for audio or video which indicates that the capture contains multiple audio or video captures. Individual media captures may or may not be present in the resultant capture encoding, depending on time or space. Denoted as MCCn in the example cases in this document.

3.1.2. Multiple Content Capture Details

The MCC indicates that multiple captures are contained in one media capture by referencing the applicable individual media captures. Only one capture type (i.e. audio, video, etc.) is allowed in each MCC instance. The MCC contains a reference to the media captures as well as attributes associated with the MCC itself. The MCC may reference individual captures from other capture scenes. If an MCC is used in a CSE, that CSE may also reference captures from other Capture Scenes.

Note: Different Capture Scenes are not spatially related.

Each instance of the MCC has its own captureID, i.e. MCC1. This allows all the individual captures contained in the MCC to be referenced by a single ID.

The example below shows the use of a multiple content capture:

    CaptureScene1[VC1{attributes},
                  VC2{attributes},
                  VC3{attributes},
                  MCC1(VC1,VC2,VC3){attributes}]

This indicates that MCC1 is a single capture that contains the captures VC1, VC2 and VC3 according to any MCC1 attributes.

One or more MCCs may also be specified in a CSE. This allows an Advertiser to indicate that several MCC captures are used to represent a capture scene.

Note: Section 6.1/[I-D.ietf-clue-framework] indicates that "A Media Capture is associated with exactly one Capture Scene". For MCC this could be further clarified to indicate that "A Media Capture is defined in a capture scene and is given an advertisement unique identity. The identity may be referenced outside the Capture Scene that defines it through a multiple content capture (MCC)."

3.1.3. MCC Attributes

Attributes may be associated with the MCC instance and with the individual captures that the MCC references. A provider should avoid providing conflicting attribute values between the MCC and the individual captures. Where there is conflict, the attributes of the MCC override any that may be present in the individual captures.

There are two MCC-specific attributes, "MaxCaptures" and "Policy", which are used to give more information regarding when the individual captures appear and what policy is used to determine this.

The spatial attributes can further be used to determine how the individual captures "appear" within a stream. For example, a virtual scene could be constructed for the MCC capture with two video captures, a "MaxCaptures" attribute of 2 and an "area of capture" attribute giving the overall area. Each of the individual captures could then also include an "area of capture" attribute with a subset of the overall area. The consumer would then know the relative position of the content in the composed stream. For example, the above capture scene may indicate that VC1 has an x-axis capture area of 1-5, VC2 of 6-10 and VC3 of 11-15. The MCC capture may indicate an x-axis capture area of 1-15.
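
Expressed in the notation used in this document (the area of capture labels are illustrative only), this example could look like:

    CaptureScene1[VC1{AreaOfCapture:X(1-5)},
                  VC2{AreaOfCapture:X(6-10)},
                  VC3{AreaOfCapture:X(11-15)},
                  MCC1(VC1,VC2,VC3){MaxCaptures:3,
                                    AreaOfCapture:X(1-15)}]

A consumer receiving this would know that the composed stream places VC1 on the left, VC2 in the middle and VC3 on the right of the overall area.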

3.1.4. MaxCaptures

MaxCaptures:{integer}

This attribute is only associated with MCCs and indicates the maximum number of individual captures that may appear in a capture encoding at a time. It may be used to derive how the individual captures within the MCC are composed with regard to space and time. Individual content in the capture may be switched in time so that only one of the individual captures is shown (MaxCaptures:1). The individual captures may be composed so that they are all shown in the MCC (MaxCaptures:n).

For example:

MCC1(VC1,VC2,VC3),MaxCaptures:1

This would indicate that the Advertiser, in the capture encoding, would switch (or compose, depending on the policy) between VC1, VC2 and VC3, as there may be a maximum of only one capture at a time.
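
Conversely, a composed capture showing all of its inputs at once could be sketched as:

    MCC2(VC1,VC2,VC3),MaxCaptures:3

i.e. all three captures may be present in the capture encoding at the same time, e.g. tiled according to any spatial attributes.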

3.1.5. Composition policy

TBD - This attribute is to address what algorithm the endpoint/MCU uses to determine what appears in the MCC capture, e.g. loudest speaker, round robin.

3.1.6. Synchronisation

Note: The "scene-switch-policy" attribute has values that indicate "site-switch" or "segment-switch". The distinction between these is that "site-switch" indicates that, when there is mixed content, captures related to one endpoint appear together, whereas "segment-switch" indicates that captures from different endpoints could appear together. An issue is that a Consumer has no concept of "endpoints", only "capture scenes". Also, as highlighted, a Consumer has no method to return parameters for CSEs.

The use of MCCs enables the Advertiser to communicate to the Consumer that captures originate from different capture scenes. In cases where multiple MCCs represent a scene (i.e. multiple MCCs in a CSE), an Advertiser may wish to indicate that captures from one capture scene are present in the capture encodings of specified MCCs at the same time. Having an attribute at the capture level removes the need for CSE level attributes, which are problematic for Consumers.

Synch-id: {integer}

This MCC attribute indicates how the individual captures in multiple MCC captures are synchronised. To indicate that the capture encodings associated with MCCs contain captures from the same source at the same time, the Advertiser should set the same Synch-id on each of the concerned MCCs. It is the provider that determines what the source for the captures is. For example, when the provider is an MCU it may determine that each separate CLUE endpoint is a remote source of media.

For example:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(left),VC2(middle),VC3(right),
                  CSE1(VC1,VC2,VC3)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC4(left),VC5(middle),VC6(right),
                  CSE2(VC4,VC5,VC6)]
    CaptureScene3[MCC1(VC1,VC4){Synch-id:1}{encodinggroup1},
                  MCC2(VC2,VC5){Synch-id:1}{encodinggroup2},
                  MCC3(VC3,VC6){encodinggroup3},
                  CSE3(MCC1,MCC2,MCC3)]

Figure 1: Synchronisation Example

The above advertisement would indicate that MCC1, MCC2 and MCC3 make up a capture scene. There would be three capture encodings. Because MCC1 and MCC2 have the same Synch-id, encoding1 and encoding2 would together contain content from only capture scene 1 or only capture scene 2 at a particular point in time. Encoding3 would not be synchronised with encoding1 or encoding2.

Without this attribute it is assumed that multiple MCCs may present content from different sources at any particular point in time.

3.1.7. MCC and encodings

MCCs shall be assigned an encoding group and thus become a capture encoding. The captures referenced by the MCC do not need to be assigned to an encoding group. This means that all the individual captures referenced by the MCC will appear in the capture encoding according to any MCC attributes. This allows an Advertiser to specify capture attributes associated with the individual captures without the need to provide an individual capture encoding for each of the inputs.

If an encoding group is assigned to an individual capture referenced by the MCC it indicates that this capture may also have an individual capture encoding.

For example:

    CaptureScene1[VC1{encodinggroup1},
                  VC2,
                  MCC1(VC1,VC2){encodinggroup3}]

This would indicate that VC1 may be sent as its own capture encoding from encoding group1, or that it may be sent as part of a capture encoding from encoding group3 along with VC2.

Note: Section 8 of [I-D.ietf-clue-framework] indicates that every capture is associated with an encoding group. To utilise MCCs this requirement has to be relaxed.

3.1.8. MCCs and STSs

The MCC can be used in simultaneous transmission sets, therefore providing a means to indicate whether several multiple content captures can be provided at the same time. Captures within an MCC can be provided together, but not necessarily at the same time. Therefore specifying an MCC in an STS does not indicate that all the referenced individual captures may be present at the same time. The MaxCaptures attribute indicates the maximum number of captures that may be present.
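
For example, using an illustrative STS notation, two switched MCCs that can be provided at the same time could be sketched as:

    CaptureScene1[MCC1(VC1,VC2,VC3){MaxCaptures:1},
                  MCC2(VC4,VC5,VC6){MaxCaptures:1}]
    SimultaneousSet1(MCC1,MCC2)

The two MCC capture encodings may be sent simultaneously, while within each MCC at most one of the referenced captures is present at any instant.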

An MCC instance is limited to one media type, e.g. video, audio or text.

Note: This gets around the problem whereby the framework says that all captures (even switched ones) within a CSE have to be allowed by an STS to be sent at the same time.

3.1.9. Consumer Behaviour

On receipt of an advertisement with an MCC, the Consumer treats the MCC as it does other individual captures, with the following differences:

-
The Consumer would understand that the MCC is a capture that includes the referenced individual captures and that these individual captures would be delivered as part of the MCC's capture encoding.
-
The Consumer may utilise any of the attributes associated with the referenced individual captures and any capture scene attributes from where the individual capture was defined to choose the captures.
-
The Consumer may or may not want to receive all the indicated captures. It can therefore choose to receive a subset of the captures indicated by the MCC.

For example if the Consumer receives:

MCC1(VC1,VC2,VC3){attributes}

A Consumer should choose all the captures within an MCC; however, if the Consumer determines that it does not want VC3 it can return MCC1(VC1,VC2). If it wants all the individual captures then it returns just a reference to the MCC (i.e. MCC1).

Note: The ability to return a subset of captures is for consistency with the current framework. The framework says that a Consumer should choose all the captures from a CSE but allows it to select a subset (if the STS is provided). The intent is to provide equivalent functionality for an MCC.

3.1.10. MCU Behaviour

The use of MCCs allows the MCU to easily construct outgoing Advertisements. The following sections provide several examples.

3.1.10.1. Single content captures and multiple content captures in the same Advertisement

Four endpoints are involved in a CLUE session. To formulate an Advertisement to endpoint 4, the following Advertisements, received from endpoints 1 to 3, are used by the MCU. Note: The IDs overlap in the incoming advertisements. The MCU is responsible for making these unique in the outgoing advertisement.

    Endpoint 1 CaptureScene1[Description=AustralianConfRoom,
                             VC1(role=audience)]
    Endpoint 2 CaptureScene1[Description=ChinaConfRoom,
                             VC1(role=speaker),VC2(role=audience),
                             CSE1(VC1,VC2)]
    Endpoint 3 CaptureScene1[Description=USAConfRoom,
                             VC1(role=audience)]

Figure 2: MCU case: Received advertisements

Note: Endpoint 2 above indicates that it sends two streams.

If the MCU wanted to provide a multiple content capture containing the audiences of the 3 endpoints and the speaker, it could construct the following advertisement:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(role=audience)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC2(role=speaker),VC3(role=audience),
                  CSE1(VC2,VC3)]
    CaptureScene3[Description=USAConfRoom,
                  VC4(role=audience)]
    CaptureScene4[MCC1(VC1,VC2,VC3,VC4){encodinggroup1}]

Figure 3: MCU case: MCC with multiple audience and speaker

Alternatively, if the MCU wanted to provide the speaker as one stream and the audiences as another, it could assign an encoding group to VC2 in Capture Scene 2 and provide a CSE in Capture Scene 4:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(role=audience)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC2(role=speaker){encodinggroup2},
                  VC3(role=audience),
                  CSE1(VC2,VC3)]
    CaptureScene3[Description=USAConfRoom,
                  VC4(role=audience)]
    CaptureScene4[MCC1(VC1,VC3,VC4){encodinggroup1},
                  CSE2(MCC1,VC2)]

Figure 4: MCU case: MCC with audience and separate speaker

Therefore a Consumer could choose whether or not to have a separate "role=speaker" stream, and could choose which endpoints to see. If it wanted the second stream but not the Australian conference room, it could indicate the following captures in the Configure message:

    MCC1(VC3,VC4),VC2

Figure 5: MCU case: Consumer Response

3.1.10.2. Several multiple content captures in the same Advertisement

Multiple MCCs can be used where multiple streams are used to carry media from multiple endpoints. For example:

A conference has three endpoints D, E and F. Each endpoint has three video captures covering the left, middle and right regions of its conference room. The MCU receives the following advertisements from D and E:

    Endpoint D CaptureScene1[Description=AustralianConfRoom,
                             VC1(left){encodinggroup1},
                             VC2(middle){encodinggroup2},
                             VC3(right){encodinggroup3},
                             CSE1(VC1,VC2,VC3)]
    Endpoint E CaptureScene1[Description=ChinaConfRoom,
                             VC1(left){encodinggroup1},
                             VC2(middle){encodinggroup2},
                             VC3(right){encodinggroup3},
                             CSE1(VC1,VC2,VC3)]

Figure 6: MCU case: Multiple captures from multiple endpoints

Note: The Advertisements use the same identities. There is no coordination between endpoints, so it is likely there would be identity overlap between received advertisements.

The MCU wants to offer Endpoint F three capture encodings. Each capture encoding would contain a capture from either Endpoint D or Endpoint E, depending on the policy. The MCU would send the following:

    CaptureScene1[Description=AustralianConfRoom,
                  VC1(left),VC2(middle),VC3(right),
                  CSE1(VC1,VC2,VC3)]
    CaptureScene2[Description=ChinaConfRoom,
                  VC4(left),VC5(middle),VC6(right),
                  CSE2(VC4,VC5,VC6)]
    CaptureScene3[MCC1(VC1,VC4){encodinggroup1},
                  MCC2(VC2,VC5){encodinggroup2},
                  MCC3(VC3,VC6){encodinggroup3},
                  CSE3(MCC1,MCC2,MCC3)]

Figure 7: MCU case: Multiple MCCs for multiple captures

Note: The identities from Endpoint E have been renumbered so that they are unique in the outgoing advertisement.

3.2. Multipoint Conferencing Framework Updates

The CLUE protocol extends the endpoint description defined in the signalling protocol (SDP for SIP) by providing more information about the available media. XCON uses the information available from the signalling protocol, but it does not use SDP to distribute the participants' information or to control the multipoint conference. This is instead done using a data structure defined in XML, carried by the CCMP protocol over HTTP (note that CCMP could also be used over the CLUE channel if required). XCON provides a hierarchy that starts from conference information, which includes users having endpoints that have media.

The role is part of the user structure, while the mixing mode is part of the conference level information, specifying the mixing mode for each of the media types available in the conference.

CLUE, on the other hand, does not have such a structure: it starts from what is probably, in XCON terms, an endpoint that has media structured by scenes. There is no user or conference level information, though the "role" proposal tries to add user information (note that user information is different from the role in the call or the conference).

The XCON structure looks better when considering a multipoint conference. Yet it does not make sense to have such a data model for point-to-point calls. Therefore going only with this option means that capture attribute information will not be available for point-to-point calls.

3.3. Existing Parameter Updates

As discussed in section 2, the existing CLUE attributes surrounding switching and composition have a number of open issues. This section proposes changes to the text describing the attributes to better describe their usage and interaction. It is also assumed that, when using these attributes, there is no attempt to describe any component source capture information.

3.3.1. Composed

The current CLUE framework describes the "Composed" attribute as:

A boolean field which indicates whether or not the Media Capture is a mix (audio) or composition (video) of streams.
This attribute is useful for a media consumer to avoid nesting a composed video capture into another composed capture or rendering. This attribute is not intended to describe the layout a media provider uses when composing video streams.

It is proposed to update the description:

A boolean field which indicates whether or not the Media Capture has been composed from a mix of audio sources or several video sources. The sources may be local to the provider (i.e. video capture device) or remote to the provider (i.e. a media stream received by the provider from a remote endpoint). This attribute is useful for a media consumer to avoid nesting a composed video capture into another composed capture or rendering.
This attribute does not imply anything with regards to the attributes of the source audio or video except that the composed capture will be contained in a capture encoding from a single source. This attribute is not intended to describe the layout a media provider uses when composing video streams.
The "composed" attribute may be used in conjunction with a "switched" attribute when one or more of the dynamic sources is a composition.

3.3.2. Switched

The current CLUE framework describes the "Switched" attribute as:

A boolean field which indicates whether or not the Media Capture represents the (dynamic) most appropriate subset of a 'whole'. What is 'most appropriate' is up to the provider and could be the active speaker, a lecturer or a VIP.

It is proposed to update the description:

A boolean field which indicates whether the Media Capture represents a dynamic representation of the capture scene that contains the capture. It applies to both audio and video captures.
A dynamic representation is one that provides alternate capture sub-areas, within the overall area of capture associated with the capture, over time in a single capture encoding from one source. Which capture sub-area is contained in the capture encoding at a particular time depends on the provider policy. For example, a provider may encode the active speaker or lecturer based on volume level. It is not possible for consumers to associate attributes with a particular capture sub-area, nor to indicate which capture sub-area they require.
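
As an illustrative sketch (the attribute notation is only indicative):

    VC1{Switched:true,AreaOfCapture:X(1-15)}

The capture encoding would contain varying sub-areas of the 1-15 area over time, selected according to the provider's policy.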

3.3.3. Scene-switch-policy

The current CLUE framework describes the "Scene Switch Policy" attribute as:

Scene-switch-policy: {site-switch, segment-switch}
A media provider uses this scene-switch-policy attribute to indicate its support for different switching policies. In the provider's Advertisement, this attribute can have multiple values, which means the provider supports each of the indicated policies.
The consumer, when it requests media captures from this Capture Scene Entry, should also include this attribute but with only the single value (from among the values indicated by the provider) indicating the Consumer's choice for which policy it wants the provider to use. The Consumer must choose the same value for all the Media Captures in the Capture Scene Entry. If the provider does not support any of these policies, it should omit this attribute.
The "site-switch" policy means all captures are switched at the same time to keep captures from the same endpoint site together. Let's say the speaker is at site A and everyone else is at a "remote" site.
When the room at site A is shown, all the camera images from site A are forwarded to the remote sites. Therefore at each receiving remote site, all the screens display camera images from site A. This can be used to preserve full size image display, and also provide full visual context of the displayed far end, site A. In site switching, there is a fixed relation between the cameras in each room and the displays in remote rooms. The room or participants being shown is switched from time to time based on who is speaking or by manual control.
The "segment-switch" policy means different captures can switch at different times, and can be coming from different endpoints. Still using site A as where the speaker is, and "remote" to refer to all the other sites, in segment switching, rather than sending all the images from site A, only the image containing the speaker at site A is shown. The camera images of the current speaker and previous speakers (if any) are forwarded to the other sites in the conference.
Therefore the screens in each site are usually displaying images from different remote sites - the current speaker at site A and the previous ones. This strategy can be used to preserve full size image display, and also capture the non-verbal communication between the speakers. In segment switching, the display depends on the activity in the remote rooms - generally, but not necessarily based on audio / speech detection.

Firstly, it is proposed to rename this attribute to "Capture Source Synchronisation" in order to remove any confusion with the "switched" attribute, and also to remove the association with a scene, as any information regarding source scenes is lost. This is because the CSE represents the current scene. No change in functionality is intended by the renaming. It is proposed to describe the attribute as follows:

Capture Source Synchronisation: {source-synch,asynch}
Setting this attribute against a CSE indicates that each of the media captures specified within the CSE results in a capture encoding that contains media related to different remote sources. For example, if CSE1 contains VC1, VC2 and VC3 then there will be three capture encodings sent from the provider, each displaying captures from different remote sources. It is the provider that determines what the source for the captures is. For example, when the provider is an MCU it may determine that each separate CLUE endpoint is a remote source of media. Likewise it is the provider that determines how many remote sources are involved. However, it is assumed that each capture within the CSE will contain the same number and set of sources.
"Source-synch" indicates that each capture encoding related to the captures within the CSE contains media related to one remote source at the same point in time.
"Asynch" indicates that that each capture encoding may contain media related to any remote source at any point in time.
If a provider supports both synchronisation methods it should send separate CSEs containing separate captures, each CSE with a separate capture source synchronisation label.
A provider, when setting attributes against captures within a CSE marked with Capture Source Synchronisation, should consider that the media related to the remote sources may have its own separate characteristics. For example, each source may have its own capture area, and this needs to be taken into account in the provider's advertisement.
The "Switched" attribute may be used with a capture in a "Capture Source Synchronisation" marked CSE. This indicates that one or more of the remote sources associated with the capture has dynamic media that may change within its own time frame. i.e. the media from a remote source may change without an impact on the other captures.
The "Composed" attribute may be used with captures in the "Capture Source Synchronisation" marked CSE. This indicates the capture encoding contains a composition or multiple sources from one remote endpoint at a particular point in time.

Furthermore, if the current set of parameters is maintained, it is assumed that indicating the mechanism that triggers the switching of sources (e.g. loudest source, round robin) is not possible, because the Consumer only chooses captures, not sources. If this is purely up to the provider then such information would be superfluous. It is proposed to capture this as follows:

The trigger (or policy) that decides when a source is present is up to the provider. The ability to provide detailed information about sources is for further study.

3.3.4. MCU behaviour

When a CLUE endpoint acts as an MCU it implies the need for an advertisement aggregation function. That is, the endpoint receives CLUE advertisements from multiple endpoints and uses this information, its media processing capabilities and any policy information to form advertisements to the other endpoints.

Contributor's note: TBD. I think there needs to be a discussion here about the fact that source information is lost and how individual attributes are affected; i.e. it may be possible to simply aggregate language information, but it is not so simple when there is different spatial information. Capture encodings also need to be considered.

4. Acknowledgements

This template was derived from an initial version written by Pekka Savola and contributed by him to the xml2rfc project.

5. IANA Considerations

It is not expected that the proposed changes require any IANA registrations.

6. Security Considerations

It is not expected that the proposed changes present any additional security issues to the current framework.

7. References

7.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

7.2. Informative References

[RFC2629] Rose, M.T., "Writing I-Ds and RFCs using XML", RFC 2629, June 1999.
[RFC4575] Rosenberg, J., Schulzrinne, H. and O. Levin, "A Session Initiation Protocol (SIP) Event Package for Conference State", RFC 4575, August 2006.
[RFC6501] Novo, O., Camarillo, G., Morgan, D. and J. Urpalainen, "Conference Information Data Model for Centralized Conferencing (XCON)", RFC 6501, March 2012.
[RFC6503] Barnes, M., Boulton, C., Romano, S. and H. Schulzrinne, "Centralized Conferencing Manipulation Protocol", RFC 6503, March 2012.
[I-D.ietf-clue-framework] Duckworth, M., Pepperell, A. and S. Wenger, "Framework for Telepresence Multi-Streams", Internet-Draft draft-ietf-clue-framework-09, February 2013.
[I-D.ietf-clue-telepresence-use-cases] Romanow, A., Botzko, S., Duckworth, M., Even, R. and I. Communications, "Use Cases for Telepresence Multi-streams", Internet-Draft draft-ietf-clue-telepresence-use-cases-04, August 2012.
[I-D.ietf-clue-telepresence-requirements] Romanow, A. and S. Botzko, "Requirements for Telepresence Multi-Streams", Internet-Draft draft-ietf-clue-telepresence-requirements-03, January 2013.
[I-D.groves-clue-capture-attr] Groves, C., Yang, W. and R. Even, "CLUE media capture description", Internet-Draft draft-groves-clue-capture-attr-01, February 2013.

Authors' Addresses

Christian Groves (editor) Huawei Melbourne, Australia EMail: Christian.Groves@nteczone.com
Weiwei Yang Huawei P.R.China EMail: tommy@huawei.com
Roni Even Huawei Tel Aviv, Israel EMail: roni.even@mail01.huawei.com