CLUE WG                                                     M. Duckworth
Internet-Draft                                                   Polycom
Intended status: Informational                             June 15, 2013
Expires: December 17, 2013
CLUE Switching Mixer Example
draft-duckworth-clue-switching-example-00
This document presents an example multipoint use case scenario for CLUE. This example uses the media switching variety of the Topo-Mixer RTP topology. This example is intended to promote discussion about how to implement it using the CLUE Framework, and whether or not the framework as currently defined is sufficient to enable this use case.
This first version is incomplete, and is intended to raise questions and prompt discussion.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 17, 2013.
Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
This document presents an example multipoint use case scenario for CLUE. This example uses the media switching variety of the Topo-Mixer RTP topology. This example is intended to promote discussion about how to implement it using the CLUE Framework [I-D.ietf-clue-framework], and whether or not the framework as currently defined is sufficient to enable this use case.
The requirements document [I-D.ietf-clue-telepresence-requirements] calls for supporting both the composing and the switching approach to multipoint conferencing. This example uses the switching approach.
[I-D.ietf-clue-rtp-mapping] identifies the media-switching mixer as one of the RTP topologies relevant to CLUE. The media-switching variety of Topo-Mixer is described in Section 3.6.2 of [I-D.ietf-avtcore-rtp-topologies-update]. In this topology, the mixer provides one or more conceptual sources, each selecting one source at a time from the original sources. The mixer creates a conference-wide RTP session by sharing the original sources' SSRC values as CSRC values with all conference participants.
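As an informal illustration of this behavior (not taken from the referenced drafts; the class and field names below are invented for this sketch), a media-switching mixer conceptually re-originates packets from the currently selected source under its own SSRC, carrying the original source's SSRC as a CSRC:

   from dataclasses import dataclass, field
   from typing import List, Optional

   @dataclass
   class RtpPacket:
       ssrc: int                                       # sender's synchronization source
       csrcs: List[int] = field(default_factory=list)  # contributing sources
       payload: bytes = b""

   class ConceptualSource:
       """One conceptual source in a media-switching mixer (sketch only)."""

       def __init__(self, mixer_ssrc: int):
           self.mixer_ssrc = mixer_ssrc               # SSRC owned by the mixer
           self.selected_ssrc: Optional[int] = None   # currently switched-in source

       def select(self, original_ssrc: int) -> None:
           # The switching decision itself (e.g. speech activity) is made elsewhere.
           self.selected_ssrc = original_ssrc

       def forward(self, pkt: RtpPacket) -> Optional[RtpPacket]:
           # Forward only packets from the selected source, re-originated under
           # the mixer's SSRC, with the original SSRC exposed as a CSRC so the
           # conference-wide source identity is visible to all participants.
           if pkt.ssrc != self.selected_ssrc:
               return None
           return RtpPacket(ssrc=self.mixer_ssrc, csrcs=[pkt.ssrc],
                            payload=pkt.payload)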
The basic scenario for this example is a multipoint conference consisting of some traditional single-camera single-screen endpoints and some 3-camera multi-screen endpoints. Each endpoint receives multiple Capture Encodings that originated from several other endpoints. The multi-screen endpoints show the currently speaking endpoint’s video using a large area of the display screens, and also show other recent speakers in smaller size using less screen space.
Since the middlebox (the mixer) is of the switching variety, it does not do any video composition. The endpoints are responsible for composing the video streams to be rendered on the endpoint's display screens. The mixer sends several Capture Encodings to each endpoint, with those Capture Encodings originally coming from several other endpoints. So each endpoint receives many Capture Encodings, representing Media Captures that originate at other endpoints. The multi-camera endpoints send multiple Media Captures, while the single-camera endpoints send just one Media Capture. Each Media Capture could have multiple Capture Encodings, however.
The mixer selects which original sources it sends to the endpoints based on speech activity, using a policy defined by the mixer.
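The switching policy itself is left to the mixer. Purely as a hypothetical sketch of one such policy, the mixer could keep an ordered list of recent talkers and switch the most recent N of them toward the receivers:

   from collections import OrderedDict
   from typing import List

   class RecentTalkerPolicy:
       """Hypothetical mixer policy (sketch only): most recent talkers win."""

       def __init__(self, max_selected: int):
           self.max_selected = max_selected
           self._talkers = OrderedDict()   # endpoint id -> None, most recent last

       def speech_detected(self, endpoint_id: str) -> None:
           # Move (or add) the active talker to the most-recent position.
           self._talkers.pop(endpoint_id, None)
           self._talkers[endpoint_id] = None

       def selected_endpoints(self) -> List[str]:
           # Most recent talker first, limited to the number of original
           # sources the mixer switches toward the receivers.
           return list(reversed(self._talkers))[: self.max_selected]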
When completed, this example should be added to the examples in the Framework.
From the human user’s point of view, this example is a more specific case of the general multipoint scenario in [ref use cases]. Consider a conference with these endpoints:
Endpoint A – 4 screens, 3 cameras
Endpoint B – 3 screens, 3 cameras
Endpoint C – 3 screens, 3 cameras
Endpoint D – 3 screens, 3 cameras
Endpoint E – 1 screen, 1 camera
Endpoint F – 2 screens, 1 camera
Endpoint G – 1 screen, 1 camera
This example focuses on what the user in one of the 3-camera multi-screen endpoints sees. Call this person User A, at Endpoint A. There are 4 large display screens at Endpoint A. Whenever somebody at another site is speaking, all the video captures from that endpoint are shown on the large screens. If the talker is at a 3-camera site, then the video from those 3 cameras fills 3 of the screens. If the talker is at a single-camera site, then video from that camera fills one of the screens, while the other screens show video from other single-camera endpoints.
User A can also see video from other endpoints, in addition to the current talker, although much smaller in size. Endpoint A has 4 screens, so one of those screens shows up to 9 other Media Captures in a tiled fashion.
User B at Endpoint B sees a similar arrangement, except there are only 3 screens, so the 9 other Media Captures are spread out across the bottom of the 3 displays, in a picture-in-picture (PIP) format.
When somebody at a different endpoint becomes the current talker, User A and User B both see the video from the new talker appear in their large screen area, while the previous talker takes one of the smaller tiled or PIP areas. The person who is the current talker does not see themselves; they see the previous talker in their large screen area.
TBD - Diagrams go here
The Media Producer in the mixer sends a CLUE Advertisement to each endpoint in the conference. There are different possibilities for how the mixer might construct advertisements.
The Producer in the mixer can advertise one Capture Scene with many Capture Scene Entries (CSEs), each with a different number of Media Captures. If the Producer wants to send up to 12 Media Captures, it could advertise one CSE with 12 switched captures, one with 11, one with 10, and so on. These switched Media Captures are distinct from the Media Captures sent by the endpoints, but they get their media from those endpoint Media Captures (more precisely, from their Capture Encodings).
A Consumer could then pick the CSE that had the number of Media Captures the Consumer wants to receive.
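To make the structure of this option concrete, the following sketch builds such an advertisement. The Python classes are invented for illustration only; the names are borrowed from the framework terminology (Capture Scene, Capture Scene Entry, Media Capture):

   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class MediaCapture:
       capture_id: str            # a switched, conceptual capture, e.g. "VC0"

   @dataclass
   class CaptureSceneEntry:
       captures: List[MediaCapture]

   @dataclass
   class CaptureScene:
       entries: List[CaptureSceneEntry] = field(default_factory=list)

   def build_single_scene_advertisement(max_captures: int = 12) -> CaptureScene:
       # One Capture Scene containing one CSE with 12 switched captures,
       # one with 11, one with 10, and so on down to 1.
       scene = CaptureScene()
       for n in range(max_captures, 0, -1):
           scene.entries.append(CaptureSceneEntry(
               captures=[MediaCapture(f"VC{i}") for i in range(n)]))
       return scene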
Questions:
The Producer could advertise multiple scenes, each one representing a different level in the recent talker list. Each Scene could have a CSE with 3 Media Captures.
Scene 1: current and most recent talkers
Scene 2: next most recent talkers
Scene 3: next most recent talkers
Scene 4: next most recent talkers
The grouping of Media Captures, 3 at a time, into the CSEs in each Scene indicates the mixer is responsible for maintaining a useful spatial relationship between the original source Media Captures it switches into these conceptual Media Captures. The mixer provides spatial information, probably using “no scale” coordinates.
The mixer should use the priority attribute to indicate the Media Captures in Scene 1 are highest priority, Scene 2 is next highest, and so on.
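A comparable sketch of the multiple-scene advertisement follows, again with invented class names and with the assumption (made only for this sketch) that a lower numeric value means higher priority:

   from dataclasses import dataclass
   from typing import List

   @dataclass
   class MediaCapture:
       capture_id: str
       priority: int        # assumption for this sketch: lower value = higher priority

   @dataclass
   class CaptureSceneEntry:
       captures: List[MediaCapture]

   @dataclass
   class CaptureScene:
       description: str
       entries: List[CaptureSceneEntry]

   def build_multi_scene_advertisement(num_scenes: int = 4) -> List[CaptureScene]:
       # Scene 1: current and most recent talkers, at the highest priority;
       # each further scene: the next most recent talkers, at lower priority.
       scenes = []
       for level in range(1, num_scenes + 1):
           captures = [MediaCapture(f"VC{level}.{i}", priority=level)
                       for i in range(3)]        # one CSE with 3 captures
           scenes.append(CaptureScene(
               description=f"recent talker level {level}",
               entries=[CaptureSceneEntry(captures)]))
       return scenes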
Questions:
What other ways should be considered?
This section describes how the Endpoint Consumer selects Media Captures from the advertisement.
With the single-scene advertisement, the multi-screen Consumer knows it wants to receive 12 captures, so it picks the Media Captures in the CSE that has 12 captures. A single-screen endpoint might choose to receive only 1 capture, or perhaps 3 or 4, depending on how it wants to render video for showing to the user.
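Reusing the structures from the single-scene sketch above, that choice might look like this (illustration only):

   def choose_entry(scene, wanted):
       # Pick the CSE whose capture count best matches what this Consumer
       # wants to receive: the largest CSE not exceeding 'wanted'.
       candidates = [e for e in scene.entries if len(e.captures) <= wanted]
       if not candidates:
           return None
       return max(candidates, key=lambda e: len(e.captures))

   # A 4-screen endpoint asks for 12 captures; a single-screen endpoint
   # might ask for 1, 3 or 4 instead.
   # entry = choose_entry(build_single_scene_advertisement(), wanted=12)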
With the multiple-scene advertisement, the multi-screen Consumer again wants to receive 12 captures, so it picks the Media Captures with the highest priority first (one CSE of 3 captures), then the next highest (another CSE in another Scene with 3 captures), and so on until it has picked 12 captures.
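Reusing the multiple-scene structures sketched earlier, the Consumer could pick whole CSEs in priority order until it has enough captures (illustration only):

   def choose_by_priority(scenes, wanted=12):
       # Order the scenes by the priority of their captures (in this sketch
       # a lower value means higher priority), then take each scene's
       # 3-capture CSE until 'wanted' captures have been picked.
       ordered = sorted(scenes,
                        key=lambda s: min(c.priority
                                          for e in s.entries
                                          for c in e.captures))
       chosen = []
       for scene in ordered:
           for entry in scene.entries:
               if len(chosen) >= wanted:
                   return chosen
               chosen.extend(entry.captures)
       return chosen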
Questions:
For the multiple-scene approach, the spatial relationships can be handled in a straightforward manner through the spatial attributes the mixer puts in its advertisement. The mixer can ensure that when it switches Media Captures from a multi-camera source into its outgoing captures, it presents them in the same spatial order it described in the advertisement. When it switches captures from single-camera sources, it can likewise pick multiple single-camera sources and assign them to a consistent conceptual spatial relationship, even though they do not have a real physical relationship.
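As a small illustration of this (the capture identifiers and the fixed left-to-right ordering below are invented), the mixer can keep a fixed spatial order for the three conceptual captures of a CSE and map whichever original captures it switches in onto that order:

   # Fixed left-to-right order of the three conceptual captures in one CSE,
   # matching the spatial ("no scale") coordinates given in the advertisement.
   CONCEPTUAL_ORDER = ["VC-left", "VC-center", "VC-right"]

   def map_switched_sources(original_captures):
       # 'original_captures' is a list of original capture ids already in
       # left-to-right order: three captures from one 3-camera endpoint, or
       # up to three captures from different single-camera endpoints.
       return dict(zip(CONCEPTUAL_ORDER, original_captures))

   # A 3-camera talker fills all three positions in its real spatial order:
   #   map_switched_sources(["B:left", "B:center", "B:right"])
   # Three single-camera endpoints get a consistent, if artificial, ordering:
   #   map_switched_sources(["E:VC0", "F:VC0", "G:VC0"])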
The Consumer's renderer assigns the scene with the highest-priority captures to the largest areas on its display screens, and assigns each other scene to the smaller areas on its screens. These assignments can remain static; they do not need to change when the mixer switches between sources for these Media Captures.
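A final sketch of that Consumer-side assignment (the display area names are invented; the scene structures are the ones from the multiple-scene sketch):

   def assign_display_areas(scenes, large_areas, small_areas):
       # Give the highest-priority scene the large display areas and spread
       # the remaining scenes over the smaller areas.  The assignment is
       # static: it does not change when the mixer switches sources.
       ordered = sorted(scenes,
                        key=lambda s: min(c.priority
                                          for e in s.entries
                                          for c in e.captures))
       assignment = {ordered[0].description: list(large_areas)}
       remaining = list(small_areas)
       per_scene = max(1, len(remaining) // max(1, len(ordered) - 1))
       for i, scene in enumerate(ordered[1:]):
           assignment[scene.description] = remaining[i * per_scene:(i + 1) * per_scene]
       return assignment

   # Endpoint A: three whole screens as the large area, nine tiles on the
   # fourth screen as the smaller areas.
   # assign_display_areas(build_multi_scene_advertisement(),
   #                      large_areas=["screen1", "screen2", "screen3"],
   #                      small_areas=["tile%d" % i for i in range(9)])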
Thanks to Stephan Wenger, Rob Hansen, and Andy Pepperell for contributing to the ideas in this example.
[I-D.ietf-clue-framework]  Duckworth, M., Pepperell, A., and S. Wenger, "Framework for Telepresence Multi-Streams", Internet-Draft draft-ietf-clue-framework-10, May 2013.
[I-D.ietf-clue-telepresence-requirements]  Romanow, A. and S. Botzko, "Requirements for Telepresence Multi-Streams", Internet-Draft draft-ietf-clue-telepresence-requirements-03, January 2013.
[I-D.ietf-clue-rtp-mapping]  Even, R. and J. Lennox, "Mapping RTP streams to CLUE media captures", Internet-Draft draft-ietf-clue-rtp-mapping-00, February 2013.
[I-D.ietf-avtcore-rtp-topologies-update]  Westerlund, M. and S. Wenger, "RTP Topologies", Internet-Draft draft-ietf-avtcore-rtp-topologies-update-00, April 2013.
[RFC4575]  Rosenberg, J., Schulzrinne, H., and O. Levin, "A Session Initiation Protocol (SIP) Event Package for Conference State", RFC 4575, August 2006.
[RFC6501]  Novo, O., Camarillo, G., Morgan, D., and J. Urpalainen, "Conference Information Data Model for Centralized Conferencing (XCON)", RFC 6501, March 2012.