CLUE A. Romanow
Internet-Draft R. Hansen
Intended status: Standards Track Cisco Systems
Expires: December 2, 2012 A. Pepperell
Silverflare
B. Baldino
Cisco Systems
May 31, 2012
The need for an audio rendering tag mechanism in the CLUE Framework
draft-romanow-clue-audio-rendering-tag-00
Abstract
This draft is intended for discussion in the CLUE working group.
It proposes adding an audio rendering tag to the CLUE framework
[I-D.ietf-clue-framework], which makes it possible for the consumer
to correctly render audio with respect to video in a multistream
video conference. The proposed solution is a partial response to
CLUE Task #10, "Does Framework provide sufficient info for receiver?"
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on December 2, 2012.
Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
Romanow, et al. Expires December 2, 2012 [Page 1]
Internet-Draft Audio rendering tag for CLUE May 2012
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Motivation- the issue
2. Terminology
3. Audio Rendering Tag Mechanism
4. Use of the RTP header extension
5. Use case note
6. Security Considerations
7. Acknowledgements
8. IANA Considerations
9. References
9.1. Normative References
9.2. Informative References
Authors' Addresses
1. Motivation- the issue
A goal for CLUE audio is that listeners perceive the direction of a
sound source to be the same as that of the visual image of the
source; this is referred to as directional audio. In some situations
the existing CLUE mechanisms are adequate: when the provider
advertisement includes spatial information (point of origin and
capture area) that gives a static relationship between the video and
associated audio captures, the consumer can use that information to
place the audio correctly.
However, in some circumstances, for various reasons, the audio
and/or video spatial information is not sent in the provider
advertisement. Consider, for instance, a three-screen system
advertising three video captures and one switched audio capture,
where the audio is taken from the loudest of three microphones.
In this case, how will the consumer know how to associate the audio
with the correct video so it can be played out in the correct
location?
Here we suggest a simple mechanism -- audio rendering tagging.
When audio and video cannot be matched through provider advertisement
spatial information, we would like the ability to play out audio on
multiple loudspeakers, matching the position of the speaker in the
original scene. Also, the audio may be assigned to a loudspeaker in
real time. It may need to be mixed locally and played out on any
loudspeaker. For example, suppose the consumer wants to hear the top
3 speakers, regardless of where they are located remotely. If all 3
top speakers are coming from the left, then their audio needs to be
mixed, perhaps locally, and played out on the left.
Note: Several typical scenarios are described in Section 5, "Use case
note".
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119] and
indicate requirement levels for compliant implementations.
3. Audio Rendering Tag Mechanism
We propose an audio tagging mechanism in order to cope with a
changing mapping of the most significant audio and video participants
(i.e., normal MCU operations in the presence of more participants'
media streams than can be rendered simultaneously) and to get audio
played out correctly to multiple speakers. A consumer optionally
tells the provider an audio tag value corresponding to each of its
chosen video captures, which enables received audio to be associated
with the correct video stream, even when the set of audible
participants changes. This information is included with the consumer
request, so there is no need for additional CLUE message exchanges
(specifically, no additional provider capture advertisements or
consumer requests).
The audio tags are defined in the consumer request as opposed to in a
capture advertised by a provider. The reason for this is that it is
valid for a consumer to request a capture multiple times (with
different encodings, for example) and hence a method is required for
differentiating between these streams.
When the consumer configures the provider, saying which captures it
wants, it also optionally includes an audio tag with each capture
request. For example, VC1, ATag1; VC2, ATag2. When the provider
sends audio packets to the consumer, it includes the appropriate
audio tag in an RTP header extension. For example, if the provider
is sending audio packets that are associated with VC1, it tags the
packets with ATag1. The consumer can then play out the audio in a
position appropriate for video from VC1.
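The request-and-tag exchange described above can be sketched as follows. This is only an illustration under stated assumptions: the CaptureRequest class, its field names, and the tag_for_audio helper are hypothetical, the tag values are arbitrary, and CLUE message syntax is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CaptureRequest:
    """One entry in a consumer request (hypothetical structure)."""
    capture_id: str                  # e.g. "VC1"
    audio_tag: Optional[int] = None  # optional, chosen by the consumer


def tag_for_audio(requests: List[CaptureRequest], associated_video: str) -> List[int]:
    """Provider side: tags to attach to audio associated with a video capture.

    An empty result means no tagged request matches, in which case the
    provider would send the packet without the header extension.
    """
    return [r.audio_tag for r in requests
            if r.capture_id == associated_video and r.audio_tag is not None]


# Consumer request: VC1 with ATag1, VC2 with ATag2.
consumer_request = [CaptureRequest("VC1", 1), CaptureRequest("VC2", 2)]

print(tag_for_audio(consumer_request, "VC1"))  # [1]
print(tag_for_audio(consumer_request, "VC3"))  # []
```

Because the tag belongs to the request rather than to the advertised capture, a capture requested twice (for example at two encodings) can carry two different tags, and the helper returns both.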
Suppose that several audio streams need to be played out through the
same speaker - for example, the 3 audio streams (AC1, AC2, AC3) need
to be played out at the speaker associated with VC1. The provider
would send:
AC1 ATag1
AC2 ATag1
AC3 ATag1
AC1, AC2 and AC3 are all played out on the same speaker, the audio
output associated with VC1. This takes care of the issue of dynamic
audio output - assigning the right speaker to audio streams.
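On the consumer side, the AC1/AC2/AC3 example amounts to grouping received audio by tag before mixing. The sketch below assumes a hypothetical consumer-local table, tag_to_output, that maps each tag the consumer chose to one of its own audio outputs; the output names are placeholders and the mixing itself is not shown.

```python
from collections import defaultdict

# Hypothetical consumer-local state: the audio output that belongs to the
# video capture each tag was requested with.
tag_to_output = {1: "left", 2: "centre", 3: "right"}


def group_by_output(packets, tag_to_output):
    """Group (audio_capture, tag) pairs by the local output they map to.

    Pairs whose tag is unknown to the consumer are skipped.
    """
    groups = defaultdict(list)
    for audio_capture, tag in packets:
        output = tag_to_output.get(tag)
        if output is not None:
            groups[output].append(audio_capture)
    return dict(groups)


# AC1, AC2 and AC3 all arrive tagged with ATag1.
received = [("AC1", 1), ("AC2", 1), ("AC3", 1)]
print(group_by_output(received, tag_to_output))
# {'left': ['AC1', 'AC2', 'AC3']}
```

Each group would then be mixed and played out on the single output it names.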
Figure 1 illustrates an example showing 3 screens, each with a main
video and 3 PIPs. Below each screen is a list of its video captures
(VCs) with the associated audio tag.
   ---------------------- 3 Screens -----------------------

   +------------------+------------------+------------------+
   |                  |                  |                  |
   |       VC1        |       VC2        |       VC3        |
   |                  |                  |                  |
   |                  |                  |                  |
   | +---+---+---+    | +---+---+---+    | +----+----+----+ |
   | |VC4|VC5|VC6|    | |VC7|VC8|VC9|    | |VC10|VC11|VC12| |
   +------------------+------------------+------------------+

     VC1                VC2                VC3
     VC4  Audio tag 1   VC7  Audio tag 2   VC10  Audio tag 3
     VC5                VC8                VC11
     VC6                VC9                VC12

        Figure 1: Audio rendering tags for 3 screen example

The provider may choose not to include the header extension in an
audio packet, signaling that there is no association between the
current audio and current video (e.g., an audio-only participant).
It may also include more than one audio tag in the header extension,
signaling that this audio is associated with multiple current video
participants, due perhaps to a capture being received multiple times
at different resolutions, or to two video captures that both include
the current speaker.
This mechanism also allows multiple audio streams to be associated
with a single video stream (e.g., for a composed video stream); this
simply requires the appropriate audio packets to be tagged with the
same tag.
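The three cases above (no tag, one tag, several tags) could be handled on receipt roughly as follows. The tag_to_output table, the output names, and the default_output fallback are assumptions made for illustration; this draft does not say how untagged audio should be placed.

```python
def outputs_for_packet(tags, tag_to_output, default_output="all"):
    """Return the local outputs on which a packet's audio should be played.

    tags is the list of audio tags carried in the packet's header
    extension, or an empty list when the extension is absent.
    """
    if not tags:
        # No association with any tagged video: placement is a local choice.
        return [default_output]
    # Several tags may map to the same output (e.g. one capture requested
    # at two resolutions), so de-duplicate while preserving order.
    outputs = []
    for tag in tags:
        output = tag_to_output.get(tag)
        if output is not None and output not in outputs:
            outputs.append(output)
    return outputs or [default_output]


# Hypothetical mapping: tags 1 and 4 were both requested for the left screen.
tag_to_output = {1: "left", 2: "centre", 4: "left"}

print(outputs_for_packet([], tag_to_output))      # ['all']
print(outputs_for_packet([1, 4], tag_to_output))  # ['left']
print(outputs_for_packet([1, 2], tag_to_output))  # ['left', 'centre']
```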
4. Use of the RTP header extension
We propose that audio tags are integers between 0 and 255,
optionally set by the consumer per requested capture. Each tag then
fits in one byte, which allows up to 16 tags to be included in a
single element of a one-byte-header RTP header extension [RFC5285].
An example header extension for an audio packet with one tag
follows. The audio tag extension is ID1. The example includes
another header extension (ID0) to show how the proposal would
interact with [I-D.lennox-clue-rtp-usage]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     0xBE      |     0xDE      |           length=1            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ID0  |  L=0  |     data      |  ID1  |  L=0  |      Tag      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        RTP ext headers for audio rendering tag and capture ID

The lack of the RTP header extension in a packet means that the audio
packet is not associated with any of the requested video streams that
included audio tags.
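A minimal sketch of encoding and decoding such a header extension block, using the RFC 5285 one-byte-header layout, is given below. The extension ID values are assumptions (RFC 5285 negotiates them per session, reserving 0 for padding and 15 for future use), the function names are hypothetical, and carrying several tags as consecutive bytes of one element is one possible reading of the proposal.

```python
import struct

# Assumed extension IDs; real values are negotiated per session and must
# lie in 1-14 for the one-byte-header form.
CAPTURE_ID_EXT = 1   # stands in for "ID0" in the figure above
AUDIO_TAG_EXT = 2    # stands in for "ID1" in the figure above


def build_extension(elements):
    """Encode (ext_id, data) pairs as an RFC 5285 one-byte-header block."""
    body = bytearray()
    for ext_id, data in elements:
        if not 1 <= ext_id <= 14 or not 1 <= len(data) <= 16:
            raise ValueError("invalid one-byte header extension element")
        body.append((ext_id << 4) | (len(data) - 1))  # L field is length minus one
        body += data
    body += bytes(-len(body) % 4)                     # zero-pad to a 32-bit boundary
    return struct.pack("!HH", 0xBEDE, len(body) // 4) + bytes(body)


def parse_audio_tags(block):
    """Return the audio tags carried in an extension block (empty list if none)."""
    profile, words = struct.unpack_from("!HH", block)
    if profile != 0xBEDE:
        return []
    body, pos, tags = block[4:4 + 4 * words], 0, []
    while pos < len(body):
        if body[pos] == 0:            # padding byte
            pos += 1
            continue
        ext_id, length = body[pos] >> 4, (body[pos] & 0x0F) + 1
        if ext_id == 15:              # reserved ID: stop processing
            break
        if ext_id == AUDIO_TAG_EXT:
            tags.extend(body[pos + 1:pos + 1 + length])
        pos += 1 + length
    return tags


# One capture-ID byte plus one audio tag, as in the figure (values arbitrary).
block = build_extension([(CAPTURE_ID_EXT, b"\x07"), (AUDIO_TAG_EXT, bytes([1]))])
print(block.hex())              # bede000110072001
print(parse_audio_tags(block))  # [1]
```

A block that carries no audio tag element parses to an empty list, which the consumer would treat the same way as a packet with no header extension at all.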
5. Use case note
o An endpoint can receive multiple video and audio streams and
render complex layouts locally.
o It may have a wide display area so directional audio is important.
o It may have one loudspeaker per display, or perhaps some entirely
different multi-loudspeaker setup known only to the endpoint
itself.
o The endpoint may therefore have the capability of playing back
audio from a wide range of positions, either from a few fixed
zones or with fine granularity.
o It may do so by routing a sound source to a single loudspeaker, by
panning between pairs of loudspeakers, or by some other advanced
distribution scheme involving several or even all loudspeakers.
6. Security Considerations
TBD
7. Acknowledgements
Thanks to Johan Nielsen for discussions and adding the Use case
note.
8. IANA Considerations
TBD
9. References
9.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
9.2. Informative References
[I-D.ietf-clue-framework]
Romanow, A., Duckworth, M., Pepperell, A., and B. Baldino,
"Framework for Telepresence Multi-Streams",
draft-ietf-clue-framework-05 (work in progress), May 2012.
[I-D.lennox-clue-rtp-usage]
Lennox, J., Witty, P., and A. Romanow, "Real-Time
Transport Protocol (RTP) Usage for Telepresence Sessions",
draft-lennox-clue-rtp-usage-03 (work in progress),
March 2012.
Authors' Addresses
Allyn Romanow
Cisco Systems
San Jose, CA 95134
USA
Email: allyn@cisco.com
Robert Hansen
Cisco Systems
Langley,
UK
Email: rohanse2@cisco.com
Andy Pepperell
Silverflare
Email: andy.pepperell@silverflare.com
Brian Baldino
Cisco Systems
San Jose, CA 95134
USA
Email: bbaldino@cisco.com