DISPATCH A. Amirante
Internet-Draft University of Napoli
Expires: June 17, 2013 T. Castaldi
L. Miniero
Meetecho
S P. Romano
University of Napoli
December 14, 2012
Session Recording for Conferences using SMIL
draft-romano-dcon-recording-07
Abstract
This document deals with session recording, specifically the
recording of multimedia conferences, both centralized and
distributed. Each medium involved is recorded separately and then
properly tagged. SMIL [W3C.CR-SMIL3-20080115] metadata is used to
put all the separate recordings together and handle their
synchronization, as well as the possibly asynchronous opening and
closing of media within the context of a conference. An interested
user can subsequently feed this SMIL metadata to a compliant player
in order to passively receive a playout of the whole multimedia
conference session. The motivation for this document comes from our
experience with our conferencing framework, Meetecho, for which we
implemented a recording functionality.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 17, 2013.
Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
Amirante, et al. Expires June 17, 2013 [Page 1]
Internet-Draft DCON Session Recording December 2012
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Conventions
3. Terminology
4. Recording
  4.1. Audio/Video
  4.2. Chat
  4.3. Slides
  4.4. Whiteboard
5. Tagging
  5.1. SMIL Head
  5.2. SMIL Body
    5.2.1. Audio/Video
    5.2.2. Chat
    5.2.3. Slides
    5.2.4. Whiteboard
6. Playout
7. Security Considerations
8. Acknowledgements
9. References
Authors' Addresses
1. Introduction
This document deals with session recording, specifically the
recording of multimedia conferences, both centralized and
distributed. Each medium involved is recorded separately and then
properly tagged. Such functionality is required in many
conferencing systems, and is of great interest to the XCON [RFC5239]
Working Group. The motivation for this document comes from our
experience with our conferencing framework, Meetecho, for which we
implemented a recording functionality. Meetecho is a standards-based
conferencing framework, and so we tried our best to implement
recording in a standard fashion as well.
In the approach presented in this document, SMIL
[W3C.CR-SMIL3-20080115] metadata is used to put all the separate
recordings together and handle their synchronization, as well as the
possibly asynchronous opening and closing of media within the context
of a conference. An interested user can subsequently feed this SMIL
metadata to a compliant player in order to passively receive a
playout of the whole multimedia conference session.
The document presents the approach by describing the required steps
in sequence. Section 4 presents the recording step, with an overview
of how each medium involved might be recorded and stored for future
use. As will be explained in the following sections, existing
approaches might be exploited to achieve these steps (e.g.,
MEDIACTRL [RFC5567]). Then, Section 5 describes the tagging process,
showing how each medium can be addressed in a SMIL metadata file, with
specific focus on the timing and inter-media synchronization aspects.
Finally, Section 6 describes how a potential player for the recorded
session can be implemented and what it is supposed to achieve.
2. Conventions
In this document, the key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as
described in BCP 14, RFC 2119 [RFC2119] and indicate requirement
levels for compliant implementations.
3. Terminology
TBD.
4. Recording
When a multimedia conference is realized over the Internet, several
media might be involved at the same time. Moreover, these media might
come and go asynchronously during the lifetime of the same
conference. Consequently, if such a conference needs to be recorded
in order to allow a subsequent, possibly offline, playout, these
media need to be recorded in a format that is aware of all the
timing-related aspects. A typical example is a videoconference with
slide sharing. While audio and video have a life of their own, slide
changes might be triggered at a completely different pace. In
addition, the start of a slideshow might occur much later than the
start of the audio/video session. All these requirements must be
taken into account when dealing with session recording in a
conference. It is also important that all the individual recordings
be taken in a standard fashion, in order to achieve the maximum
compatibility among different solutions and avoid any proprietary
mechanism or approach that could prevent a successful playout later
on.
In this document, we present our approach towards media recording in
a conference. Specifically, we will deal with the recording of the
following media:
o audio and video streams (in Section 4.1);
o text chats (in Section 4.2);
o slide presentations (in Section 4.3);
o whiteboards (in Section 4.4).
Additional media that might be involved in a conference (e.g. desktop
or application sharing) are not presented in this document, and their
description is left to future extensions.
4.1. Audio/Video
In a conferencing system compliant with [RFC5239], audio and video
streams contributed by participants are carried in RTP channels
[RFC3550]. These RTP channels may or may not be secured (e.g., by
means of SRTP/ZRTP). Whether or not these channels are secured is
not an issue in this case. In fact, as is usually the case, all the
participants terminate their media streams at a central point (a
mixer entity), with which they would have a secured connection. This
means that the mixer would get access to the unencrypted payloads,
and would be able to mix and/or store them accordingly.
From a high-level topology point of view, this is how a recorder for
audio and video streams could be envisaged:
          SIP    +------------+  SIP
      /----------|  XCON AS   |--------
     /           +------------+        \
    /                   |MEDIACTRL      \
   /                    |                \
+-----+              +-----+              +-----+
|     |     RTP      |     |     RTP      |     |
|UA-A +<------------>+Mixer+<------------>+UA-B |
|     |              |     |              |     |
+-----+              +-++--+              +-----+
                       | |
            RTP UA-A   | |  RTP UA-B (Rx+Tx)
            (Rx+Tx)    V V
                  +----------+
                  |          |
                  | Recorder |
                  |          |
                  +----------+
Figure 1: Audio/Video Recorder
[Editors' Note: this is a slightly modified version of the
topology proposed on the DISPATCH mailing list,
http://www.ietf.org/mail-archive/web/dispatch/current/
msg00256.html
where the Application Server has been specialized in an XCON-aware
AS, and the AS<->Mixer protocol is the Media Control Channel
Framework protocol (CFW) specified in [RFC6230].]
That said, actually recording audio and video streams in a conference
may be accomplished in several ways. Two different approaches might
be highlighted:
1. recording each contribution from/to each participant in a
separate file (Figure 2);
2. recording the overall mix (one for audio and one for video, or
more if several mixes for the same media type are available) in a
dedicated file (Figure 3).
                              +-------+
                              | UAC-C |
                              +-------+
                                  "
                          C (RTP) "
                                  "
                                  "
                                  v
+-------+      A (RTP)       +----------+       B (RTP)      +-------+
| UAC-A |===================>| Recorder |<===================| UAC-B |
+-------+                    +----------+                    +-------+
                                  *
                                  *
                                  *
                                  ****> A.gsm, A.h263
                                  ****> B.g711, B.h264
                                  ****> C.amr
Figure 2: Recording individual streams
                              +-------+
                              | UAC-C |
                              +-------+
                                  "
                          C (RTP) "
                                  "
                                  "
                                  v
+-------+      A (RTP)       +----------+       B (RTP)      +-------+
| UAC-A |===================>| Recorder |<===================| UAC-B |
+-------+                    +----------+                    +-------+
                                  *
                                  *
                                  *
                                  ****> (A+B+C).wav, (A+B+C).h263
Figure 3: Recording mixed streams
Of the two, the second is probably more feasible. In fact, the first
would require a potentially vast amount of separate recordings which
would need to be subsequently muxed and correlated to each other.
Besides, within the context of a multimedia conference, most of the
time the streams are already mixed for all the participants, and so
recording the mix directly would be a clear advantage. Such an
approach, of course, assumes that all the streams pass through a
central point where the mixing occurs: this is the case depicted in
Figure 1. The recording would take place at that point. Such a
central point, the mixer (which in this case would also act as the
recorder, or a frontend to it), might be a MEDIACTRL-based [RFC5567]
Media Server. Considering the similar nature of audio and video
(both being RTP-based and probably mixed by the same entity), they
are analysed in the same section of this document. The same applies
to tagging and playout as well. It is important to note that if any
policy is involved (e.g., moderation by means of BFCP [RFC4582]), the
mixer would take it into account when recording. In fact, the same
policies applied to the actual conference with respect to the
delivery of audio and video to the participants need to be enforced
for the recording as well.
In a more general way, if the mixer does not support a direct
recording of the mixes it prepares, recording a mix can be achieved
by attaching the recorder entity (whatever it is) as a passive
participant to the conference. This would allow the recorder to
receive all the involved audio and video streams already properly
mixed, with policies already taken into consideration. This approach
is depicted in Figure 4.
                             +-------+
                             |  UAC  |
                             |   C   |
                             +-------+
                                "  ^
                        C (RTP) "  "
                                "  "
                                "  " A+B (RTP)
                                v  "
+-------+      A (RTP)       +--------+     A+C (RTP)      +-------+
|  UAC  |===================>| Media  |===================>|  UAC  |
|   A   |<===================| Server |<===================|   B   |
+-------+     B+C (RTP)      +--------+      B (RTP)       +-------+
                                 "
                                 "
                                 " A+B+C (RTP)
                                 "
                                 v
                            +----------+
                            | Recorder |
                            +----------+
                                 *
                                 ****> (A+B+C).wav, (A+B+C).h263
Figure 4: Recorder as a passive participant
Whether or not the mixer is MEDIACTRL-based, it's quite likely that
the AS handling the multimedia conference business logic has some
control over the mixing involved. This means it can request the
recording of each available audio and/or video mix in a conference,
if only by adding the passive participant as mentioned above.
Besides, events occurring at the media level or business logic in the
AS itself allow the AS to take note of timing information for each of
the recorded media. For instance, the AS may take note of when the
video mixing started, in order to properly tag the video recording in
the tagging phase. Both the recordings and the timing events list
would subsequently be used in order to prepare the metadata
information of the audio and video in the overall session recording
description. Such a phase is described in Section 5.2.1.
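As a rough illustration of the timing notes mentioned above, the
following Python sketch shows one way an AS might keep a list of
events as offsets from the conference start. The class and method
names (TimingEventLog, note, span) and the "start"/"stop" labels are
hypothetical; they are not defined by XCON or MEDIACTRL. Clock values
are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TimingEventLog:
    """Hypothetical per-conference list of timing events kept by the AS."""
    conference_start: float                      # absolute time, in seconds
    events: list = field(default_factory=list)   # (offset, medium, event)

    def note(self, medium: str, event: str, now: float) -> float:
        """Store an event as an offset from the conference start."""
        offset = now - self.conference_start
        self.events.append((offset, medium, event))
        return offset

    def span(self, medium: str):
        """Return (begin, end) offsets for a medium, or None if incomplete."""
        begin = end = None
        for offset, name, event in self.events:
            if name != medium:
                continue
            if event == "start" and begin is None:
                begin = offset
            elif event == "stop":
                end = offset
        if begin is None or end is None:
            return None
        return begin, end


log = TimingEventLog(conference_start=1000.0)
log.note("video-mix", "start", now=1015.0)
log.note("video-mix", "stop", now=1400.0)
print(log.span("video-mix"))
```

The resulting pair of offsets is the kind of information that the
tagging phase would later need for the 'begin' and 'end' attributes.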
In a MEDIACTRL Media Server, such a functionality might be
accomplished by means of the Mixer Control Package
[I-D.ietf-mediactrl-mixer-control-package]. At the end of the
conference, URLs to the actual recordings would be made available for
the AS to use. The AS might then subsequently access those
recordings according to its business logic, e.g. to store them
somewhere else (the MS storage might be temporary) or to implement an
offline transcoding and/or mixing of all the recordings in order to
obtain a single file representative of the whole audio/video
participation in the conference. Practical examples of these
scenarios are presented in [I-D.ietf-mediactrl-call-flows].
Of course, if the recording of a mix is not possible or desired, one
could still fall back to the first approach, that is, individually
recording all the incoming contributions. This is the case, for
instance, for conferencing systems which don't implement video
mixing, but just rely instead on switching/forwarding the potentially
several video streams to each participant. This functionality can
also be achieved by means of the same control package previously
introduced, since it allows for the recording of both mixes and
individual connections. Once the conference ends, the AS can then
decide what to do with the recordings, e.g., mix them all together
offline (thus obtaining an overall mix) or leave them as they are.
The tagging process would then take the decision into account, and
address the resulting media accordingly.
4.2. Chat
What has been said about audio and video partially applies to text
chats as well. In fact, just as a central mixer is usually involved
for audio and video, for instant messaging the contributions by all
participants most of the time pass through a central node from
where they are forwarded to the other participants. This is the
case, for instance, for XMPP [RFC3920] and MSRP [RFC4975] based text
conferences. If so, recording the text part of a conference is not
hard to achieve either. The AS just needs to implement some form of
logging, in order to store all the messages flowing through the text
conference central node, together with information on the senders of
these messages and timing-related information. Of course, the AS may
not directly be the text conference mixer: the same considerations
apply, however, in the sense that the remote mixer must be able to
implement the aforementioned logging, and must be able to receive
related instructions from the controlling AS.
Besides, considering the possible protocol-agnostic nature of the
conferencing system (as envisaged in [RFC5239]), several different
instant messaging protocols may be involved in the same conference.
Just as the conferencing system would act as a protocol gateway
during the lifetime of the conference (i.e., provide MSRP users with
the text coming from XMPP participants and vice versa), all the
contributions coming from the different instant messaging protocols
would need to be recorded in the same log, and in the same format, to
avoid ambiguity later on.
An example of a recorder for instant messaging is presented in
Figure 5.
                              +-------+
                              | UAC-C |
                              +-------+
                                  ^
                         C (MSRP) " '10.11.24 Hi!'
                                  "
                                  "
                                  v
+-------+      A (XMPP)      +----------+      B (IRC)       +-------+
| UAC-A |<==================>| Recorder |<==================>| UAC-B |
+-------+  '10.11.26 Hey C'  +----------+ '10.11.30 Hey man' +-------+
                                  *
                                  *
                                  *     [..]
                                  ****> 10.11.24 <User C> Hi!
                                  ****> 10.11.26 <User A> Hey C
                                  ****> 10.11.30 <User B> Hey man
                                        [..]
Figure 5: Recording a text conference
The same considerations already mentioned about optional policies
involved apply to text conferences as well: i.e., if a UAC is not
allowed to contribute text to the chat, this contribution is excluded
both from the mix the other participants receive and from the ongoing
recording.
Considerations about the format of the recording are left to
Section 5.2.2. Until then, we just assume the AS has a way to record
text conferences somehow in a format it is familiar with. This
format would subsequently be converted to another, standard, format
that a player would be able to access.
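The logging discussed in this section can be sketched as follows.
This is only a minimal illustration, assuming an internal format of
our own choosing: the ChatEntry and ChatLog names, the 'allowed' flag
standing in for a policy decision, and the one-line rendering are all
hypothetical. Messages arriving over different protocols end up in
the same log, in the same format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatEntry:
    offset: float      # seconds since the start of the conference
    sender: str        # conference-level identity of the sender
    protocol: str      # protocol the message arrived on, e.g. "XMPP"
    text: str


class ChatLog:
    """Hypothetical protocol-independent log kept at the central node."""

    def __init__(self, conference_start: float):
        self.conference_start = conference_start
        self.entries = []

    def record(self, sender, protocol, text, now, allowed=True):
        """Log one message; contributions blocked by policy are skipped."""
        if not allowed:
            return None
        entry = ChatEntry(now - self.conference_start, sender, protocol, text)
        self.entries.append(entry)
        return entry

    def lines(self):
        """Render entries in time order, one line per message."""
        ordered = sorted(self.entries, key=lambda e: e.offset)
        return [f"{e.offset:.0f}s <{e.sender}> {e.text}" for e in ordered]


log = ChatLog(conference_start=0.0)
log.record("User C", "MSRP", "Hi!", now=24.0)
log.record("User A", "XMPP", "Hey C", now=26.0)
log.record("User B", "IRC", "Hey man", now=30.0)
log.record("User D", "XMPP", "(muted)", now=31.0, allowed=False)
print("\n".join(log.lines()))
```

Keeping offsets rather than wall-clock times is a design choice made
here so that the entries map directly onto the session timeline used
in the tagging phase.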
4.3. Slides
Another medium typically available in a multimedia conference over
the Internet is the slide presentation. In fact, slides, whatever
format they're in, are still the most common way of presenting
something within a collaboration framework. The problem is that,
most of the time, these slides are deployed in a proprietary way
(e.g., Microsoft PowerPoint and the like). This means that, besides
the recording aspect of the issue, the delivery itself of such
slides can be problematic when considered in a standards-based
conferencing framework.
Considering that no standard way of implementing such a functionality
is commonly available yet, we assume that a conferencing framework
makes such slides available to the participants in a conference as a
slideshow, that is, a series of static images whose appearance might
be dictated by a dedicated protocol. For instance, a presenter may
trigger the change of a slide by means of an instant messaging
protocol, providing each authorized participant with a URL from
which to get the current slide, with optional metadata to describe
its content.
An example is presented in Figure 6. The presenter has previously
uploaded a presentation in a proprietary format. The presentation
has been converted to images and a description of the new format has
been sent back to the presenter (e.g., as XML metadata). At this
point, the presenter makes use of XMPP to inform the other
participants about the current slide, by providing an HTTP URL to the
related image.
                            +-----------+
                            | Presenter |
                            +-----------+
                                  "
                           (XMPP) " Current presentation: f44gf
                                  " Current slide number: 4
                                  " URL: http://example.com/f44gf/4.jpg
                                  "
                                  v
+-------+       (XMPP)       +----------+       (XMPP)       +-------+
| UAC-A |<===================| ConfServ |===================>| UAC-B |
+-------+                    +----------+                    +-------+
    |                                                            |
    | HTTP GET (http://example.com/f44gf/4.jpg)                  |
    v                  HTTP GET (http://example.com/f44gf/4.jpg) |
                                                                 v
Figure 6: Presentation sharing via XMPP
Under this assumption, the recording of each slide presentation
would be relatively simple to achieve. In fact, the AS would just
need to have access to the set of images (with the optional metadata
involved) of each presentation, and to the additional information
related to presenters and to when each slide was triggered. For
instance, the AS may take note of the fact that slide 4 from
presentation "f44gf" of the example above has been presented by UAC
"spromano" from second 56 to second 302 of the conference.
Properly recording all those events would allow for a subsequent
tagging, thus allowing for the integration of this medium in the
whole session recording description together with the other media
involved. This phase will be described in Section 5.2.3.
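To make the bookkeeping concrete, the sketch below derives display
intervals from a list of slide-change events. The function name, the
tuple layout, the second change event (slide 5) and the conference
end time of 600 seconds are assumptions made for the sake of the
example; only slide 4 of "f44gf", presented by "spromano" from second
56 to second 302, comes from the text above.

```python
def slide_intervals(changes, conference_end):
    """Turn slide-change events into display intervals.

    `changes` is a list of (offset_seconds, presentation, slide, presenter)
    tuples; each slide is assumed to stay on screen until the next change.
    Returns (presentation, slide, presenter, begin, end) tuples.
    """
    ordered = sorted(changes, key=lambda c: c[0])
    intervals = []
    for index, (begin, presentation, slide, presenter) in enumerate(ordered):
        if index + 1 < len(ordered):
            end = ordered[index + 1][0]
        else:
            end = conference_end
        intervals.append((presentation, slide, presenter, begin, end))
    return intervals


changes = [
    (56, "f44gf", 4, "spromano"),
    (302, "f44gf", 5, "spromano"),
]
for interval in slide_intervals(changes, conference_end=600):
    print(interval)
```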
4.4. Whiteboard
To conclude the overview of the analysed media, we consider a further
medium which is quite commonly deployed in multimedia conferences:
the shared whiteboard. There are several ways of implementing such a
functionality. While some standard solutions exist, they are rarely
used within the context of commercial conferencing applications,
which usually prefer to implement it in a proprietary fashion.
Without delving into a discussion on this aspect, suffice it to say
that for a successful recording of a whiteboard session it is most of
the time enough to just record the individual contributions of each
involved participant (together with the usual timing-related
information). In fact, this would allow for a subsequent replay of
the whiteboard session in an easy way. Unlike audio and video,
whiteboarding is usually a very lightweight medium, and so recording
the individual contributions, rather than the resulting mix as we
suggested in Section 4.1, is advisable. These contributions may
subsequently be mixed together in order to obtain a standard
recording (e.g., a series of images, animations, or even a low
framerate video). An example of recording for this medium is
presented in Figure 7.
                               +-------+
                               | UAC-C |
                               +-------+
                                   "
                          C (XMPP) " 10.11.20: line
                                   "
                                   "
                                   v
+-------+      A (XMPP)      +-----------+      B (XMPP)      +-------+
| UAC-A |===================>| WB server |<===================| UAC-B |
+-------+  10.10.56: circle  +-----------+   10.12.30: text   +-------+
                                   *
                                   *
                                   *
                                   ****> 10.10.56: circle (A)
                                   ****> 10.11.20: line (C)
                                   ****> 10.12.30: text (B)
Figure 7: Recording a whiteboard session
The recording process may be enriched by populating a parallel event
list. For instance, it might include events such as the creation of
a new whiteboard, the clearing of an existing whiteboard, or the
addition of a background image that replaces the previously existing
content. Such events would be valuable in a subsequent playout of
the recorded steps, since they would allow for a more lightweight
replication in case seeking is involved. For instance, if 70
drawings have been done, but at second 560 of the conference the
whiteboard has been cleared and since then only 5 drawings have been
added, a viewer seeking to second 561 would just need the clear
followed by the 5 drawings to be replicated. Further discussion of
the tagging process for this medium is presented in Section 5.2.4.
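The seeking optimization just described can be sketched as below.
This is a minimal illustration under assumptions of our own: the
function name, the event tuple layout and the kind labels ("draw",
"new", "clear", "background") are hypothetical, and the times of the
individual drawings are invented (the 5 later drawings are placed at
seconds 570 to 610, so the example seeks to second 620 to include
them all).

```python
def events_to_replay(events, seek_time):
    """Return the whiteboard events needed to rebuild the state at seek_time.

    `events` is a time-ordered list of (offset_seconds, kind, payload)
    tuples. Kinds "new", "clear" and "background" are assumed to reset the
    whiteboard, so nothing before the latest of them has to be replayed.
    """
    reset_kinds = {"new", "clear", "background"}
    past = [e for e in events if e[0] <= seek_time]
    start = 0
    for index, (_, kind, _) in enumerate(past):
        if kind in reset_kinds:
            start = index
    return past[start:]


# 70 drawings, a clear at second 560, then 5 more drawings.
events = [(i * 8, "draw", f"shape-{i}") for i in range(70)]
events.append((560, "clear", None))
events += [(570 + i * 10, "draw", f"late-{i}") for i in range(5)]

print(len(events_to_replay(events, seek_time=620)))
print(len(events_to_replay(events, seek_time=561)))
```

With these assumed times, a seek to second 620 replays the clear plus
the 5 later drawings, while a seek to second 561 replays the clear
alone; in neither case are the first 70 drawings needed.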
5. Tagging
Once the different media have been recorded and stored, and their
timing related somehow, this information needs to be properly tagged
in order to allow intra-media and inter-media synchronization in case
a playout is invoked. Besides, it would be desirable to make use of
standard means for achieving such a functionality. For these
reasons, we chose to make use of the Synchronized Multimedia
Integration Language [W3C.CR-SMIL3-20080115], which fulfills all the
aforementioned requirements, besides being a well-established W3C
standard. In fact, timing information is very easy to address using
this specification, and VCR-like controls (start, pause, stop,
rewind, fast forward, seek and the like) are all easily deployable in
a player using the format.
The SMIL specification provides means to address different media by
using dedicated elements (e.g., audio, img, textstream and so on),
and for each of these media the related timing can be easily
described. The following subsections will describe how SMIL
metadata could be prepared in order to map to the media recorded as
described in Section 4.
Specifically, considering how a SMIL file is assumed to be
constructed, the head will be described in Section 5.1, while the
body (with different focus for each media) will be presented in
Section 5.2.
5.1. SMIL Head
As specified in [W3C.CR-SMIL3-20080115], a SMIL file is composed of
two separate sections: a head and a body. The head, among other
information, includes details about the allowed layouts for a
multimedia presentation. Considering the number of media that might
have been involved in a single conference, it is worth constructing
this section carefully. In fact, all the involved media need to be
placed so that they do not prevent access to other concurrent media
within the context of the same recording.
For instance, this is how a series of different media might be placed
in a layout according to different screen resolutions:
<?xml version="1.0" encoding="UTF-8"?>
<smil xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <head>
    <switch systemScreenSize="800X600">
      <layout>
        <root-layout width="800" height="600" background-color="black"/>
        <region id="image0" regionName="image" fit="fill" top="310" \
                left="370" width="400" height="350" />
        <region id="video0" regionName="video" top="0" left="370" \
                width="430" height="310" fit="fill" />
        <region id="chat0" regionName="chat" fit="fill" alt="chat" \
                top="410" left="370" width="400" height="-60"/>
        <region id="wb0" regionName="wb" top="0" left="0" width="370" \
                height="520"/>
      </layout>
    </switch>
    <switch systemScreenSize="1024X768">
      <layout>
        <root-layout width="1024" height="768" \
                     background-color="black"/>
        <region id="image1" regionName="image" fit="fill" top="310" \
                left="594" width="400" height="350"/>
        <region id="video1" regionName="video" top="0" left="594" \
                width="430" height="310" fit="fill"/>
        <region id="chat1" regionName="chat" fit="fill" alt="chat" \
                top="578" left="594" width="400" height="108"/>
        <region id="wb1" regionName="wb" top="0" left="0" width="594" \
                height="688"/>
      </layout>
    </switch>
    [..]
That said, it's important that this section of the SMIL file be
constructed properly. In fact, the layout description also contains
explicit region identifiers, which are referred to when describing
media in the body section.
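Since the body refers to the region identifiers declared in the head,
a simple consistency check can catch dangling references before a
recording is published. The sketch below is only an illustration: the
function name and the sample document are hypothetical, the sample
uses no default XML namespace, and a reference is accepted if it
matches either the 'id' or the 'regionName' of a declared region.

```python
import xml.etree.ElementTree as ET


def unknown_regions(smil_text):
    """Return region references in the body that no layout declares."""
    root = ET.fromstring(smil_text)
    declared = set()
    for region in root.iter("region"):
        declared.add(region.get("id"))
        declared.add(region.get("regionName"))
    missing = set()
    body = root.find("body")
    if body is not None:
        for element in body.iter():
            ref = element.get("region")
            if ref is not None and ref not in declared:
                missing.add(ref)
    return sorted(missing)


SAMPLE = """
<smil>
  <head>
    <layout>
      <root-layout width="800" height="600"/>
      <region id="video0" regionName="video" top="0" left="370"
              width="430" height="310"/>
    </layout>
  </head>
  <body>
    <par>
      <video src="http://www.example.com/a.avi" region="video0"/>
      <video src="http://www.example.com/b.avi" region="box16"/>
    </par>
  </body>
</smil>
"""
print(unknown_regions(SAMPLE))
```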
TBD. (?)
5.2. SMIL Body
The SMIL head section described previously is very important as far
as presentation-related settings are concerned, but does not contain
any timing-related information. Such information, in fact, belongs
to a separate section in the SMIL file, the so-called body. This
body contains the information on all the media involved in the
recorded session, and for each medium timing information is
provided. This
timing information includes not only when each medium appears and
when it goes away, but also details on the medium's lifetime. By
correlating the timing information for each medium, a SMIL reader can
infer inter-media synchronization and present the recorded session as
it was conceived to appear.
Besides, the involved media can be grouped in the body in order to
implement sequential and/or parallel playback involving a subset of
the available media. This is made possible by making use of the
<seq> and <par> elements. The <par> element in particular is of
great interest to this document, since in a multimedia conference
many media are presented to participants at the same time.
That said, it is important to be able to separately address each
involved medium. To do so, SMIL makes use of well specified
elements. For instance, a <video> element is used to state the
presence of a video stream in the session. Each of these elements
can be further customized and configured by means of ad-hoc
attributes. For instance, the 'src' attribute in a <video> element
means that the actual video stream source can be found at the
provided address.
The element for each medium is also the place where SMIL adds
information on when the addressed medium comes into play. This is
done by means of two attributes called 'begin' and 'end',
respectively. As the names themselves suggest, the 'begin' attribute
gives a temporal reference for the media start, while the 'end'
attribute specifies when the media ends. For instance, an element
formatted in the following way:
<video src="http://www.example.com/conference45.avi" region="box12" \
begin="15s" end="400s"/>
means that a video stream (whose URL is provided in 'src') must be
played in the session starting 15 seconds after the session
beginning, and that it must end 385 seconds later, that is, 400
seconds after the session beginning. This information is also used
when seeking through a session. For instance, if a user accessing
the recording seeks to 200 seconds after the beginning, the video
will appear as well, at the relative time of 200-15=185 seconds.
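The seek arithmetic above can be written down as a small helper. The
function name is hypothetical, and the sketch assumes that 'begin'
and 'end' have already been parsed into seconds and that the element
is no longer active once its 'end' time is reached.

```python
def media_position(begin, end, seek_time):
    """Map a session seek time to a position inside one media element.

    `begin` and `end` mirror the SMIL attributes, in seconds from the start
    of the session. Returns None when the element is not active at
    seek_time, otherwise the offset from the start of the media itself.
    """
    if seek_time < begin or seek_time >= end:
        return None
    return seek_time - begin


print(media_position(15, 400, 200))   # inside the element's lifetime
print(media_position(15, 400, 10))    # before the element begins
```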
Considering the recorded media presented in Section 4, the
construction of the following sections of the body will be described:
o audio/video streams (in Section 5.2.1);
o text chats (in Section 5.2.2);
o slide presentations (in Section 5.2.3);
o whiteboards (in Section 5.2.4).
5.2.1. Audio/Video
In SMIL, the element to describe an audio stream is <audio>, while
for video the element is <video>. Considering that these two stream
types are handled in a very similar way, only video will be
addressed. This is an explicit choice for two reasons: (i) video is
slightly more complex to address than audio, and so treating video
makes more sense; (ii) often off-line encoders/muxers will place the
recorded elementary audio and video streams in a single video
container, which means both streams can actually be addressed in a
single media file.
That said, <video> is the element used in a SMIL body to state the
presence of an audio/video stream. Its timing, relative to other
media, might be implemented by making use of a <par>/<seq>
aggregator. In such an element, some attributes are of great
relevance and should be included:
o 'src', to address the actual video file to use (usually an HTTP
URL);
o 'begin' and 'end', for timing information (when the video should
appear/disappear in the session);
o 'region', to specify where the stream will need to appear in the
layout as configured in the head (e.g. place it in the region
called box12).
All this information can easily be derived from the stream as
recorded previously (optionally re-encoded and/or re-muxed), together
with the timing information that is part of the event log. The
'src', in particular, can be any video file, which means that an
encoding of the stream for a player is quite trivial to achieve.
Besides, as mentioned in Section 4.1, recordings may be available as
already mixed streams, or individual streams. In case the recording
is already mixed, then the tagging can be done as seen in the
previous paragraph:
<video src="http://www.example.com/conference45.avi" region="box12" \
begin="15s" end="400s"/>
where this element would state the presence of an audio/video stream,
to appear in the specified region in the specified range of time. In
case several recordings are available, instead, the tagging would be
a little more complex: in fact, the metadata would need to address
the parallel playback of the different recordings, which would also
need to reflect the actual lifetime of the original streams in the
conference. For instance, if UAC A joined the conference well before
UAC B, its contributions would appear in the playout accordingly. An
example of how this could be achieved in SMIL metadata is presented
here:
<par>
[..]
<video src="http://www.example.com/userA.avi" region="box12" \
begin="15s" end="400s"/>
<video src="http://www.example.com/userB.avi" region="box16" \
begin="230s" end="521s"/>
[..]
</par>
These lines tell an interested player that the two specified video
streams (whose URLs are provided in the respective 'src' attributes)
must be played in parallel, and in different regions. However, video
stream 'userA.avi' starts 15 seconds after the beginning of the
conference, while 'userB.avi' starts after 230 seconds, reflecting
the appearance of these media in the conference itself.
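The tagging described above can be sketched in a few lines of Python.
This is only an illustrative sketch, not part of the framework: the
build_par() helper and the layout of the per-stream records are
assumptions, and the offsets are assumed to be already expressed in
seconds since the beginning of the conference.

```python
import xml.etree.ElementTree as ET


def build_par(streams):
    """Return a <par> element holding one <video> child per recording."""
    par = ET.Element("par")
    for stream in streams:
        ET.SubElement(par, "video", {
            "src": stream["src"],
            "region": stream["region"],
            # Offsets are seconds since the beginning of the conference.
            "begin": f"{stream['begin']}s",
            "end": f"{stream['end']}s",
        })
    return par


# Hypothetical per-stream records, as they might be derived from the
# event log.
streams = [
    {"src": "http://www.example.com/userA.avi", "region": "box12",
     "begin": 15, "end": 400},
    {"src": "http://www.example.com/userB.avi", "region": "box16",
     "begin": 230, "end": 521},
]

par = build_par(streams)
print(ET.tostring(par, encoding="unicode"))
```

The same helper covers the already-mixed case, which simply yields a
<par> with a single <video> child.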
5.2.2. Chat
Text in SMIL can be addressed in several different ways, the most
common ones being <text> and <textstream> elements. <text>, however,
usually deals only with static text content, that is text without
timing information (e.g. HTML). For this reason, <textstream>
should be used instead, since it allows text to appear and disappear
in real-time.
The attributes to configure the element are basically the same as the
ones presented for <video> (src, region, begin, end). The difference
lies in the file referred to by the 'src' attribute. In fact, if
timing information is needed, a proper format for timed text is
required. The <textstream> element supports RealText Markup, which
is a separate markup language for dealing with real-time text. It is
the format used, for instance, for subtitle captioning. An example
of RealText is presented in the following lines:
<window width="340" height="160" wordwrap="true" loop="false" \
bgcolor="white">
<font color="black" face="Arial" size="+0">
<Time begin="0:00:02.2"/><br/><User C>Hi
<Time begin="0:00:04.5"/><br/><User A>Hey C
<Time begin="0:00:08.1"/><br/><User B>Hey man
[..]
This example recalls Figure 5, where the first message (by User C)
was sent at 10.11.24. Assuming the text conference started at
10.11.22, the log is converted to RealText and tagged accordingly
(e.g. User C sending the first message two seconds after the
conference started). The RealText file can then be addressed in SMIL
using the aforementioned <textstream> element:
<par>
[..]
<textstream src="http://example.com/chats/conf45.rt" region="chat" \
begin="0s" end="500s"/>
[..]
</par>
Once the requirement on the file format is established, the next step
is obvious: whatever format the chat in the conference was recorded
in, it needs to be converted to a RealText file in order to have it
addressed in the resulting SMIL file. The conversion is usually
trivial, considering that chat logs often carry the same information
needed in a RealText file and differ only in the presentation
format.
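A minimal sketch of such a conversion is given below, in Python. The
log layout (timestamp, user, text), the "HH.MM.SS" timestamp format
and the to_realtext() helper are assumptions made for illustration
only; the second and third timestamps are hypothetical. Unlike the
hand-written example above, the sketch escapes the angle brackets
around the user label, on the assumption that the player would
otherwise try to read it as a tag.

```python
from datetime import datetime
from html import escape

CLOCK = "%H.%M.%S"  # assumed log timestamp format, e.g. "10.11.24"


def to_realtext(conference_start, entries):
    """Convert (timestamp, user, text) chat entries to a RealText string."""
    t0 = datetime.strptime(conference_start, CLOCK)
    lines = [
        '<window width="340" height="160" wordwrap="true" loop="false" '
        'bgcolor="white">',
        '<font color="black" face="Arial" size="+0">',
    ]
    for stamp, user, text in entries:
        offset = int((datetime.strptime(stamp, CLOCK) - t0).total_seconds())
        hours, rest = divmod(offset, 3600)
        minutes, secs = divmod(rest, 60)
        # Escape the angle brackets so the user label is not read as a tag.
        label = escape(f"<{user}>")
        lines.append(
            f'<Time begin="{hours}:{minutes:02d}:{secs:02d}.0"/><br/>'
            f'{label}{escape(text)}'
        )
    lines += ["</font>", "</window>"]
    return "\n".join(lines)


chat_log = [
    ("10.11.24", "User C", "Hi"),
    ("10.11.26", "User A", "Hey C"),    # hypothetical timestamp
    ("10.11.30", "User B", "Hey man"),  # hypothetical timestamp
]
realtext = to_realtext("10.11.22", chat_log)
print(realtext)
```

The resulting string would be saved to a file (e.g. conf45.rt) and
referenced from the 'src' attribute of the <textstream> element.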
5.2.3. Slides
The easiest way to deal with a slideshow and/or a shared slide
presentation is to make use of the <img> element. In fact, as
anticipated in Section 4.3, slides in a presentation are most often
composed of static content, and can be treated as images. This
means that addressing a complete presentation in a SMIL file can be
achieved by following these steps:
1. preparing a list of images reflecting the original presentations
(e.g. 10 images for 10 slides, or more if any animation was
involved);
2. preparing the timing-related information (e.g. when slide 1
appeared, and when it was replaced by slide 2);
3. placing a series of <img> elements in the SMIL metadata to
address the first two steps.
An example of this, recalling the scenario depicted in Figure 6, is
presented here:
<par>
[..]
<img src="http://www.example.com/f44gf/1.jpg" region="image" \
begin="0s" end="10s"/>
<img src="http://www.example.com/f44gf/2.jpg" region="image" \
begin="10s" end="18s"/>
<img src="http://www.example.com/f44gf/3.jpg" region="image" \
begin="18s" end="30s"/>
[..]
</par>
The slideshow would usually be a sequence, and so a <seq> would seem
the more apt way to address the presentation sharing. Nevertheless,
timing information is very important, and it is quite likely that
several additional media will flow in parallel with the slides (e.g.
the video stream which includes the presenter talking). That is why
a <par> element is used instead; the example omits the other media
involved for brevity.
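The three steps above can be sketched as follows, in Python. This is
an illustrative sketch under stated assumptions: the slides_to_imgs()
helper and the (image URL, appearance offset) records are
hypothetical, and each slide is assumed to stay on screen until the
next one appears.

```python
import xml.etree.ElementTree as ET


def slides_to_imgs(changes, end_of_show, region="image"):
    """Build one <img> element per slide from (src, begin) records.

    Each slide ends when the next one begins; the last slide ends at
    end_of_show. All times are seconds since the conference started.
    """
    imgs = []
    for index, (src, begin) in enumerate(changes):
        if index + 1 < len(changes):
            end = changes[index + 1][1]
        else:
            end = end_of_show
        imgs.append(ET.Element("img", {
            "src": src,
            "region": region,
            "begin": f"{begin}s",
            "end": f"{end}s",
        }))
    return imgs


# Slide changes matching the example above.
changes = [
    ("http://www.example.com/f44gf/1.jpg", 0),
    ("http://www.example.com/f44gf/2.jpg", 10),
    ("http://www.example.com/f44gf/3.jpg", 18),
]
imgs = slides_to_imgs(changes, end_of_show=30)
for img in imgs:
    print(ET.tostring(img, encoding="unicode"))
```

The resulting elements would then be appended to the same <par> that
holds the other media of the conference.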
5.2.4. Whiteboard
As anticipated in Section 4.4, no standard solution is usually
deployed for whiteboarding in a conferencing system. For this
reason, the recording process suggested in Section 4.4 is just a
timing-aware dump of all the interactions that occurred at the
whiteboard level. These interactions might subsequently be converted
to a more common format such as a video or an image slideshow. In
the case of a video, the same considerations as in Section 5.2.1
would apply, since the whiteboard recording would actually be a video
itself. If it is converted to a slideshow, the tagging process would
occur as explained in Section 5.2.3.
However, SMIL also allows for custom, non-standard media to be
involved in its metadata. This can be achieved by means of the
standard element <ref>, which is a generic media reference. This
element allows for the description and addressing of non-standard
media (or at least media the chosen SMIL specification is not aware
of), which could be implemented in a custom player. This means that,
if a whiteboard has been recorded in a proprietary way, and this
format needs, for one reason or another, to be preserved, the <ref>
element may be used to address it: in fact, the same attributes
previously introduced (including 'src' and the others) are available
to this element as well. Of course, if this approach is used, only a
player able to understand the proprietary media extension would be
able to replay the recorded whiteboard session. To make the player
aware of the format employed, a 'type' attribute could be added as
well.
An example of how the recorded whiteboard might be addressed is
provided here:
<par>
[..]
<ref src="http://example.com/wb/wb12.txt" region="wb" \
type="myFormat"/>
[..]
</par>
6. Playout
Once the SMIL metadata has been properly prepared, a playout of the
recorded conference is not difficult to achieve. In fact, an
interested user just needs a SMIL-aware player supporting the file
formats involved, namely: (i) audio/video; (ii) images; (iii)
RealText; (iv) the whiteboarding session, in whatever format it has
been recorded. Considering the standard nature of SMIL and of almost
all the media involved, the session is likely to be easily accessible
to many existing players. In any case, the 'type' attribute of each
involved medium can be used to check whether that medium is
supported.
Additional information provided in the SMIL head (e.g. the <switch>
elements and the <layout> they suggest) provides guidance for players
in presenting the addressed media in the expected way.
The sequence of steps an interested user needs to follow in order to
access a recorded conference session can be summarized as follows:
o the user retrieves the SMIL file associated with the conference
she/he is interested in (e.g. by means of HTTP or other out-of-band
mechanisms);
o the SMIL file is passed to a compliant media player (which could
have been the means to get the SMIL file in the first place);
o the player parses the SMIL file and checks if all the media are
supported; apart from explicitly non-standard media (e.g.
whiteboard) the player might check if the involved media files are
encoded in a format it supports (e.g. a video file encoded in
H.264/MP3);
o the player prepares the presentation screen; it makes use of the
information in the <head> in order to choose the right layout; the
choice may be automatic (e.g. according to the screen resolution)
or guided by the user;
o the player starts retrieving each involved media file; it may
either retrieve each file in its entirety, or start downloading
and begin the playout almost immediately (e.g. by buffering); it
also listens for user-generated events, like the user
pausing/resuming the playout, or seeking to a specific time in the
conference; if any of these events occurs, it takes the related
action (e.g. seeking to the right time for each medium in the
conference, taking the timing information from the SMIL file as
well).
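The parsing, support check and seek handling in the steps above can
be sketched as follows, in Python. This is a simplified illustration
and not a description of any actual player: the SUPPORTED set, the
load_media() and active_at() helpers and the inline SMIL sample are
assumptions. The sketch also assumes a SMIL document without XML
namespaces and clock values of the plain "Ns" form only.

```python
import xml.etree.ElementTree as ET

# Hypothetical SMIL body, assembled from the examples in Section 5.2.
SMIL = """<smil><body><par>
<video src="http://www.example.com/userA.avi" region="box12"
       begin="15s" end="400s"/>
<video src="http://www.example.com/userB.avi" region="box16"
       begin="230s" end="521s"/>
<textstream src="http://example.com/chats/conf45.rt" region="chat"
            begin="0s" end="500s"/>
<ref src="http://example.com/wb/wb12.txt" region="wb" type="myFormat"/>
</par></body></smil>"""

# Hypothetical capabilities of the player: element names it can render.
SUPPORTED = {"video", "img", "textstream"}


def seconds(value, default):
    """Parse a clock value such as "15s"; fall back to default."""
    if value and value.endswith("s"):
        return float(value[:-1])
    return default


def load_media(smil_text):
    """Return one record per media element (any element with 'src')."""
    media = []
    for element in ET.fromstring(smil_text).iter():
        if "src" in element.attrib:
            media.append({
                "src": element.get("src"),
                "begin": seconds(element.get("begin"), 0.0),
                "end": seconds(element.get("end"), float("inf")),
                "supported": element.tag in SUPPORTED,
            })
    return media


def active_at(media, t):
    """Return (src, offset within the medium) for media playing at t."""
    return [(m["src"], t - m["begin"]) for m in media
            if m["supported"] and m["begin"] <= t < m["end"]]


media = load_media(SMIL)
unsupported = [m["src"] for m in media if not m["supported"]]
print("unsupported:", unsupported)
print("at 100s:", active_at(media, 100.0))
```

When the user seeks to a given conference time, active_at() tells
which media files must be playing and at which offset within each
file the playout should resume.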
A general overview of the scenario can be seen in Figure 8.
+------+ 1. START +----------+ +----------+
| User |------------>| User |------------------------->| Sessions |
| |<------------| (player) | 2. get conf45.smil | database |
+------+ 6. SHOW +----------+ +----------+
| | |
| | |
| | | 3. get audios and videos +-----------+
| | +---------------------------->| WebServer |
| | | (video) |
| | 4. get RealText files +-----------+
| +------------------------------->| (text) |
| 5. get slide images +-----------+
+---------------------------------->| (images) |
+-----------+
Figure 8: Retrieving and playing a recorded conference session
In this deliberately simplified scenario, an interested viewer
chooses to start viewing a previously recorded conference. She/he
knows the address of the recorded session
(http://example.com/conf45.smil) and passes it to her/his player
(1.). Starting the playout triggers the retrieval of the SMIL
description (2.), which may be achieved by
means of HTTP or any other protocol. Once the player has access to
the description, it starts retrieving the individual media resources
addressed there (video in 3., chat in 4., slides in 5.), and,
according to the implementation of the player, it either waits for
all the downloads to complete or just buffers a little while before
starting the presentation to the user (6.).
7. Security Considerations
TBD.
8. Acknowledgements
The authors would like to thank...
9. References
[RFC2234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an
IANA Considerations Section in RFCs", BCP 26, RFC 2434,
October 1998.
[RFC3261] Rosenberg, J., Schulzrinne, H., Camarillo, G., Johnston,
A., Peterson, J., Sparks, R., Handley, M., and E.
Schooler, "SIP: Session Initiation Protocol", RFC 3261,
June 2002.
[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
Jacobson, "RTP: A Transport Protocol for Real-Time
Applications", STD 64, RFC 3550, July 2003.
[RFC5567] Melanchuk, T., "An Architectural Framework for Media
Server Control", RFC 5567, June 2009.
[RFC6230] Boulton, C., Melanchuk, T., and S. McGlashan, "Media
Control Channel Framework", RFC 6230, May 2011.
[I-D.ietf-mediactrl-mixer-control-package]
McGlashan, S., Melanchuk, T., and C. Boulton, "A Mixer
Control Package for the Media Control Channel Framework",
draft-ietf-mediactrl-mixer-control-package-14 (work in
progress), January 2011.
[I-D.ietf-mediactrl-call-flows]
Amirante, A., Castaldi, T., Miniero, L., and S. Romano,
"Media Control Channel Framework (CFW) Call Flow
Examples", draft-ietf-mediactrl-call-flows-10 (work in
progress), November 2012.
[RFC5239] Barnes, M., Boulton, C., and O. Levin, "A Framework for
Centralized Conferencing", RFC 5239, June 2008.
[RFC4582] Camarillo, G., Ott, J., and K. Drage, "The Binary Floor
Control Protocol (BFCP)", RFC 4582, November 2006.
[W3C.CR-SMIL3-20080115]
Bulterman, D., "Synchronized Multimedia Integration
Language (SMIL 3.0)", World Wide Web Consortium CR CR-
SMIL3-20080115, January 2008,
<http://www.w3.org/TR/2008/CR-SMIL3-20080115>.
[RFC3920] Saint-Andre, P., Ed., "Extensible Messaging and Presence
Protocol (XMPP): Core", RFC 3920, October 2004.
[RFC4975] Campbell, B., Mahy, R., and C. Jennings, "The Message
Session Relay Protocol (MSRP)", RFC 4975, September 2007.
Authors' Addresses
Alessandro Amirante
University of Napoli
Via Claudio 21
Napoli 80125
Italy
Email: alessandro.amirante@unina.it
Tobia Castaldi
Meetecho
Via Carlo Poerio 89
Napoli 80100
Italy
Email: tcastaldi@meetecho.com
Lorenzo Miniero
Meetecho
Via Carlo Poerio 89
Napoli 80100
Italy
Email: lorenzo@meetecho.com
Simon Pietro Romano
University of Napoli
Via Claudio 21
Napoli 80125
Italy
Email: spromano@unina.it