Audio/Video Transport Working Group                              G. Hunt
Internet-Draft                                                  P. Arden
Intended status: Informational                                        BT
Expires: January 8, 2009                                   July 07, 2008


Monitoring Architectures for RTP
draft-hunt-avt-monarch-00.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as “work in progress.”

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on January 8, 2009.

Abstract

This memo is intended to stimulate discussion on a hierarchical monitoring architecture for RTP, including a scheme for the definition of lower-layer metrics which are usable by a range of applications. Systematic investigation of a monitoring architecture for RTP/RTCP was requested at the IETF71 (Philadelphia) AVT session.

This first version of the draft is restricted to transport metrics and to a subset of audio application metrics, but it is envisaged that future work should extend this to other applications, principally video.



Table of Contents

1.  Requirements notation
2.  Introduction
3.  Transport layer metrics
    3.1.  Option 1 - Monitoring every packet
    3.2.  Option 2 - Real-time histogram methods
    3.3.  Option 3 - Monitoring by exception
    3.4.  Option 4 - Application-specific monitoring
4.  RTP terminal metrics
5.  Application layer metrics
    5.1.  Requirements for speech quality monitoring metrics
    5.2.  The audio hierarchy
    5.3.  Individual network transport and terminal parameters affecting speech quality
    5.4.  Composite objective speech quality metrics
6.  Choosing transport protocols for metrics
    6.1.  RTCP as a transport for metrics - advantages and disadvantages
        6.1.1.  Advantages of RTCP
        6.1.2.  Disadvantages of RTCP
7.  IANA Considerations
8.  Security Considerations
9.  Acknowledgments
10.  Informative References
§  Authors' Addresses
§  Intellectual Property and Copyright Statements





1.  Requirements notation

This memo is informative and as such contains no normative requirements.




2.  Introduction

The development of multiple metrics for transport and application quality monitoring has been identified as a potential problem for RTP/RTCP interoperability. The AVT group has requested work on an architectural framework for monitoring which recognises that different applications layered on RTP may have some monitoring requirements in common, which should be satisfied by a common design. When this work was initiated, the objective was to design a framework and a small number of re-usable metrics at each appropriate layer to reduce implementation costs and to maximise inter-operability. Since then, work-in-progress on [GUIDELINES] (Ott, J., “Guidelines for Extending the RTP Control Protocol (RTCP),” June 2008.) has stated that RTCP should be used primarily to provide information to peer RTP systems, whilst information used for network management should be carried by out-of-band protocols. By implication, AVT should not work on metrics or their transport in RTCP unless they are motivated by RTP-system-to-RTP-system requirements. However, metrics supporting network and service management are still required for RTP and the applications transported over it, to support many significant real-world deployments.

Service providers may wish to answer some or all of the following:

Metrics of transport performance and application performance, considered either on an isolated per-session basis or as a collection of metrics for multiple sessions using a common network component, can answer or contribute to answers to some or all of these questions.

One example which might lead to a shared metric arises from a shared requirement for monitoring of packet transport, which might be useful for every media type (audio, video, text, messaging) carried over RTP.

Another example is the set of applications all of which transmit audio, including streaming audio speech, streaming music, two-party conversational speech, and audio conferencing. This set of applications might be able to share a suitably defined set of audio metrics, e.g. for parameters such as noise floor, mean level, or amplitude clipping. The subset of interactive speech applications may be able to use common additional metrics related to interactivity (e.g. media delay and echo) which are not applicable to all audio applications. Some or all of these audio metrics may be applicable to the audio channel(s) of a video application, such as IP TV or conversational video.

[Editor's note: need to add a video-based view and examples]

Metrics of RTP transport performance usually relate to single packet network segments, whilst metrics of application performance are more likely to represent the end-to-end connection which may include transmission over non-packet networks and/or over multiple packet networks. Access to, and integration of, multiple sets of packet transport metrics relevant to a single connection typically present difficulties in current networks.

Metrics are typically measured in an RTP system but may be required at another RTP system or at a non-RTP system. Hence transport of metrics is often required. Metrics might be transported alongside RTP media using the extensibility mechanism defined in [RFC3611] (Friedman, T., “RTP Control Protocol Extended Reports (RTCP XR),” November 2003.) but this is not an input requirement. Other methods may be used if RTCP XR blocks are not suitable or another method offers significant technical advantages. Following the work-in-progress in [GUIDELINES] (Ott, J., “Guidelines for Extending the RTP Control Protocol (RTCP),” June 2008.) which restricts the usage of RTCP, the method for transporting metrics need not be RTCP and should be chosen independently of the metrics themselves. If the transport is not by RTCP, it is likely that multiple transport mechanisms should be permitted, and probably should not be restricted by AVT.

For transport metrics, IETF and other SDOs have defined metrics. There is a wide choice of potentially useful metrics. Some metrics may embed arbitrary design choices, or be application-specific. It is a goal of this work to find generic and re-usable metrics. This may result in a preference for some of the existing metrics over others, or in the definition of alternative metrics meeting the architectural goals of this work.

For metrics at layers higher than transport, metrics are developed by a variety of external SDOs, e.g. by ITU-T for voice telephony applications.

The development of application metrics is an active field. Any framework should be extensible to accommodate useful innovations when there is a consensus for their adoption.

It is obviously desirable to achieve some consensus (the more, the better) on a set of useful metrics (the fewer, the better) which may be widely implemented, widely inter-operated, and widely understood. Large data sets of raw measurements must be condensed into a smaller set of metrics or statistics before any agent (human or machine) can make decisions based on them. It has been suggested that AVT might remain "metric-neutral" by storing and transporting raw measurement data, rather than the condensed metrics (see Option 1 below). Even if data volumes are sufficiently small to make this feasible, some layer must perform the condensation and hence commit to specific metrics.

A four-step process is suggested. The AVT community may wish to contribute to some of these steps.

  1. Choose a set of metrics which is useful for each application.
  2. Classify each member of the sets of metrics according to the architectural layer which they monitor, creating sets of per-application, per-layer metrics.
  3. Define a set of required metrics at each layer as the union of the application-specific sets in each layer. This should include the selection of only one from any group of metrics with overlapping or nearly-overlapping capabilities, leading to agreed sets of per-layer metrics. All of these metrics should be available within the architecture, but each application may select a subset which meets its needs. Most RTP end systems and RTP mixers implement only a subset of possible RTP applications, and clearly these devices need not implement any metric which is relevant only to applications which they do not support.
  4. Choose one or more transport protocols for those cases where metrics are measured at one location but must be available at another, e.g. to cause a reaction in an RTP system's peer, or for network or service management purposes.
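Steps 2 and 3 can be illustrated with a minimal sketch. The application names, metric names and layer assignments below are hypothetical and chosen only for illustration; the selection of one metric from each group of overlapping metrics (part of step 3) is not modelled.

```python
from collections import defaultdict

# Hypothetical metric names and layer assignments, for illustration only.
APP_METRICS = {
    "voip": {
        "packet_loss": "transport",
        "delay_variation": "transport",
        "late_packets": "terminal",
        "echo_return_loss": "application",
    },
    "streaming_audio": {
        "packet_loss": "transport",
        "late_packets": "terminal",
        "noise_level": "application",
    },
}


def per_layer_union(app_metrics):
    """Steps 2 and 3: classify by layer, then take the union across applications."""
    layers = defaultdict(set)
    for metrics in app_metrics.values():
        for name, layer in metrics.items():
            layers[layer].add(name)
    return dict(layers)


def required_by(device_apps, app_metrics):
    """Metrics a device needs, given only the applications it supports."""
    return set().union(*(app_metrics[app].keys() for app in device_apps))


print(per_layer_union(APP_METRICS))
print(required_by(["streaming_audio"], APP_METRICS))
```

In this sketch a device supporting only streaming audio need not implement the echo or delay variation metrics, which matches the observation in step 3 about devices that implement a subset of applications.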

The fourth step seems at first sight to be of secondary importance ("We've chosen our metrics, now all we have to do is to transport them") but the choice of transport protocols may be tightly constrained, for example because the measuring point has limited performance and/or limited access bandwidth and/or is in a different trust domain.

Section 3 (Transport layer metrics) describes some options for metrics of transport performance. This includes an initial quantitative investigation of the feasibility of becoming "metric-neutral" by sending raw measurement data rather than condensed metrics.

Section 5 (Application layer metrics) starts the process of describing requirements for application-layer monitoring and the metrics frameworks available to meet them. In this first version of the draft, the description is limited to interactive speech and takes most of its material from the work of ITU-T.

Section 6 (Choosing transport protocols for metrics) discusses the choice of transport protocols, including discussion of the merits of RTCP which remains a candidate protocol.




3.  Transport layer metrics

The objective is to provide a set of metrics which characterise the three transport impairments of packet loss, packet delay, and packet delay variation. These metrics should be usable by any application which uses RTP transport.




3.1.  Option 1 - Monitoring every packet

Most transport metrics, almost by definition, condense a large amount of information about packet arrivals into a small number of statistics. Usually, the aim of the statistics is to present key features of any transport impairments in ways which are readily understood by the operators of the network, with the minimum of distracting additional information. Unfortunately there are multiple ways to condense data about packet arrivals, and the "key features" (those impairments which result in degraded application performance) are likely to be application-dependent. Given this, it is not surprising that there are no known provably optimal metrics for the three transport impairments. There are instead multiple heuristic metrics.

The aim of "monitoring every packet" is to ensure that the information reported is not dependent on the application. In this scheme, RTP systems will report arrival data for each individual RTP packet. RTP (or other) systems receiving this "raw" data may use it to calculate any preferred heuristic metrics, but such calculations and the reporting of the results (e.g. to a session control layer or a management layer) are outside the scope of RTP and RTCP.

Run-length encoding (RLE) is a well-known technique for compressing per-packet information about packet loss. The efficiency of RLE compression is reduced as the packet loss fraction increases, so the volume of metrics data becomes unpredictable.
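The dependence of RLE efficiency on the loss pattern can be seen in a small sketch. The three loss patterns are invented for illustration, and the encoding shown is a generic run-length scheme rather than the specific block format of any RTCP report.

```python
from itertools import groupby


def rle(received):
    """Run-length encode a sequence of booleans (True = packet received)."""
    return [(flag, sum(1 for _ in run)) for flag, run in groupby(received)]


# 250 packets, i.e. one 5 s interval at 20 ms packetisation (illustrative).
no_loss = [True] * 250
one_burst = [True] * 100 + [False] * 5 + [True] * 145
scattered = [i % 10 != 0 for i in range(250)]  # every 10th packet lost

for name, pattern in [("no loss", no_loss),
                      ("one burst", one_burst),
                      ("scattered", scattered)]:
    print(name, len(rle(pattern)), "runs")
```

The number of runs, and hence the size of the encoded report, grows from 1 to 3 to 50 across these patterns even though the number of packets observed is the same.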

If packet round-trip delay is measured using the technique described in [RFC3550] (Schulzrinne, H., “RTP: A Transport Protocol for Real-Time Applications,” July 2003.) section 6.4.1 and Figure 2, the rate of measurement is low (at most one measurement per RTCP measurement cycle) and the volume of data involved in reporting the result is insignificant.
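As a reminder of how small this measurement is, the round-trip computation of RFC 3550 section 6.4.1 can be sketched as follows. The function name is an assumption of this sketch; the input values are those of the worked example in RFC 3550 Figure 2.

```python
def rtt_seconds(arrival, lsr, dlsr):
    """Round-trip time from one RTCP reception report block.

    All inputs are 32-bit values in units of 1/65536 s: 'arrival' and 'lsr'
    are the middle 32 bits of NTP timestamps, 'dlsr' is the delay since the
    last sender report as stated by the reporting receiver.
    """
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF  # wrap-safe 32-bit arithmetic
    return rtt_units / 65536.0


# A = 0xb710:8000, LSR = 0xb705:2000, DLSR = 0x0005:4000
print(rtt_seconds(0xB7108000, 0xB7052000, 0x00054000))  # 6.125
```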

There are no obvious techniques for substantial compression of data related to the arrival times of individual packets, but such data is needed to compute packet delay variation. Hence it appears that an item of data must be sent per packet, if packet delay variation is to be calculated from "raw" data.

The following calculation estimates the volume of data needed to send per-packet data, assuming a simple logarithmic scheme to code the delay variation.

Consider the raw delay variation metric D(1,j) using the notation of [RFC3550] (Schulzrinne, H., “RTP: A Transport Protocol for Real-Time Applications,” July 2003.) section 6.4.1. If delay variation, relative to that of the first packet of the connection, is measured in RTP timestamp units, delay could be coded on a compressed "logarithmic" scale similar to G.711 A-law, which can code with a resolution of 1 unit on the uncompressed chord, and resolutions 2, 4, 8, 16, 32, 64 on each successive more compressed chord to give a range of +/- 2048. This would correspond to +/- 2048/8000s ~ +/- 250ms for 8kHz sampled speech (enough to cover jitter), whilst using 1 byte per packet. Modifications would be needed for other sampling rates. It might be necessary to standardise a timing unit resolution independent of the sampling clock. Specific reserved values could be used to indicate that an expected packet did not arrive.
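A minimal sketch of such a one-byte code is shown below. It assumes an A-law-like layout of one sign bit, three chord bits and four step bits, with two chords at resolution 1 forming the "uncompressed" region; the choice of 0x80 as the reserved "packet did not arrive" value is hypothetical. Inputs are assumed to be integers in RTP timestamp units.

```python
CHORD_BASE = [0, 16, 32, 64, 128, 256, 512, 1024]  # lower edge of each chord
CHORD_STEP = [1, 1, 2, 4, 8, 16, 32, 64]           # resolution within each chord
LOST = 0x80  # hypothetical reserved code ("minus zero"): packet did not arrive


def encode(dv):
    """Code an integer delay variation (RTP timestamp units) into one byte."""
    sign = 0x80 if dv < 0 else 0x00
    mag = min(abs(dv), 2047)  # clamp to the +/- 2048 range
    chord = max(i for i in range(8) if CHORD_BASE[i] <= mag)
    step = (mag - CHORD_BASE[chord]) // CHORD_STEP[chord]
    # A negative integer has magnitude >= 1, so 0x80 never arises here.
    return sign | (chord << 4) | step


def decode(code):
    """Return the lower edge of the quantisation interval, or None for LOST."""
    if code == LOST:
        return None
    chord, step = (code >> 4) & 0x07, code & 0x0F
    mag = CHORD_BASE[chord] + step * CHORD_STEP[chord]
    return -mag if code & 0x80 else mag


for dv in (5, -20, 100, 1000, 5000):
    print(dv, "->", hex(encode(dv)), "->", decode(encode(dv)))
```

Small variations are reproduced exactly, whilst larger ones are quantised more coarsely (1000 decodes to 992) and values beyond the range saturate.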

To estimate data volume, consider a low-bandwidth codec like G.729 with 20ms packetisation. Over a 5s RTCP cycle there will be 250 media packets and 102 bytes/packet (20ms G.729 in RTP/UDP/IP/Ethernet including preamble and Inter-Frame Gap) for a total media layer-2 bandwidth of 25500 bytes/5s (about 40kbit/s). 1 byte per received packet is 250 bytes "raw data" and an overhead of 82 bytes (RTP/UDP/IP/Ethernet, same basis) - say 350 bytes total including some identification (SSRCs etc). This is a fraction 350/25500~1.4% which is within RTP guidelines for RTCP bandwidth. The corresponding calculation for G.711 with 10ms packetisation is 81000 bytes/5s media and a 600-byte "raw transport report" or 0.75%.
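The arithmetic above can be reproduced with a short sketch. The payload sizes (20 bytes for 20 ms of G.729 at 8 kbit/s, 80 bytes for 10 ms of G.711 at 64 kbit/s) follow from the codec rates; the 18-byte identification allowance is an assumption chosen to match the rounded totals quoted in the text.

```python
def raw_report_fraction(payload_bytes, packet_ms, cycle_s=5, overhead=82, ident=18):
    """Return (packets, media bytes, report bytes, report/media fraction).

    overhead: RTP/UDP/IP/Ethernet bytes per packet, as used in the text.
    ident:    assumed allowance for SSRCs and similar identification.
    """
    packets = cycle_s * 1000 // packet_ms
    media_bytes = packets * (payload_bytes + overhead)
    report_bytes = packets * 1 + overhead + ident  # 1 byte of raw data per packet
    return packets, media_bytes, report_bytes, report_bytes / media_bytes


for name, payload, ms in [("G.729, 20 ms", 20, 20), ("G.711, 10 ms", 80, 10)]:
    packets, media, report, fraction = raw_report_fraction(payload, ms)
    print(f"{name}: {packets} packets, {media} media bytes, "
          f"{report} report bytes, {fraction:.2%}")
```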

However, the use of D(i,j) [RFC3550] (Schulzrinne, H., “RTP: A Transport Protocol for Real-Time Applications,” July 2003.) for estimation of packet delay variation relies on a fixed relationship in the source RTP system between the RTP timestamp and the transmission time of the packet onto the wire. This fixed relationship is not guaranteed even for audio coding and is almost certainly significantly wrong for many video formats, where the RTP timestamp indicates the sampling instant of a frame which may be encoded into multiple packets sent at significantly different times throughout a frame interval. It could be argued that the current RTP framework provides no means for reliable estimation of packet delay variation in general, despite the usefulness of the D(i,j) metric for simple audio streams. This could lead to a conclusion that an RTP-based measure of packet delay variation is not re-usable across RTP applications other than simple VoIP codecs.
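The effect of sender pacing on D(i,j) can be shown with a small sketch using the RFC 3550 definition D(i,j) = (Rj - Ri) - (Sj - Si). The network delay, clock rates and pacing interval are invented for illustration; the network delay is held constant so that the true packet delay variation is zero.

```python
def delay_variation(rtp_ts, arrival, i, j):
    """D(i,j) = (Rj - Ri) - (Sj - Si), all in RTP timestamp units."""
    return (arrival[j] - arrival[i]) - (rtp_ts[j] - rtp_ts[i])


NETWORK_DELAY = 450  # constant, so the network adds no delay variation

# Audio: one packet per 160-tick frame, sent at its sampling instant.
audio_ts = [0, 160, 320]
audio_arrival = [t + NETWORK_DELAY for t in audio_ts]

# Video: one frame (RTP timestamp 0) split into three packets which the
# sender paces 1000 ticks apart across the frame interval.
video_ts = [0, 0, 0]
video_sent = [0, 1000, 2000]
video_arrival = [t + NETWORK_DELAY for t in video_sent]

print([delay_variation(audio_ts, audio_arrival, 0, j) for j in (1, 2)])
print([delay_variation(video_ts, video_arrival, 0, j) for j in (1, 2)])
```

The audio stream yields zero in both cases, whilst the video stream yields 1000 and 2000 ticks of apparent delay variation which is caused entirely by the sender.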

Logically, digital signal processors (DSPs) would be used to calculate metrics, including the per-packet data described above. Current advice is that an additional 600 bytes per channel is needed to store measurement results before periodic transmission, so this option increases the per-channel memory requirement on infrastructure devices. Because memory in currently deployed infrastructure gateways is sized for optimum performance, cost and power, adding this measurement function would reduce channel density, which ultimately impacts cost and power. Including additional memory in future designs has the same cost and power impacts.

The principle that RTP systems should send per-packet reception report data, and correspondingly that the RTP (or other) system receiving this report data should calculate the metrics of its choice from this data, results in a requirement for computation both at the RTP system which sends the per-packet report and at the RTP (or other) system which receives the report. If DSPs are used to perform this computation in the system which receives the report, there is a further demand on the memory of the DSP devices involved. If general-purpose computing devices are used, then the cost of these devices may be significant. For example, for a 16000 channel trunk media gateway implementing the scheme above and using 10ms packetisation, the gateway must code or decode a total of 3200000 bytes of data per second.
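The gateway figure can be checked with a one-line calculation. The factor of two is an assumption of this sketch: it is taken to represent coding of outgoing reports plus decoding of incoming reports for every channel.

```python
def gateway_bytes_per_second(channels, packet_ms, bytes_per_packet=1, directions=2):
    """Raw per-packet report data handled by a gateway each second.

    directions=2 is an assumption: reports are both coded (sent) and
    decoded (received) for every channel.
    """
    packets_per_second = 1000 // packet_ms
    return channels * packets_per_second * bytes_per_packet * directions


print(gateway_bytes_per_second(16000, 10))  # 3200000
```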

Note that this general method of supplying raw data from the RTP system is the only one which gives the system which receives the data the flexibility to calculate any chosen transport metric for upward reporting. All other methods below either omit or condense data, such that the RTP (or other) system receiving the report is informed only about certain aspects of the transport performance which was measured at the remote RTP system. However, the method does not report the effect on the far-end application of the impairments to the outgoing transport. For example, it provides no information about far-end jitter buffer events or late packets deemed lost by the application. This is considered further in Section 4 (RTP terminal metrics) below.




3.2.  Option 2 - Real-time histogram methods

There are several potentially useful metrics which rely on the accumulation of a histogram in real time, so that a packet arrival results in a counter being incremented rather than in the creation of a new data item. These metrics may be gathered with a low and predictable storage requirement. Each counter corresponds to a single class interval or "bin" of the histogram. Examples of metrics which may be accumulated in this way include the observed distribution of packet delay variation, and the number of packets lost per unit time interval.

Different networks may have very different expected and achieved levels of performance, but it may be useful to fix the number of class intervals in the reported histogram to give a predictable volume of data. This can be achieved by starting with small class intervals ("bin widths") and automatically increasing the width (e.g. by factors of two) if outliers are seen beyond the current upper limit of the histogram. Data already accumulated may be assigned unambiguously to the new set of bins, given some simple conditions on the relationship between the old and new origins and bin widths.
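A minimal sketch of such a self-scaling histogram follows. It assumes non-negative samples, a fixed origin at zero, an even number of bins and doubling of the bin width; under these conditions each pair of adjacent old bins maps exactly onto one new bin. The class and method names are illustrative.

```python
class AutoHistogram:
    """Fixed number of bins; bin width doubles when an outlier arrives."""

    def __init__(self, bins=8, width=1):
        if bins % 2:
            raise ValueError("an even number of bins is assumed")
        self.counts = [0] * bins
        self.width = width  # bin k covers [k*width, (k+1)*width)

    def add(self, value):
        # Widen the bins until the value fits below the upper limit.
        while value >= self.width * len(self.counts):
            self._double()
        self.counts[int(value // self.width)] += 1

    def _double(self):
        n = len(self.counts)
        # Old bins 2k and 2k+1 together cover exactly new bin k.
        merged = [self.counts[k] + self.counts[k + 1] for k in range(0, n, 2)]
        self.counts = merged + [0] * (n - len(merged))
        self.width *= 2


h = AutoHistogram(bins=4, width=1)
for sample in (0, 1, 1, 3, 5, 20):
    h.add(sample)
print(h.width, h.counts)
```

Storage stays constant at one counter per bin regardless of the number of packets observed or the size of the largest outlier.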

A significant disadvantage of the histogram method is the loss of any information about time-domain correlations between the samples which build the histogram. For example, a histogram of packet delay variation provides no indication of whether successive samples of packet delay variation were uncorrelated, or alternatively that the packet delay variation showed a highly-correlated low-frequency wander.
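The point can be illustrated with two invented sequences which produce identical histograms but have opposite time-domain behaviour. The lag-1 autocorrelation used here is only one possible indicator of correlation and is an assumption of this sketch.

```python
from collections import Counter


def lag1_autocorr(x):
    """Lag-1 autocorrelation coefficient of a sequence."""
    mean = sum(x) / len(x)
    num = sum((a - mean) * (b - mean) for a, b in zip(x, x[1:]))
    den = sum((a - mean) ** 2 for a in x)
    return num / den


wander = [0] * 50 + [10] * 50   # slow, highly-correlated wander
alternating = [0, 10] * 50      # rapid sample-to-sample alternation

print(Counter(wander) == Counter(alternating))  # same histogram
print(lag1_autocorr(wander), lag1_autocorr(alternating))
```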




3.3.  Option 3 - Monitoring by exception

An entity which both monitors the packet stream, and has sufficient knowledge of the application to know when transport impairments may have degraded the application's performance, may choose to send exception reports containing details of the transport impairments to a receiving system. The crossing of a transport impairment threshold, or some application-layer event, would trigger such reports. RTP end systems and mixers are likely to contain application implementations which may, in principle, identify this type of exception.

It is likely that RTP translators will not contain suitable implementations which could identify such exceptions.

On-path devices such as routers and switches are not likely to be aware of RTP at all. Even if they are aware of RTP, they are unlikely to be aware of the RTP-level performance required by specific applications, and hence they are unlikely to be able to identify the level of impairment at which exceptional transport conditions may start to affect application performance.

This type of monitoring typically requires the storage of recent data in a FIFO (e.g. a circular buffer) so that data relevant to the period just before and just after the exception may be reported. It is not usually helpful to report transport data only from the period following an exception event detected by an application. This imposes some storage requirement (though less than needed for Option 1). It also implies the existence of additional cross-layer primitives or APIs to trigger the transport layer to generate and send its exception report. Such a capability might be considered architecturally undesirable, in that it complicates one or more interfaces above the RTP layer.
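The storage pattern described above can be sketched with a bounded FIFO. The class name, the `trigger` hook and the window sizes are hypothetical; `trigger` stands in for the cross-layer primitive by which an application signals an exception. At least one sample after the exception is assumed.

```python
from collections import deque


class ExceptionRecorder:
    """Keep recent samples so a report can span an exception event."""

    def __init__(self, before=3, after=2):
        self.history = deque(maxlen=before)  # circular buffer of recent samples
        self.after = after                   # samples to add after the event (>= 1)
        self.pending = None                  # report still being completed
        self.remaining = 0
        self.reports = []

    def sample(self, value):
        """Record one transport measurement."""
        if self.pending is not None:
            self.pending.append(value)
            self.remaining -= 1
            if self.remaining == 0:
                self.reports.append(self.pending)
                self.pending = None
        self.history.append(value)

    def trigger(self):
        """Hypothetical cross-layer hook called when an exception is detected."""
        if self.pending is None:
            self.pending = list(self.history)
            self.remaining = self.after


rec = ExceptionRecorder(before=3, after=2)
for v in (1, 2, 3, 4, 5):
    rec.sample(v)
rec.trigger()
for v in (6, 7, 8):
    rec.sample(v)
print(rec.reports)  # [[3, 4, 5, 6, 7]]
```

The storage requirement is bounded by the two window sizes, which is the sense in which it is smaller than that of Option 1.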




3.4.  Option 4 - Application-specific monitoring

This is a business-as-usual option which suggests that the current approach should not be changed, based on the idea that previous application-specific approaches such as that of [RFC3611] (Friedman, T., “RTP Control Protocol Extended Reports (RTCP XR),” November 2003.) were valid. If a large category of RTP applications (such as VoIP) has a requirement for a unique set of transport metrics, arising from its different requirements of the transport, then it seems reasonable for each application category to define its preferred set of metrics to describe transport impairments. We expect that there will be few such categories, probably fewer than 10.

It may be easier to achieve interworking for a well-defined set of application-specific metrics than it would be in the case that applications select a profile from a palette of many independent re-usable metrics.




4.  RTP terminal metrics

By "RTP terminal metrics" we mean metrics relating to the way a terminal deals with transport impairments affecting the incident RTP stream. These may include de-jitter buffering, packet loss concealment, and the use of redundant streams (if any) for correction of error or loss.

An example of such a metric is a count of packets arriving too late to be played out at current de-jitter buffer settings.
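A minimal sketch of such a late-packet count is given below. It assumes a fixed (non-adaptive) de-jitter buffer whose playout schedule is anchored on the first packet, and arrival times expressed in the same units as RTP timestamps; both are simplifications for illustration.

```python
def count_late(packets, playout_delay):
    """Count packets arriving after their scheduled playout time.

    packets: list of (rtp_timestamp, arrival_time) in the same clock units.
    The playout schedule is anchored on the first packet (an assumption).
    """
    if not packets:
        return 0
    ts0, arrival0 = packets[0]
    late = 0
    for ts, arrival in packets:
        playout_time = arrival0 + playout_delay + (ts - ts0)
        if arrival > playout_time:
            late += 1
    return late


trace = [(0, 100), (160, 265), (320, 500), (480, 590)]
print(count_late(trace, playout_delay=40))  # 1
print(count_late(trace, playout_delay=80))  # 0
```

The same trace produces a different count at different buffer settings, which is why the metric belongs to the terminal rather than to the transport.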




5.  Application layer metrics




5.1.  Requirements for speech quality monitoring metrics

RTP transport can be used for different application types such as IP (including public internet) and non-IP. It can also apply to different user group sizes running over networks ranging in size from a small closed user group through an enterprise system to national and international networks. Engineering judgment is required to choose the most suitable set of speech quality monitoring metrics for the type of application and the size of the network the application is running on. Some metrics are more suitable for monitoring service level agreements (SLAs), others may be required for regular routine monitoring, and still others may be required for fault diagnosis. The resolution of the metrics may also be different for different types of monitoring. These considerations make it difficult to propose a "one size fits all" set of metrics. However some general points can be made and it is also useful to propose a minimum set of metrics.

Mean Opinion Score (MOS) speech quality metrics such as MOS-LQO for listening quality and MOS-CQO for conversation quality (see Section 5.4 (Composite objective speech quality metrics) for further discussion of MOS metrics) are useful for measuring end-to-end speech quality. However, they typically require significant time and processing power to produce a result, and some MOS-LQO test methods require test calls that consume bandwidth. This rules out MOS metrics for frequent large-scale monitoring. Also, methods for measuring conversational MOS are not yet mature enough for VoIP monitoring applications, even though many vendors are using an E-model [G.107] (ITU-T, “Recommendation G.107. The E-model, a computational model for use in transmission planning.,” March 2005.) approach in the absence of anything else. This leaves only MOS-LQO as an overall composite speech quality metric and, being a listening-only metric, it does not take account of interactive effects such as fixed delay and echo. Nevertheless, MOS-LQO is often used for SLAs and usually provides a better estimate of what a user actually experiences than a single network or terminal metric or a group of such metrics. A poor MOS score by itself, however, gives little indication of the cause of a problem, and further metrics are required for diagnostic purposes.

A proposed minimum set of metrics with suggested resolutions is as follows:



   Metric                             Resolution   Range
   --------------------------------   ----------   ------------
   MOS-LQO                            0.1 MOS      1 to 5
   Received speech level              0.1 dB       -60 to +10
   Received noise level               0.5 dB       -130 to +10
   Echo return loss                   0.1 dB       6 to 40
   Round trip delay                   1 ms         1 ms to 65 s
   Packet delay variation or jitter   1 ms         1 ms to 65 s
   Packet loss                        1 packet     0 to 2^24

                               Table 1
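One way to see what the resolutions and ranges of Table 1 imply for a reporting format is to express each value as a whole number of resolution steps above the bottom of its range. The sketch below does this; the dictionary keys and function names are hypothetical, and no particular wire format is implied.

```python
# (resolution, minimum, maximum) taken from Table 1; delays are in ms.
TABLE_1 = {
    "mos_lqo": (0.1, 1, 5),
    "received_speech_level_db": (0.1, -60, 10),
    "received_noise_level_db": (0.5, -130, 10),
    "echo_return_loss_db": (0.1, 6, 40),
    "round_trip_delay_ms": (1, 1, 65000),
    "packet_delay_variation_ms": (1, 1, 65000),
    "packet_loss_packets": (1, 0, 2 ** 24),
}


def to_steps(metric, value):
    """Clamp to the metric's range, then count resolution steps above the minimum."""
    resolution, low, high = TABLE_1[metric]
    clamped = min(max(value, low), high)
    return round((clamped - low) / resolution)


def from_steps(metric, steps):
    """Inverse of to_steps (up to the metric's resolution)."""
    resolution, low, _ = TABLE_1[metric]
    return low + steps * resolution


for metric, value in [("mos_lqo", 4.3),
                      ("received_noise_level_db", -200),
                      ("round_trip_delay_ms", 150)]:
    print(metric, value, "->", to_steps(metric, value))
```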

[Editor's note: More detail required here in a future draft to add information about meaningful measurement durations and whether measurements should include mean and peak values etc. Also require some discussion around "second level" metrics such as jitter buffer parameters for diagnosis of more complicated problems.]

Note that some voiceband data applications running over the same transport network as voice applications may require much lower values of packet loss and packet delay variation than would be required for voice applications alone.

A reporting system for these metrics should be capable of accommodating intermediate network and terminal parameters as well as end-to-end quality metrics for both monitoring and diagnostic purposes.

This minimum set of metrics should allow a wide range of problems to be diagnosed particularly if metrics are available at intermediate points in the network as well as at the endpoints. Echo return loss and delay can be used to establish whether echo is a problem (which would not affect the MOS-LQO score as this is a listening only measurement). Poor MOS-LQO scores could be caused by several factors, but individual measures of packet loss, jitter and noise levels could be used to establish the presence or absence of these degradations. Finally, the level of received speech gives an indication of whether the operating point is correct and whether possible distortion or poor signal-to-noise are causing problems.

The codec type will often be known, and this can be very useful for diagnostic purposes if, for example, the codec's typical MOS scores and susceptibility to packet loss are known. Knowledge of network topology is also very useful and can, for example, give an indication of possible bandwidth bottlenecks.




5.2.  The audio hierarchy

The audio hierarchy can be broadly split into listening (one-way) and conversation (two-way, or multi-way conferencing) applications. These categories can be further split as shown in Figure 1 (The audio hierarchy). In addition, ITU-T has defined a number of bandwidth categories: narrowband (300 to 3400 Hz), wideband (50 to 7000 Hz), super wideband (50 to 14000 Hz) and full band (20 to 20000 Hz).



                      Audio
                        |
           --------------------------
           |                        |
       Listening              Conversation
           |                        |
     -------------            -------------
     |           |            |           |
Streaming  Non-streaming   Two-way  Conferencing
                                          |
                                    -------------
                                    |           |
                               Non-spatial   Spatial

                   Figure 1: The audio hierarchy

The following sections concentrate on one-way (listening only) and two-way (conversational) telephony applications, for which several composite speech quality metrics exist in ITU-T Recommendations. Similar considerations could apply to other applications such as conferencing and this should be addressed in further drafts. Suitable metrics for spatial conferencing are more difficult to derive at this stage since the technology is still relatively new.




5.3.  Individual network transport and terminal parameters affecting speech quality

Parameters affecting both listening and conversation quality include:

Listening levels that are either too quiet or too loud can be unpleasant and make communication difficult.

High noise levels can make listening difficult and in a conversation, high background noise levels may cause a speaker to raise their voice level so that they can hear themselves above the noise.

Certain types of signal distortion such as amplitude clipping can be very unpleasant.

Syllable clipping occurs when the speech at the start or end of a syllable is missing and can cause words to be misunderstood.

Voice activity detection is used to sense periods of voice inactivity and then transmit them as silence periods to reduce bandwidth. Artificial noise (comfort noise) is then injected on the receiving side of a connection to mask the silence caused by the voice activity detection. Without the comfort noise injection the listener might think that the connection had died. However, the contrast between comfort noise and transmitted background noise may be unpleasant for the listener if the comfort noise has not been well matched to the background noise.

Packet delay variation caused by the underlying transport has to be "smoothed out" by using a jitter buffer to temporarily store received speech and then play it back at a uniform rate. Jitter buffers that are too short or have been incorrectly implemented may cause packet loss, or "stuttering" of speech, and jitter buffer delays that are too long unduly add to the overall delay of a connection. For speech or music applications (not data), adaptive jitter buffers that reduce delay as much as possible whilst minimising the risk of packet loss are preferable. However, buffer length adaptations must be carefully managed to ensure they are inaudible. This is usually achieved by ensuring that such adaptations occur during silence intervals.

Finally, packet loss causes temporary loss of the signal, which may become unintelligible as a result.

In addition, a good conversational experience requires interactivity between parties which in turn requires low delay, low echo applications. So some additional parameters affecting conversation quality can be listed as follows:

Long delays affect interactivity and can cause one party to think that the other party is being "very slow" in answering. In extreme cases, very long delays can be very confusing and can cause one party to talk over the other party. The only way round this problem is for the conversation to become half duplex where each party takes it in turns to speak, and each makes it clear when they have finished speaking.

Echo can either be caused by electrical reflections at a 2-wire to 4-wire converter or by acoustic or mechanical transmission paths between microphone and earphone. The latter effect is known as terminal coupling loss. Talker echo causes the speaker to hear an echo of their own voice and can be very confusing. Listener echo is generally less common and occurs when the listener hears an echo of the speaker's voice. Short echo delays cause the signal to sound hollow or slightly reverberant, whilst longer echo delays cause a distinct echo or echoes.

Echo cancellers are used to minimise echo, but can cause other problems if not carefully designed. For example, periods of double-talk where both parties are speaking at the same time may cause the canceller to diverge and produce echo.

Sidetone is local feedback from the speaker's microphone to their earpiece, which lets them know that the connection is still "live". Without this feedback, the connection would sound "dead", which would be confusing. The level, frequency response and distortion of the sidetone can all affect the user's experience.




5.4.  Composite objective speech quality metrics

In addition to the individual "network" or "terminal" metrics described in the previous section, there are several composite speech quality metrics for objectively measuring end-to-end overall speech quality, based on a 5-point scale defined as follows:

Where

A measurement using the scale just described results in a Mean Opinion Score (MOS), which represents the mean of several opinions obtained from a subjective test. Mean opinion score terminology is defined in [P.800.1].
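The calculation of a MOS from individual opinions can be sketched as below. This is a minimal illustration only: the function name is hypothetical, and the ratings are assumed to be values from 1 to 5 on the 5-point scale.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Return the mean of individual opinion scores on the 5-point scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    # Assumption for this sketch: each rating lies between 1 and 5.
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 5-point scale 1..5")
    return mean(ratings)
```

For example, four listeners scoring a sample 4, 5, 3 and 4 would give a MOS of 4.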

The composite speech quality metrics are useful for commissioning and for Service Level Agreements (SLAs) but, as previously discussed, additional diagnostic information is required when these metrics fall below threshold values.

Composite objective speech quality metrics can be divided into listening quality (MOS-LQO) and conversational quality (MOS-CQO). The ITU-T has produced several recommendations for measuring these composite speech quality metrics: [P.561], [P.562], [P.563], [P.564], [P.862], [P.862.1] and [P.862.2]. A hierarchy of the various ITU speech quality test methods is shown in Figure 2.




             Objective speech quality test methods
                              |
                   -----------------------
                  |                       |
              Listening             Conversation
                  |                       |
         -----------------                |
        |                 |               |
   Intrusive        Non-intrusive        INMD
  Double-ended      Single-ended     P.561,P.562
        |                 |               |
        |            -----------          |
      PESQ         P.563      P.564     P.CQO
P.862, P.862.1   Estimate    Estimate   under
    P.862.2        based     based on   development
  WB extension   on speech   IP n/work
        |         payload    parameters
        |
     P.OLQA
     Under
  Development

 Figure 2: Hierarchy of ITU Speech quality test methods 

Double-ended test methods (P.862/P.862.1/P.862.2) rely on a reference signal that is injected at one end of the network and then captured at the other end of the network. The reference and degraded signals are compared by means of an auditory transform that models the human hearing system, to produce the final MOS value. In contrast, single-ended systems do not require a reference signal and rely solely on the speech payload (e.g. P.563) or on IP network parameters (e.g. P.564). P.563 measures several individual characteristics of the received speech signal and then combines the results to form a MOS-LQO, which has been verified against subjectively scored degraded speech files. P.564 uses several IP network parameters and permitted RTCP-XR data, again to produce a MOS-LQO. In general, double-ended methods are more accurate because they have a reference signal against which to compare the degraded signal.

P.561 describes an In-service Non-intrusive Measurement Device (INMD) for making in-service measurements of several voice and network parameters, which can then be used to produce a conversational mean opinion score as described in P.562. The term "in-service" means that the measurements are made during real customer calls. However, the algorithm in P.562 was originally intended for TDM rather than IP applications, and therefore can only be applied to situations where the impact of IP impairments is negligible.

In addition to the recommendations already mentioned, there is also a planning tool called the E-Model, described in another ITU-T recommendation [G.107]. This was not designed for monitoring applications, but has unfortunately been misused for this purpose by several vendors.
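For readers unfamiliar with the E-Model, the following sketch shows only its final step, the conversion of a transmission rating factor R into an estimated MOS as given in G.107. It is included to illustrate the planning calculation, not as a monitoring method; the function name is hypothetical, and the computation of R itself from the planning parameters is not shown.

```python
def r_factor_to_mos(r):
    """Map an E-Model rating factor R to an estimated MOS (G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    # Cubic mapping applied for 0 < R < 100.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
```

With the G.107 default rating of R = 93.2, this mapping gives an estimated MOS of approximately 4.41.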

Another objective measurement tool, Perceptual Evaluation of Audio Quality (PEAQ), is described in an ITU-R Recommendation [BS.1387]. PEAQ has generally been optimised for the assessment of music signals rather than speech, and is applicable to high-quality coded audio systems such as those used by broadcasters.

The listening quality methods already mentioned (P.862/P.862.1/P.862.2, P.563 and P.564) all produce MOS-LQO values as their primary outputs, and require as input either speech or, in the case of P.564, individual network parameters. Each can be used at intermediate points or end-points of the network, provided that appropriate interfaces are available. Except in the case of P.564, these methods either require computational power at the measurement point, or the speech file has to be captured and sent to a server for processing. In the latter case, the speech file is too large for transport by RTCP. By contrast, a P.564 MOS-LQO calculation relies only on packet header information and permitted information from RTCP-XR, i.e. relatively lightweight data.

P.561/P.562 is the only ITU conversational monitoring method (although P.CQO is under development) and it requires the following parameters to be measured:

And at least one of

Class D INMDs [P.561] for IP applications are required to implement the following functions:

and are required to measure packet delay variation and IP packet loss ratio.

P.562 uses these input parameters to calculate a MOS-CQO score. However, as already mentioned, the algorithm is at present suitable only for situations where the impact of IP impairments is negligible.




6.  Choosing transport protocols for metrics

Metrics related to RTP sessions are measured by RTP systems, but may be conveyed by any convenient transport mechanism "horizontally" to other RTP systems or "northbound" to session control or management systems, e.g. RTCP XR [RFC3611], SNMP [RFC3410], SIP [RFC3261] headers or attachments, or TR-069 mechanisms [DSLF-TR-069].




6.1.  RTCP as a transport for metrics - advantages and disadvantages

RTCP XR remains at least a candidate transport protocol for metrics, though note that [GUIDELINES] states explicitly that "The amount of information going into RTCP reports should primarily target the peer (and thus include information that can be meaningfully reacted upon). Gathering and reporting statistics beyond this is not an RTCP task and should be addressed by out-of-band protocols".

If RTCP is used, AVT need define only a generic means to transport arbitrary payloads. Such a means is already available in the form of RTCP XR block types [RFC3611]. If the data is self-describing, e.g. based on ASN.1 [X.680] or XML [XML], or if usage is standardised in profiles, it would be possible to transmit many different collections of data whilst using only a small number of codepoints from the limited namespace of XR report block types. As a minimum, only one XR block type codepoint need be allocated per standards development organisation (SDO), with delegation to the SDO to manage a namespace defined by a type field in the payload. The measurements of round-trip delay and packet loss could still use the established mechanisms from RFC 3550.
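One possible shape for such a generic block is sketched below, using the XR report block header of RFC 3611 (block type, type-specific octet, and block length in 32-bit words minus one). Everything else is an assumption made for illustration: the block type value is a placeholder rather than an assigned codepoint, the 16-bit SDO-managed metric type field is hypothetical, and the use of the type-specific octet to carry the padding length is simply one convenient choice.

```python
import struct

# Placeholder block type for this sketch; not an IANA-assigned codepoint.
HYPOTHETICAL_SDO_BLOCK_TYPE = 200


def pack_sdo_xr_block(metric_type, payload,
                      block_type=HYPOTHETICAL_SDO_BLOCK_TYPE):
    """Pack one XR block whose contents begin with an SDO-managed type field."""
    body = struct.pack("!H", metric_type) + payload
    # Pad the block contents to a multiple of 32 bits.
    pad_len = (-len(body)) % 4
    body += b"\x00" * pad_len
    # RFC 3611: length of the block including header, in 32-bit words, minus one.
    block_length = len(body) // 4
    # Assumption: the type-specific octet records the number of padding octets.
    header = struct.pack("!BBH", block_type, pad_len, block_length)
    return header + body


def unpack_sdo_xr_block(data):
    """Return (block_type, metric_type, payload) from a packed block."""
    block_type, pad_len, block_length = struct.unpack("!BBH", data[:4])
    body = data[4:4 + 4 * block_length]
    if pad_len:
        body = body[:-pad_len]
    (metric_type,) = struct.unpack("!H", body[:2])
    return block_type, metric_type, body[2:]
```

With this layout a receiver that does not recognise the metric type can still skip the block using the block length, which is the property that allows the namespace below the block type to be delegated.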

This approach is analogous to the definition of codec payload formats for RTP. A specification could define how metrics payloads are carried in RTCP, and how SDP (including offer/answer) is used to request an RTP system to send a metrics payload. The approach decouples the RTCP base protocol (transport format, routing, and transmission rate rules, and RTCP's base metrics) from less generic use cases.




6.1.1.  Advantages of RTCP

RTCP uses the same transport as the RTP media path, and hence if media can be transmitted it is likely that RTCP can also be transmitted. For connections not using [RTPRTCPMUX], however, this is subject to possible difficulties with NAT and firewall devices, which may sometimes not open a port for RTCP.

RTCP uses the same transport as the RTP media path, so it will normally experience the same transport performance as the RTP media packets. Firstly, this allows an RTCP-based mechanism to make a representative measurement of round-trip delay. Secondly, if QoS mechanisms such as expedited forwarding (EF) have been implemented in support of the RTP media traffic, the transport is likely to be low-delay and possibly also low-loss compared with a best-effort class.
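The round-trip delay measurement referred to here can be sketched using the LSR (last SR timestamp) and DLSR (delay since last SR) fields of an RTCP reception report, as described in RFC 3550. The function name is hypothetical; all three inputs are assumed to be 32-bit values in units of 1/65536 seconds, i.e. the middle 32 bits of an NTP timestamp.

```python
def rtcp_round_trip_seconds(arrival, lsr, dlsr):
    """Return the round-trip time in seconds from reception report fields.

    arrival: time the reception report was received by the original sender
    lsr:     LSR field copied from the reception report
    dlsr:    DLSR field copied from the reception report
    """
    # Mask to 32 bits so that the subtraction survives timestamp wrap-around.
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF
    return rtt_units / 65536.0
```

Using the worked example from RFC 3550 (arrival 0xB7108000, LSR 0xB7052000, DLSR 0x00054000), the result is 6.125 seconds.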

Existing transport devices (for example, SBCs, BGWs and NATs) have often been implemented to allow RTCP to transit transparently on the next higher UDP port. Such devices are unlikely to pass another protocol for the transport of metrics without modification, which would make it harder to introduce any non-RTCP protocol for the transport of metrics.




6.1.2.  Disadvantages of RTCP

RTCP is usually carried over an unreliable UDP/IP transport. Any monitoring scheme using RTCP as its transport must be designed to tolerate message loss and duplication.
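One way a receiver of metrics might tolerate loss and duplication is sketched below. It assumes that each metrics report carries a 16-bit sequence number; this field is hypothetical and is not part of RTCP itself. The class and attribute names are likewise illustrative.

```python
class ReportTracker:
    """Tracks metrics reports carrying a hypothetical 16-bit sequence number."""

    def __init__(self):
        self.last_seq = None
        self.missed = 0      # reports inferred lost from gaps in the sequence
        self.discarded = 0   # duplicate or late reports that were ignored

    def accept(self, seq):
        """Return True if the report is new and should be processed."""
        if self.last_seq is None:
            self.last_seq = seq
            return True
        # Modulo-65536 distance from the last accepted report.
        delta = (seq - self.last_seq) & 0xFFFF
        if delta == 0 or delta >= 0x8000:
            # Same as, or older than, the last accepted report.
            self.discarded += 1
            return False
        self.missed += delta - 1
        self.last_seq = seq
        return True
```

A scheme of this kind works best if the reported metrics are cumulative, so that a later report supersedes any that were lost rather than leaving a gap in the data.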

Bandwidth for the transport of RTCP may be limited. [RFC3550] recommends that RTCP traffic be limited to 5% of the session bandwidth. Even without this limitation, the volume of traffic which is allowed access to EF queues may be policed, such that a large proportion of RTCP traffic might result in high loss for both the RTCP traffic and the RTP media.
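The effect of the 5% figure on the volume of metrics that can be carried can be estimated with simple arithmetic, sketched below. The function and parameter names are hypothetical, and the result is the budget for all RTCP traffic in the session, which must be shared between participants and with the standard sender and receiver reports.

```python
def rtcp_budget_bytes(session_bandwidth_bps, interval_s, rtcp_percent=5):
    """Return the bytes of RTCP traffic available per interval for the session."""
    # bits per second * share * seconds, converted from bits to bytes
    return session_bandwidth_bps * rtcp_percent * interval_s / (100 * 8)
```

For example, a 64 kbit/s session reporting every 5 seconds has a total RTCP budget of 2000 bytes per interval, which indicates why large metrics payloads such as captured speech files are not suited to RTCP.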




7.  IANA Considerations

None.




8.  Security Considerations

This document contains no normative text and hence should not give rise to any new security considerations (to be confirmed).

[Editor's note - should this section consider security merits/demerits of proposals for alternative protocols to RTCP?]




9.  Acknowledgments

This document was originally motivated by ideas from Colin Perkins. The authors would like to thank Graeme Gibbs at BT, and Debbie Greenstreet and her TI colleagues for their review comments.




10.  Informative References

[BS.1387] ITU-R, “Recommendation BS.1387. Method for objective measurements of perceived audio quality,” November 2001.
[DSLF-TR-069] DSL Forum, “TR-069 CPE WAN Management Protocol v1.1,” December 2007.
[G.107] ITU-T, “Recommendation G.107. The E-model, a computational model for use in transmission planning,” March 2005.
[GUIDELINES] Ott, J., “Guidelines for Extending the RTP Control Protocol (RTCP),” ID draft-ott-avt-rtcp-guidelines-01, June 2008.
[P.561] ITU-T, “Recommendation P.561. In-service non-intrusive measurement device - Voice service measurements,” July 2002.
[P.562] ITU-T, “Recommendation P.562. Analysis and interpretation of INMD voice-service measurements,” May 2004.
[P.563] ITU-T, “Recommendation P.563. Single-ended method for objective speech quality assessment in narrow-band telephony applications,” May 2004.
[P.564] ITU-T, “Recommendation P.564. Conformance testing for narrowband voice over IP transmission quality assessment models,” November 2007.
[P.800.1] ITU-T, “Recommendation P.800.1. Mean Opinion Score (MOS) terminology,” July 2006.
[P.862] ITU-T, “Recommendation P.862. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” February 2001.
[P.862.1] ITU-T, “Recommendation P.862.1. Mapping function for transforming P.862 raw result scores to MOS-LQO,” November 2003.
[P.862.2] ITU-T, “Recommendation P.862.2. Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs,” November 2007.
[RFC3261] Rosenberg, J., “SIP: Session Initiation Protocol,” RFC 3261, June 2002.
[RFC3410] Case, J., “Introduction and Applicability Statements for Internet Standard Management Framework,” RFC 3410, December 2002.
[RFC3550] Schulzrinne, H., “RTP: A Transport Protocol for Real-Time Applications,” RFC 3550, July 2003.
[RFC3611] Friedman, T., “RTP Control Protocol Extended Reports (RTCP XR),” RFC 3611, November 2003.
[RTPRTCPMUX] Perkins, C., “Multiplexing RTP Data and Control Packets on a Single Port,” ID draft-ietf-avt-rtp-and-rtcp-mux-07, August 2007.
[X.680] ITU-T, “Recommendation X.680. Abstract Syntax Notation One (ASN.1): Specification of basic notation,” July 2002.
[XML] W3C, “Extensible Markup Language (XML) 1.0 (Fourth Edition),” September 2006.



Authors' Addresses

  Geoff Hunt
  BT
  Orion 1 PP9
  Adastral Park
  Martlesham Heath
  Ipswich, Suffolk IP5 3RE
  United Kingdom
Phone:  +44 1473 608325
Email:  geoff.hunt@bt.com
  
  Philip Arden
  BT
  Orion 3/7 PP4
  Adastral Park
  Martlesham Heath
  Ipswich, Suffolk IP5 3RE
  United Kingdom
Phone:  +44 1473 644192
Email:  philip.arden@bt.com



Full Copyright Statement

Intellectual Property