NETVC (Internet Video Codec)                                      Y. Cho
Internet-Draft                                      Mozilla Corporation
Intended status: Informational                          October 31, 2016
Expires: May 4, 2017
Applying PVQ Outside Daala
draft-cho-netvc-applypvq-02
This document describes the use of Perceptual Vector Quantization (PVQ) outside of the Daala video codec, where PVQ was originally developed. It discusses the issues that arise when integrating PVQ into a traditional video codec, using AV1 as an example.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 4, 2017.
Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Perceptual Vector Quantization (PVQ) [Perceptual-VQ][I-D.valin-netvc-pvq] has been proposed as a quantization and coefficient coding tool for an internet video codec. PVQ was originally developed for the Daala video codec <https://xiph.org/daala/> [PVQ-demo], which performs gain-shape coding of transform coefficients instead of the more traditional scalar quantization. (The original expansion of the abbreviation PVQ, "Pyramid Vector Quantizer", as in [I-D.valin-netvc-pvq], is now commonly given as "Perceptual Vector Quantization".)
The most distinguishing idea of PVQ is the way it references a predictor. With PVQ, we do not subtract the predictor from the input to produce a residual, which is then transformed and coded. Instead, both the predictor and the input are transformed into the frequency domain. PVQ then applies a reflection to both the predictor and the input such that the prediction vector lies on one of the coordinate axes, and codes the angle between them. By not subtracting the predictor from the input, the gain of the predictor can be preserved and is explicitly coded, which is one of the benefits of PVQ. Since DC is not quantized by PVQ, the gain can be viewed as the amount of contrast in an image, which is an important perceptual parameter.
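Concretely, the decomposition can be sketched as follows, loosely following the formulation in [Perceptual-VQ] (the exact sign conventions and the coding of the angle are detailed in that paper). Each band x is split into a gain and a shape,

   g = ||x||_2,        u = x / g,

and the (transformed) prediction r is aligned with a coordinate axis by a Householder reflection,

   H = I - 2 (v v^T) / (v^T v),    v = r / ||r||_2 + s e_m,

where e_m is the axis of the largest component of r and s = sgn(r_m). The angle between the reflected input H x and that axis is then quantized and coded.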
Also, an input block of transform coefficients is split into frequency bands based on their spatial orientation and scale. Then, each band is quantized by PVQ separately. The 'gain' of a band indicates the amount of contrast in the corresponding orientation and scale. It is simply the L2 norm of the band. The gain is non-linearly companded and then scalar quantized and coded. The remaining information in the band, the 'shape', is then defined as a point on the surface of a unit hypersphere.
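As an illustration, the per-band split described above can be sketched in C as follows. This is illustrative only; the function name, signature, and exact companding are assumptions, not the actual AV1/Daala code. Here 'beta' is the companding exponent (beta == 1 disables companding; activity masking uses beta > 1) and 'q' is the gain quantization step.

   #include <math.h>

   /* Sketch of the per-band gain/shape split: returns the companded,
    * scalar-quantized gain and writes the unit-norm shape vector. */
   int band_gain_shape(const double *band, int n, double beta, double q,
                       double *shape) {
     double gain = 0.0;
     int i;
     for (i = 0; i < n; i++) gain += band[i] * band[i];
     gain = sqrt(gain);              /* gain: L2 norm of the band */
     for (i = 0; i < n; i++)         /* shape: point on unit hypersphere */
       shape[i] = (gain > 0.0) ? band[i] / gain : 0.0;
     /* Non-linear companding of the gain, then scalar quantization. */
     return (int)lround(pow(gain, 1.0 / beta) / q);
   }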
Another benefit of PVQ is activity masking based on the gain, which automatically controls the quantization resolution based on the image contrast without any signaling. For example, for a smooth image area (i.e. low contrast and thus low gain), the quantization resolution increases, so fewer visible quantization errors appear. A succinct summary of the benefits of PVQ can be found in Section 2.4 of [Terriberry_16].
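This automatic resolution change can be seen directly from the companding. If the companded gain g^(1/beta) is quantized with a uniform step size q (a simplification; Daala's actual companding has more detail), the effective step in the gain domain is approximately

   delta_g ~= beta * q * g^(1 - 1/beta),

which shrinks as g decreases whenever beta > 1. Low-gain (smooth, low-contrast) regions therefore receive finer gain quantization, and high-gain regions coarser quantization, with no extra signaling.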
Since PVQ has only been used in the Daala video codec, which contains many non-traditional design elements, there has not been any chance to see the relative coding performance of PVQ compared to scalar quantization in a more traditional codec design. We have tried to apply PVQ in the AV1 video codec, which is currently being developed by the Alliance for Open Media (AOM) as an open-source and royalty-free video codec. While most of the benefits of using PVQ arise from the improvement of subjective video quality, compression results with activity masking enabled are not yet available in this draft because the required parameters, which were tuned for Daala, have not yet been adjusted to AV1. The results presented here were therefore obtained optimizing solely for PSNR.
Adopting PVQ in AV1 requires replacing both the scalar quantization step and the coefficient coding of AV1 with those of PVQ. In terms of the inputs to PVQ and the usage of transforms, the biggest conceptual change required in a traditional coding system, such as AV1, is shown in Figure 1 and Figure 2: instead of transforming and coding a residual, both the input and the predictor are transformed, and both are fed to the PVQ quantizer.
input X --> +-------------+                  +-------------+
            | Subtraction |---> residue ---> | Transform T |
predictor ->+-------------+     signal R     +-------------+
P  |                                                |
   v                                                v
  [+]---> decoded X                                T(R)
   ^                                                |
   |                                                v
   |      +-----------+    +-----------+      +-----------+
decoded <-|  Inverse  | <--|  Inverse  | <----|  Scalar   |
R         | Transform |    | Quantizer |   |  | Quantizer |
          +-----------+    +-----------+   |  +-----------+
                                           v
                           +-------------+
            bitstream   <--| Coefficient |
            of coded T(R)  |    Coder    |
                           +-------------+
Figure 1: Traditional architecture containing Quantization and Transforms
            +-------------+             +-----------+
input X --> | Transform T |--> T(X)---> |    PVQ    |
            +-------------+             | Quantizer |
                                +-----> +-----------+     +-------------+
            +-------------+     |            |----------> | Coefficient |
predictor ->| Transform T |--> T(P)          v            |    Coder    |
P           +-------------+     |       +-----------+     +-------------+
                                +-----> |    PVQ    |            |
                                        |  Inverse  |            v
                                        | Quantizer |        bitstream
                                        +-----------+      of coded T(X)
                                             |
            +-----------+                    v
decoded X <-|  Inverse  | <-------- dequantized T(X)
            | Transform |
            +-----------+
Figure 2: AV1 with PVQ
In AV1, the skip flag for a partition block is true if all of the quantized coefficients in the partition are zero. The signaling of the prediction mode for a partition cannot be skipped. If the skip flag is true with PVQ, the predicted pixels are the final decoded pixels (aside from frame-wise in-loop filtering such as deblocking), as in AV1, and a forward transform of the predictor is not required.
While AV1 currently defines only one 'skip' flag for each 'partition' (the unit where prediction is done), PVQ introduces another kind of 'skip' flag, called 'ac_dc_coded', which is defined for each transform block (and thus for each Y'CbCr plane as well). AV1 allows a transform size smaller than the partition size, so a partition can contain multiple transform blocks. The ac_dc_coded flag signals whether the DC and/or the whole of the AC coefficients are coded by PVQ (though PVQ does not quantize DC itself), as sketched below.
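The four states of such a flag can be sketched as follows; the identifier names here are illustrative, not the actual ones in the PVQ-enabled AV1 source.

   /* Per-transform-block coding state signaled by 'ac_dc_coded'
    * (illustrative names). */
   typedef enum {
     PVQ_BLOCK_SKIP  = 0,  /* neither DC nor AC coded: block skipped */
     PVQ_DC_CODED    = 1,  /* only the DC coefficient is coded */
     PVQ_AC_CODED    = 2,  /* only the AC coefficients are coded */
     PVQ_AC_DC_CODED = 3   /* both DC and AC are coded */
   } pvq_ac_dc_coded;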
With the encoding options specified by both NETVC (<https://tools.ietf.org/html/draft-ietf-netvc-testing-03>) and AOM testing for the high-latency case, PVQ gives coding efficiency similar to that of AV1, as measured in PSNR BD-rate. Again, PVQ's activity masking is not turned on for this testing. Also, scalar quantization has matured over decades, while video coding with PVQ is much more recent.
We compare the coding efficiency on the IETF test sequence set "objective-1-fast", defined in <https://tools.ietf.org/html/draft-ietf-netvc-testing-03>, which consists of sixteen 1080p, seven 720p, and seven 640x360 sequences covering various types of content, including slow/high motion of people and objects, animation, computer games, and screen casting. Encoding is done for the first 30 frames of each sequence. The encoding options used are "--end-usage=q --cq-level=x --passes=2 --good --cpu-used=0 --auto-alt-ref=2 --lag-in-frames=25 --limit=30", which is the official IETF and AOM test condition for high-latency encoding, except for limiting encoding to 30 frames.
To make the comparison fair, some of the lambda values used in RDO are adjusted to match the balance of luma and chroma quality of the PVQ-enabled AV1 to that of current AV1.
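For context, the lambda values in question are the Lagrange multipliers of the usual rate-distortion cost minimized for each coding decision,

   J = D + lambda * R,

where D is the distortion and R is the rate in bits. Using a different lambda for the chroma planes than for luma shifts quality between the planes, which is what the adjustment above rebalances.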
The results are shown in Table 1, which gives the BD-rate change for several image quality metrics. (The encoders used to generate these results are available from the author's git repository <https://github.com/ycho/aom/commit/2478029a9b6d02ee2ccc9dbafe7809b5ef345814> and AOM's repository <https://aomedia.googlesource.com/aom/+/59848c5c797ddb6051e88b283353c7562d3a2c24>.)
                 +-----------+----------------------+
                 | Metric    | AV1 --> AV1 with PVQ |
                 +-----------+----------------------+
                 | PSNR      | -0.17%               |
                 | PSNR-HVS  |  0.27%               |
                 | SSIM      |  0.93%               |
                 | MS-SSIM   |  0.14%               |
                 | CIEDE2000 | -0.28%               |
                 +-----------+----------------------+

                  Table 1: BD-rate change by metric
Total encoding time increases roughly 20 times or more when intensive RDO options, such as "--passes=2 --good --cpu-used=0 --auto-alt-ref=2 --lag-in-frames=25", are turned on. The significant increase in encoding time is due to the increased computation required by PVQ. PVQ tries to find asymptotically-optimal codepoints (in the RD optimization sense) on a hypersphere with a greedy search, which has time complexity close to O(n*n) for n coefficients, while scalar quantization has time complexity O(n); a sketch of such a search is given below.
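The following C sketch shows the shape of such a greedy search, with the RD cost terms of the actual pvq_search_rdo_double() omitted and all names assumed for illustration: it places k unit pulses so as to maximize the correlation between the integer codepoint y and the input band x. Each pulse costs an O(n) scan, giving O(n*k) overall, which approaches O(n*n) as k grows with n.

   #include <math.h>
   #include <stdlib.h>

   /* Greedy PVQ shape search (sketch): fill y, with sum(|y[i]|) == k,
    * to maximize <x,y>/||y||, i.e. find the codepoint on the pyramid
    * closest in angle to x. */
   static void pvq_search_sketch(const double *x, int *y, int n, int k) {
     double xy = 0.0, yy = 0.0;  /* running <x,y> and <y,y> */
     int i, j;
     for (i = 0; i < n; i++) y[i] = 0;
     for (j = 0; j < k; j++) {
       int best = 0;
       double best_num = 0.0, best_den = 1.0;
       for (i = 0; i < n; i++) {
         /* <x,y> and <y,y> if one pulse (signed like x[i]) were
          * added at position i. */
         double num = xy + fabs(x[i]);
         double den = yy + 2.0 * abs(y[i]) + 1.0;
         /* Maximize num/sqrt(den); compare without square roots. */
         if (num * num * best_den > best_num * best_num * den) {
           best = i;
           best_num = num;
           best_den = den;
         }
       }
       xy = best_num;
       yy = best_den;
       y[best] += (x[best] < 0.0) ? -1 : 1;
     }
   }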
Compared to Daala, the search space for an RDO decision in AV1 is far larger because AV1 considers ten intra prediction modes and four different transforms (for the transform block sizes 4x4, 8x8, and 16x16 only), and the transform block size can be smaller than the prediction block size. Since the largest transform and prediction sizes in AV1 are currently 32x32 and 64x64, PVQ can be called approximately 5,160 times more often in AV1 than in Daala. Also, AV1 runs the transform and quantization for every candidate considered in RDO.
As an example, AV1 calls the PVQ function 632,520 times to encode grandma_qcif (176x144) in intra-frame mode, while Daala calls it only 3,843 times (QP = 30 for AV1 and QP = 39 for Daala, both of which correspond to an actual quantizer of 38). So, PVQ is called roughly 165 times more often in AV1 than in Daala.
Table 2 shows the frequency of function calls to the PVQ and scalar quantizers in AV1 at each speed level (where the AV1 encoding mode is 'good') for the same sequence and QP as in the example above. The first column indicates the speed level, the second column shows the number of calls to AV1's block quantizer, the third column shows the number of calls to PVQ quantization of a transform block (function od_pvq_encode() in <https://github.com/ycho/aom/blob/14981eebb4a08f74182cea3c17f7361bc79cf04f/av1/encoder/pvq_encoder.c#L763>), and the fourth column shows the number of calls to PVQ's search inside each band (function pvq_search_rdo_double() in <https://github.com/ycho/aom/blob/14981eebb4a08f74182cea3c17f7361bc79cf04f/av1/encoder/pvq_encoder.c#L84>). A smaller speed level gives slower encoding but better quality at the same rate by doing more RDO. The major difference from speed level 4 to 3 is enabling the use of transform blocks smaller than the prediction (i.e. partition) block.
   +-------+-----------------+-----------------+---------------------+
   | Speed | # of calls to   | # of calls to   | # of calls to PVQ   |
   | Level | AV1 quantizer   | PVQ quantizer   | search inside a     |
   |       |                 |                 | band                |
   +-------+-----------------+-----------------+---------------------+
   |   5   |          28,028 |          26,786 |             365,913 |
   |   4   |          57,445 |          56,980 |             472,222 |
   |   3   |         505,039 |         564,724 |           3,680,366 |
   |   2   |         505,039 |         564,724 |           3,680,366 |
   |   1   |         535,100 |         580,566 |           3,990,327 |
   |   0   |         589,931 |         632,520 |           4,109,113 |
   +-------+-----------------+-----------------+---------------------+

       Table 2: Number of quantizer calls at each speed level
Possible future work includes adjusting the activity masking parameters for AV1 and reducing the encoding time spent in PVQ's codepoint search.
The ongoing work of integrating PVQ into the AV1 video codec is located at the git repository <https://github.com/ycho/aom/tree/av1_pvq>.
Thanks to Tim Terriberry for his proofreading and valuable comments. Also thanks to Guillaume Martres for his contributions to integrating PVQ into AV1 during his internship at Mozilla, and to Thomas Daede for providing and maintaining the testing infrastructure by way of the "Are We Compressed Yet" (AWCY) web site <https://arewecompressedyet.com/>.
This memo includes no request to IANA.
[I-D.valin-netvc-pvq]
           Valin, J., "Pyramid Vector Quantization for Video Coding",
           Internet-Draft draft-valin-netvc-pvq-00, June 2015.

[Perceptual-VQ]
           Valin, JM. and TB. Terriberry, "Perceptual Vector
           Quantization for Video Coding", Proceedings of SPIE Visual
           Information Processing and Communication, February 2015.

[PVQ-demo] Valin, JM., "Daala: Perceptual Vector Quantization (PVQ)",
           November 2014.

[Terriberry_16]
           Terriberry, TB., "Perceptually-Driven Video Coding with the
           Daala Video Codec", Proceedings SPIE Volume 9971,
           Applications of Digital Image Processing XXXIX, September
           2016.