Network Working Group                                      S. Midtskogen
Internet-Draft                                               A. Fuldseth
Intended status: Standards Track                               M. Zanaty
Expires: May 4, 2017                                               Cisco
                                                        October 31, 2016
Constrained Low Pass Filter
draft-midtskogen-netvc-clpf-03
This document describes a low complexity filtering technique which is being used as a low pass loop filter in the Thor video codec.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 4, 2017.
Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Modern video coding standards such as Thor [I-D.fuldseth-netvc-thor] include in-loop filters which correct artifacts introduced in the encoding process. Thor includes a deblocking filter which corrects artifacts introduced by the block based nature of the encoding process, and a low pass filter correcting artifacts not corrected by the deblocking filter, in particular artifacts introduced by quantisation errors of transform coefficients and by the interpolation filter. Since in-loop filters have to be applied in both the encoder and decoder, it is highly desirable that these filters have low computational complexity.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
This document will refer to a pixel X and six of its neighbouring pixels A, B, C, D, E, F ordered in the following pattern.
+---+---+---+---+---+
|   |   | A |   |   |
+---+---+---+---+---+
| B | C | X | D | E |
+---+---+---+---+---+
|   |   | F |   |   |
+---+---+---+---+---+
Figure 1: Filter pixel positions
In Thor, each frame is divided into filter blocks (FBs) of 128x128, 64x64 or 32x32 pixels; the FB size is signalled for each frame to be filtered. Each frame is also divided into coding blocks (CBs), which range from 8x8 to 128x128 pixels independently of the FB size. The filter described in this draft can be switched on or off for the entire frame, or optionally on or off for each FB. CBs that have been coded in skip mode are not filtered, and if an FB contains only CBs coded in skip mode, the FB is not filtered and no signal is transmitted for it.
If the frame cannot fit a whole number of FBs, the FBs at the right and bottom edges are clipped to fit. For instance, if the frame resolution is 1920x1080 and the FB size is 128x128, the FBs at the bottom of the frame become 128x56.
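The clipping rule can be sketched in a few lines of C. The helper names fb_dim and fb_count are illustrative only and are not taken from the Thor source.

```c
#include <assert.h>

/* Size along one axis of the filter block that starts at pixel offset
 * `pos`, for a frame that is `frame_len` pixels long on that axis and a
 * nominal FB size of `fb_size`. A block that would extend past the
 * frame edge is clipped. Hypothetical helper, not part of Thor. */
static int fb_dim(int frame_len, int fb_size, int pos)
{
    assert(fb_size > 0 && pos >= 0 && pos < frame_len);
    int remaining = frame_len - pos;
    return remaining < fb_size ? remaining : fb_size;
}

/* Number of filter blocks along one axis, counting a clipped last block. */
static int fb_count(int frame_len, int fb_size)
{
    assert(fb_size > 0 && frame_len > 0);
    return (frame_len + fb_size - 1) / fb_size;
}
```

For the 1920x1080 example with 128x128 FBs, fb_count(1080, 128) is 9 and the last row starts at offset 1024, where fb_dim(1080, 128, 1024) is 56.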
Given a pixel X and its neighbouring pixels described above, we can define a general non-linear filter as follows, where clip(v,lo,hi) limits v to the range [lo,hi]:
X' = X + clip(a*clip(A-X,-s,s) + b*clip(B-X,-s,s) +
              c*clip(C-X,-s,s) + d*clip(D-X,-s,s) +
              e*clip(E-X,-s,s) + f*clip(F-X,-s,s),-g,g)
Figure 2: Equation 1
If a neighbour pixel is outside the image frame, it is given the same value as the closest pixel within the frame. To avoid dependencies prohibiting parallel processing, all neighbour pixels must be the unfiltered pixels of the frame being filtered.
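A minimal sketch of the border rule for an 8-bit plane follows. The function names and the buffer layout (row-major, with a stride) are assumptions made for illustration. To satisfy the parallelism requirement, a caller would read neighbours from the unfiltered plane and write results to a separate output plane.

```c
#include <stdint.h>

/* Read pixel (x, y), replicating the closest in-frame pixel when the
 * coordinate lies outside the frame. Hypothetical helper. */
static int px_clamped(const uint8_t *src, int stride,
                      int width, int height, int x, int y)
{
    if (x < 0) x = 0; else if (x >= width) x = width - 1;
    if (y < 0) y = 0; else if (y >= height) y = height - 1;
    return src[y * stride + x];
}

/* Gather the neighbours of (x, y) in the order A, B, C, D, E, F used in
 * Figure 1: A above, B and C to the left, D and E to the right, F below.
 * `src` must hold the unfiltered frame. */
static void clpf_neighbours(const uint8_t *src, int stride,
                            int width, int height, int x, int y, int n[6])
{
    n[0] = px_clamped(src, stride, width, height, x,     y - 1); /* A */
    n[1] = px_clamped(src, stride, width, height, x - 2, y);     /* B */
    n[2] = px_clamped(src, stride, width, height, x - 1, y);     /* C */
    n[3] = px_clamped(src, stride, width, height, x + 1, y);     /* D */
    n[4] = px_clamped(src, stride, width, height, x + 2, y);     /* E */
    n[5] = px_clamped(src, stride, width, height, x,     y + 1); /* F */
}
```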
Experiments in Thor have shown that a good compromise between complexity and performance is a=f=1/4, b=e=1/16, c=d=3/16, with the filter strength s signalled at frame level as 1, 2 or 4 when the bitdepth is 8. The strengths are scaled according to the bitdepth, so they become 4, 8 and 16 when the bitdepth is 10, and 16, 32 and 64 when the bitdepth is 12. These values of a, b, c, d, e and f eliminate the need for the outer clipping to +/-g. The result of the division is rounded to the nearest integer.
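The listed strengths are consistent with a left shift by (bitdepth - 8). The sketch below assumes that reading. The name clpf_strength is hypothetical, and only the bitdepths 8, 10 and 12 named in the text are accepted.

```c
#include <assert.h>

/* Scale a base strength (1, 2 or 4, as signalled for 8-bit video) to the
 * given bitdepth. Assumption: the scaling is a shift by (bitdepth - 8),
 * which reproduces the values 4/8/16 and 16/32/64 given in the text. */
static int clpf_strength(int base, int bitdepth)
{
    assert(base == 1 || base == 2 || base == 4);
    assert(bitdepth == 8 || bitdepth == 10 || bitdepth == 12);
    return base << (bitdepth - 8);
}
```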
This gives us the equation:
X' = X + (4*clip(A-X,-s,s) + clip(B-X,-s,s) +
          3*clip(C-X,-s,s) + 3*clip(D-X,-s,s) +
          clip(E-X,-s,s) + 4*clip(F-X,-s,s)) / 16
Figure 3: Equation 2
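One way to see why the outer clipping becomes redundant, offered as a reading of the text rather than as the authors' own derivation: the six weights sum to 16 and every inner clip is bounded by s in magnitude, so the correction cannot exceed s before rounding. The shorthand c_N is introduced here for illustration only.

```latex
\left| \frac{4c_A + c_B + 3c_C + 3c_D + c_E + 4c_F}{16} \right|
  \le \frac{(4 + 1 + 3 + 3 + 1 + 4)\, s}{16} = s,
\qquad \text{where } c_N = \operatorname{clip}(N - X, -s, s).
```

An outer clip to +/-g with any g >= s would therefore never take effect.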
It can be noted that a=c=d=f=1/4, b=e=0 and s=1 give a slightly simpler filter which is very similar to the one described in the first version of this draft.
The filter leaves the encoder 13 different choices for a frame. The filter can be disabled for the entire frame, or the frame can be filtered using any one of the 12 distinct combinations of strength (1, 2 or 4, scaled for bitdepth), non-skip FB signal (enabled/disabled) and FB size (32x32, 64x64 or 128x128). The FB size only matters when FB signalling is in use, so each strength has one configuration without FB signalling and three with it.
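The count of 13 can be checked by enumerating the configurations. The struct and function below are hypothetical and exist only to make the counting explicit; they do not reflect how Thor represents these choices.

```c
#include <stddef.h>

typedef struct {
    int enabled;   /* 0: filter disabled for the whole frame */
    int strength;  /* base strength 1, 2 or 4, before bitdepth scaling */
    int fb_signal; /* 1: an on/off flag is sent per non-skip FB */
    int fb_size;   /* 32, 64 or 128; only meaningful if fb_signal is 1 */
} clpf_config;

/* Fill `out` with every distinct frame-level choice and return how many
 * there are: 1 (disabled) + 3 strengths * (1 unsignalled + 3 FB sizes). */
static size_t clpf_enumerate(clpf_config out[13])
{
    static const int strengths[3] = { 1, 2, 4 };
    static const int sizes[3] = { 32, 64, 128 };
    size_t n = 0;

    out[n++] = (clpf_config){ 0, 0, 0, 0 };
    for (int i = 0; i < 3; i++) {
        /* Whole frame filtered, no per-FB flag, so FB size is irrelevant. */
        out[n++] = (clpf_config){ 1, strengths[i], 0, 0 };
        for (int j = 0; j < 3; j++)
            out[n++] = (clpf_config){ 1, strengths[i], 1, sizes[j] };
    }
    return n;
}
```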
The decisions at both frame level and FB level may be based on rate-distortion optimisation (RDO), but an encoder running in a low-complexity mode, or possibly a low-delay mode, may instead assume that a fixed mode will be beneficial. In general, using s=2, a QP dependent FB size and RDO only at the FB level gives good results.
However, because of the low complexity of the filter, fully RDO based decisions are not costly. The distortion of the 13 configurations of the filter can easily be computed in a single pass by keeping track of the distortions of the three different strengths and the bit costs for different FB sizes.
The filter is applied after the deblocking filter.
The filter has been designed to offer the best compromise between low complexity and performance. A single pixel can be filtered with simple operations as illustrated by this C function:
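/* Limit v to the range [lo, hi]. */
static int clip(int v, int lo, int hi)
{
  return v < lo ? lo : v > hi ? hi : v;
}

/* Returns the rounded correction for pixel X; the filtered pixel is
   X plus the returned value. */
int clpf_sample(int X, int A, int B, int C, int D, int E, int F, int s)
{
  int delta =
    4*clip(A - X, -s, s) + clip(B - X, -s, s) +
    3*clip(C - X, -s, s) + 3*clip(D - X, -s, s) +
    clip(E - X, -s, s) + 4*clip(F - X, -s, s);
  return (8 + delta - (delta < 0)) >> 4;  // Assumes arithmetic shift
}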
Figure 4: C code
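The return expression in Figure 4 depends on the right shift of a negative value being arithmetic, which C leaves implementation-defined. A portable equivalent can be sketched by shifting only non-negative values. The name clpf_round16 is hypothetical, and the sketch assumes the intended rounding is to the nearest integer with ties away from zero, which is what the original expression yields on arithmetic-shift platforms.

```c
/* Divide delta by 16, rounding to the nearest integer with ties away
 * from zero, without ever right-shifting a negative value. */
static int clpf_round16(int delta)
{
    return delta < 0 ? -((8 - delta) >> 4) : (8 + delta) >> 4;
}
```

For example, a delta of 8 (0.5 after division) rounds to 1 and a delta of -8 rounds to -1, while 7 and -7 both round to 0.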
Also, these operations are easily vectorised on architectures supporting SIMD instructions, such as x86/SSE4 and ARM/NEON. The pixel difference needs 9 bits, but because it is clipped afterwards it can be computed in 8 bits by adding an offset to the pixels and using 8 bit saturated signed subtraction. This means that 16 pixels per core can be filtered in parallel on these architectures. Clipping at frame borders can be implemented using shuffle instructions.
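The 8 bit trick can be modelled in scalar C to show why it is exact. The sketch assumes an offset of 128, which maps unsigned pixels 0..255 onto the signed range -128..127. sat_sub_s8 models one lane of a signed saturating subtract such as _mm_subs_epi8 (SSE2) or vqsubq_s8 (NEON). All function names here are illustrative.

```c
#include <stdint.h>

/* Limit v to the range [lo, hi]. */
static int clip(int v, int lo, int hi)
{
    return v < lo ? lo : v > hi ? hi : v;
}

/* Scalar model of one lane of an 8-bit signed saturating subtraction. */
static int8_t sat_sub_s8(int8_t a, int8_t b)
{
    return (int8_t)clip((int)a - (int)b, -128, 127);
}

/* clip(n - x, -s, s) for 8-bit pixels using only 8-bit values.
 * Saturation may distort the difference only where its magnitude
 * exceeds 127, and the final clip to +/-s (s <= 127) hides that. */
static int clipped_diff_8bit(uint8_t n, uint8_t x, int s)
{
    int8_t nb = (int8_t)((int)n - 128); /* offset into signed range */
    int8_t xb = (int8_t)((int)x - 128);
    return clip(sat_sub_s8(nb, xb), -s, s);
}
```

Since the offset cancels in the subtraction, the result matches the 9 bit computation for every pair of 8-bit pixels and every strength used at bitdepth 8.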
A C implementation using x86/SSE4 intrinsics required 6.8 instructions per pixel to filter a single 8x8 block. The corresponding number for ARM/NEON (armv7) was 4.9. The compiler was gcc 4.8.4 in both cases.
Since the filter only needs to look up pixels in the line directly above and below the pixel to be filtered, the line buffer requirement in hardware implementations is very low.
The table below shows the filter's effect on bandwidth for a selection of 10 second video sequences encoded in Thor with uni-prediction only. The numbers have been computed using the Bjontegaard Delta Rate (BDR). BDR-low and BDR-high indicate the effect at low and high bitrates, respectively, as described in [BDR].
The effect of the filter was tested in two low-delay encoder configurations: high complexity, in which the encoder strongly favours compression efficiency over CPU usage, and medium complexity, which is more suited for real-time applications. The bandwidth reduction is somewhat less in the high complexity configuration.
+----------------+--------------------+--------------------+
|                | MEDIUM COMPLEXITY  |  HIGH COMPLEXITY   |
+----------------+------+------+------+------+------+------+
|                |      | BDR- | BDR- |      | BDR- | BDR- |
|Sequence        |  BDR |  low | high |  BDR |  low | high |
+----------------+------+------+------+------+------+------+
|Kimono          | -2.7%| -2.3%| -3.4%| -1.9%| -1.8%| -2.0%|
|BasketballDrive | -3.3%| -2.5%| -4.5%| -2.1%| -1.6%| -3.0%|
|BQTerrace       | -7.2%| -4.9%| -9.1%| -5.5%| -3.7%| -6.7%|
|FourPeople      | -5.7%| -3.9%| -8.6%| -4.0%| -2.8%| -6.0%|
|Johnny          | -5.9%| -4.0%| -9.0%| -4.7%| -4.0%| -5.8%|
|ChangeSeats     | -6.4%| -3.4%|-10.8%| -4.5%| -2.8%| -6.8%|
|HeadAndShoulder | -8.6%| -2.6%|-18.8%| -5.8%| -2.2%|-11.1%|
|TelePresence    | -5.9%| -3.1%|-10.7%| -4.0%| -2.0%| -7.0%|
+----------------+------+------+------+------+------+------+
|Average         | -5.7%| -3.3%| -9.4%| -4.0%| -2.6%| -6.0%|
+----------------+------+------+------+------+------+------+
Figure 5: Compression Performance without Biprediction
While the filter objectively performs better at relatively high bitrates, the subjective effect seems better at relatively low bitrates, and overall the subjective effect seems better than what the objective numbers suggest.
If biprediction is allowed, there is generally less bandwidth reduction as the table below shows. These results reflect low-delay biprediction without frame reordering.
+----------------+--------------------+--------------------+
|                | MEDIUM COMPLEXITY  |  HIGH COMPLEXITY   |
+----------------+------+------+------+------+------+------+
|                |      | BDR- | BDR- |      | BDR- | BDR- |
|Sequence        |  BDR |  low | high |  BDR |  low | high |
+----------------+------+------+------+------+------+------+
|Kimono          | -2.2%| -1.8%| -2.7%| -1.4%| -1.3%| -1.5%|
|BasketballDrive | -2.6%| -2.5%| -2.7%| -1.4%| -1.6%| -1.1%|
|BQTerrace       | -4.1%| -3.1%| -4.7%| -2.7%| -2.7%| -2.5%|
|FourPeople      | -4.0%| -2.9%| -5.3%| -2.7%| -1.9%| -3.4%|
|Johnny          | -3.5%| -2.7%| -4.6%| -2.2%| -1.6%| -3.1%|
|ChangeSeats     | -4.2%| -3.0%| -6.1%| -2.6%| -2.0%| -3.2%|
|HeadAndShoulder | -4.1%| -2.9%| -6.1%| -2.3%| -1.8%| -2.8%|
|TelePresence    | -2.8%| -1.9%| -4.3%| -1.6%| -1.2%| -2.1%|
+----------------+------+------+------+------+------+------+
|Average         | -3.4%| -2.6%| -4.6%| -2.1%| -1.9%| -2.5%|
+----------------+------+------+------+------+------+------+
Figure 6: Compression Performance with Biprediction
This document has no IANA considerations yet. TBD
This document has no security considerations yet. TBD
The authors would like to thank Gisle Bjontegaard for reviewing this document and design, and providing constructive feedback and direction.
[I-D.fuldseth-netvc-thor]  Fuldseth, A., Bjontegaard, G., Midtskogen, S., Davies, T., and M. Zanaty, "Thor Video Codec", Internet-Draft draft-fuldseth-netvc-thor-02, March 2016.
[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[BDR]  Bjontegaard, G., "Calculation of average PSNR differences between RD-curves", ITU-T SG16 Q6 VCEG-M33, April 2001.