Internet-Draft | sceb | July 2019 |
Morton & Grimes | Expires 3 January 2020 |
This memo reclassifies ECT(1) to be an early notification of congestion on ECT(0) marked packets, which can be used by AQM algorithms and transports as an earlier signal of congestion than CE. It is a simple, transparent, and backward compatible upgrade to existing IETF-approved AQMs, RFC3168, and nearly all congestion control algorithms.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 3 January 2020.¶
Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] and [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This memo reclassifies ECT(1) to be an early notification of congestion on ECT(0) marked packets, which can be used by AQM algorithms and transports as an earlier signal of congestion than CE ("Congestion Experienced").¶
This memo limits its scope to the redefinition of the ECT(1) codepoint as SCE, "Some Congestion Experienced", with a few brief illustrations of how it may be used.¶
[RFC3168] defines the lower two bits of the (former) TOS byte in the IPv4/6 header as the ECN field. This may take four values: Not-ECT, ECT(0), ECT(1) or CE.¶
Binary Keyword                                 References
------------------------------------------------------------
00     Not-ECT (Not ECN-Capable Transport)     [RFC3168]
01     ECT(1) (ECN-Capable Transport(1))       [RFC3168]
10     ECT(0) (ECN-Capable Transport(0))       [RFC3168]
11     CE (Congestion Experienced)             [RFC3168]
Research has shown that the ECT(1) codepoint goes essentially unused: the "Nonce Sum" extension to ECN was not implemented in practice and was subsequently obsoleted by [RFC8311] (Section 3). Additionally, known [RFC3168] compliant senders do not emit ECT(1), compliant middleboxes do not alter the field to ECT(1), and compliant receivers all interpret ECT(1) identically to ECT(0). These are useful properties which represent an opportunity for improvement.¶
Experience gained with 7 years of [RFC8290] deployment in the field suggests that it remains difficult to maintain the desired 100% link utilisation, whilst simultaneously strictly minimising induced delay due to excess queue depth - irrespective of whether ECN is in use. This leads to a reluctance amongst hardware vendors to implement the most effective AQM schemes because their headline benchmarks are throughput-based.¶
The underlying cause is the very sharp "multiplicative decrease" reaction required of transport protocols to congestion signalling (whether that be packet loss or CE marks), which tends to leave the congestion window significantly smaller than the ideal BDP when triggered at only slightly above the ideal value. The availability of this sharp response is required to assure network stability (AIMD principle), but there is presently no standardised and backwards-compatible means of providing a less drastic signal.¶
As consensus has arisen that some form of ECN signalling should be an earlier signal than drop, this Internet-Draft changes the meaning of ECT(1) to be SCE, meaning "Some Congestion Experienced". The above ECN-field codepoint table then becomes:¶
Binary Keyword                                 References
------------------------------------------------------------
00     Not-ECT (Not ECN-Capable Transport)     [RFC3168]
01     SCE (Some Congestion Experienced)       [This Internet-draft]
10     ECT (ECN-Capable Transport)             [RFC3168]
11     CE (Congestion Experienced)             [RFC3168]
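As a non-normative illustration, the following Python sketch reads the ECN field from the low two bits of the (former) TOS byte using the redefined codepoints. The names Ecn and ecn_from_tos are hypothetical and are not defined by this memo.

```python
from enum import IntEnum


class Ecn(IntEnum):
    """ECN field values after the redefinition of ECT(1) as SCE."""
    NOT_ECT = 0b00  # Not ECN-Capable Transport
    SCE = 0b01      # Some Congestion Experienced (formerly ECT(1))
    ECT = 0b10      # ECN-Capable Transport (formerly ECT(0))
    CE = 0b11       # Congestion Experienced


def ecn_from_tos(tos: int) -> Ecn:
    """Return the ECN codepoint held in the low two bits of the TOS byte."""
    return Ecn(tos & 0b11)
```

The upper six bits of the byte are ignored by the mask, so the same helper could be applied to an IPv4 TOS byte or an IPv6 Traffic Class byte.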
This permits middleboxes implementing AQM to signal incipient congestion, below the threshold required to justify setting CE, by converting some proportion of ECT codepoints to SCE ("SCE marking"). Existing [RFC3168] compliant receivers MUST transparently ignore this new signal with respect to congestion control, and both existing and SCE-aware middleboxes MAY convert SCE to CE in the same circumstances as for ECT, thus ensuring backwards compatibility with [RFC3168] ECN endpoints.¶
Permitted ECN codepoint packet transitions by middleboxes are:¶
Not-ECT -> Not-ECT or DROP
ECT     -> ECT or SCE or CE or DROP
SCE     -> SCE or CE or DROP
CE      -> CE or DROP
In other words, for ECN-aware flows, the ECN marking of an individual packet MAY be increased by a middlebox to signal congestion, but MUST NOT be decreased, and packets SHALL NOT be altered to appear to be ECN-aware if they were not originally, nor vice versa. Note however that SCE is numerically less than ECT, but semantically greater, and the latter definition applies for this rule.¶
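The transition rule can be sketched as a small check that ranks codepoints by semantic severity rather than by numeric value. This is an illustrative Python sketch only; SEVERITY and transition_allowed are hypothetical names, and codepoints are represented as strings for readability.

```python
# Semantic severity of ECN-capable codepoints. SCE is numerically below ECT
# on the wire, but ranks above it for the purposes of this rule.
SEVERITY = {"ECT": 0, "SCE": 1, "CE": 2}


def transition_allowed(before: str, after: str) -> bool:
    """Return True if a middlebox may change codepoint `before` to `after`.

    `before` is one of "Not-ECT", "ECT", "SCE", "CE";
    `after` may additionally be "DROP".
    """
    if after == "DROP":
        return True  # any packet may be dropped
    if before == "Not-ECT" or after == "Not-ECT":
        # ECN capability must not be added or removed in flight
        return before == after
    # ECN-capable packets: marking may stay the same or increase, never decrease
    return SEVERITY[after] >= SEVERITY[before]
```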
New SCE-aware receivers and transport protocols SHALL continue to apply the [RFC3168] interpretation of the CE codepoint, that is, to signal the sender to back off send rate to the same extent as if a packet loss were detected. This maintains compatibility with existing middleboxes, senders and receivers.¶
New SCE-aware receivers and transport protocols SHOULD interpret the SCE codepoint as an indication of mild congestion, and respond accordingly by applying send rates intermediate between those resulting from a continuous sequence of ECT codepoints, and those resulting from a CE codepoint. The ratio of ECT and SCE codepoints received indicates the relative severity of such congestion, such that 100% SCE is very close to the threshold of CE marking, 100% ECT indicates that the bottleneck link may not be fully utilised, and some mixture of ECT and SCE codepoints indicates that some degree of queuing delay exists at the bottleneck link.¶
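One possible way for a receiver to summarise the ECT/SCE mixture is sketched below in Python. The function name sce_fraction and the choice to exclude CE and Not-ECT packets from the ratio are assumptions made for illustration; this memo does not specify a measurement method.

```python
def sce_fraction(codepoints):
    """Fraction of ECT-or-SCE packets in `codepoints` that carried SCE.

    Returns 0.0 when no ECT or SCE packets were observed.
    """
    ect = sum(1 for c in codepoints if c == "ECT")
    sce = sum(1 for c in codepoints if c == "SCE")
    total = ect + sce
    return sce / total if total else 0.0
```

A result near 0.0 corresponds to the "bottleneck may not be fully utilised" case, and a result near 1.0 to the "very close to the threshold of CE marking" case described above.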
Details of how to implement SCE awareness at the transport layer will be left to additional Internet-Drafts yet to be submitted. To ensure RTT-fair convergence with single-queue SCE AQMs, transports SHOULD stabilise at lower SCE-mark ratios for higher BDPs, and MAY reduce their response to CE marks if and only if they are responding to SCE signals received at around the same time (e.g. within 1-2 RTTs) in the same flow.¶
To maximise the benefit of SCE, middleboxes SHOULD produce SCE markings sooner than they produce CE markings, when the level of congestion increases. Single-queue AQMs MAY choose to prefer fairness between SCE and non-SCE flows, by instead beginning SCE marking at the lowest threshold of CE marking.¶
A simple and natural way to implement SCE in a Codel-type AQM is to mark all ECT packets as SCE if they are over half the Codel target sojourn time, and not marked CE by Codel itself. This threshold function does not necessarily produce the best performance, but is very easy to implement and provides useful information to SCE-aware flows, often sufficient to avoid receiving CE marks whilst still efficiently using available capacity.¶
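The threshold function described above might be sketched as follows in Python. The function name sce_threshold_mark is hypothetical, and the sketch assumes it runs after Codel has made its own CE decision, so that the codepoint passed in already reflects any CE mark.

```python
def sce_threshold_mark(codepoint: str, sojourn_ms: float,
                       target_ms: float = 5.0) -> str:
    """Apply threshold SCE marking to one packet.

    `codepoint` is the packet's ECN codepoint after Codel's own CE decision.
    `target_ms` defaults to the Codel default target of 5 ms.
    """
    if codepoint == "ECT" and sojourn_ms > target_ms / 2:
        return "SCE"
    # Not-ECT, SCE and CE packets are left unchanged
    return codepoint
```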
For a more sophisticated approach avoiding even small-scale oscillation, a stochastic ramp function may be implemented, with 100% marking probability at the Codel target, falling to 0% at some lower sojourn time at or above zero. The lower point of the ramp should be chosen so that SCE is not accidentally signalled due to CPU scheduling latencies or serialisation delays of single packets. Absent rigorous analysis of these factors, setting the lower limit at half the Codel target should be safe in many cases.¶
The default configuration of Codel is 100ms interval, 5ms target. A typical ramp function for these parameters might cease marking below 2.5ms sojourn time, increase marking probability linearly to 100% at 5ms, and mark at 100% for sojourn times above 5ms (the region in which CE marking is also possible).¶
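The ramp for the default parameters can be sketched in Python as below. The names sce_ramp_probability and should_mark_sce are hypothetical, and the injectable rng argument is an assumption added only to make the stochastic decision testable.

```python
import random


def sce_ramp_probability(sojourn_ms: float, target_ms: float = 5.0,
                         floor_ms: float = 2.5) -> float:
    """SCE marking probability: 0 at or below floor_ms, 1 at or above target_ms."""
    if sojourn_ms <= floor_ms:
        return 0.0
    if sojourn_ms >= target_ms:
        return 1.0
    return (sojourn_ms - floor_ms) / (target_ms - floor_ms)


def should_mark_sce(sojourn_ms: float, rng=random.random) -> bool:
    """Stochastic decision: mark with the ramp probability for this sojourn time."""
    return rng() < sce_ramp_probability(sojourn_ms)
```

For example, a packet with a sojourn time of 3.75ms lies halfway up the ramp and would be marked with probability 0.5.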
In single-queue AQMs, the above strategy will result in SCE flows yielding to pressure from non-SCE flows, since CE marks do not occur until SCE marking has reached 100%. A balance between smooth SCE behaviour and fairness versus non-SCE traffic can be found by having the marking ramp cross the Codel target at some lower SCE marking rate, perhaps even 0%. A two-part ramp, reaching 1/sqrt(X) at the Codel target (for some chosen X, a cwnd at which the crossover between smoothness and fairness occurs) and ramping up more steeply thereafter, has been implemented successfully for experimentation.¶
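A two-part ramp of this general shape might look like the Python sketch below. Only the 1/sqrt(X) value at the Codel target comes from the text above; the lower bound (floor_ms), the point at which 100% marking is reached (full_ms), and the name two_part_ramp are assumptions chosen for illustration. With these particular bounds, the upper segment is steeper than the lower one only when X is greater than 4.

```python
import math


def two_part_ramp(sojourn_ms: float, x: float, target_ms: float = 5.0,
                  floor_ms: float = 2.5, full_ms: float = 7.5) -> float:
    """Two-part SCE marking ramp.

    Rises linearly from 0 at floor_ms to 1/sqrt(x) at target_ms, then
    linearly to 1 at full_ms. floor_ms and full_ms are illustrative
    assumptions, not values taken from this memo.
    """
    knee = 1.0 / math.sqrt(x)  # marking probability at the Codel target
    if sojourn_ms <= floor_ms:
        return 0.0
    if sojourn_ms >= full_ms:
        return 1.0
    if sojourn_ms <= target_ms:
        return knee * (sojourn_ms - floor_ms) / (target_ms - floor_ms)
    return knee + (1.0 - knee) * (sojourn_ms - target_ms) / (full_ms - target_ms)
```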
Flow-isolating AQMs should avoid signalling SCE to flows classified as "sparse" in order to encourage the fastest possible convergence to the fair share.¶
For the avoidance of doubt, a decision to mark CE or to drop a packet always takes precedence over SCE marking.¶
There are several reasonable methods of producing SCE signals in a RED-type AQM.¶
The simplest would be a threshold function, giving a hard boundary in queue depth between 0% and 100% SCE marking. This could be a sensible option for limited hardware implementations. The threshold should be set below the point at which a growing queue might trigger CE marking or packet drops.¶
Another option would be to implement a second marking probability function, occupying a queue-depth space just below that occupied by the main marking probability function. This should be arranged so that high marking rates (ideally 100%) are achieved at or before the point at which CE marking or packet drops begin.¶
For PIE specifically, a second marking probability function could be added with the same parameters as the main marking probability function, except for a lower QDELAY_REF value. This would result in the SCE marking probability remaining strictly higher than the CE marking probability for ECT flows.¶
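The effect of the lower QDELAY_REF can be illustrated with a heavily simplified Python sketch of the PIE probability update, run twice over the same queue-delay trace. The alpha and beta values and the 15ms CE reference are assumed here to follow the RFC 8033 defaults; the 7.5ms SCE reference is a hypothetical value chosen for illustration, and PIE's auto-tuning, burst allowance and other heuristics are omitted.

```python
ALPHA = 0.125  # assumed RFC 8033 default weight on delay error
BETA = 1.25    # assumed RFC 8033 default weight on delay trend


def pie_step(p: float, qdelay: float, qdelay_old: float,
             qdelay_ref: float) -> float:
    """One simplified PIE probability update (delays in seconds), clamped to [0, 1]."""
    p += ALPHA * (qdelay - qdelay_ref) + BETA * (qdelay - qdelay_old)
    return min(1.0, max(0.0, p))


def run_dual_pie(trace, ce_ref: float = 0.015, sce_ref: float = 0.0075):
    """Run CE and SCE controllers side by side; return (p_ce, p_sce) per step."""
    p_ce = p_sce = 0.0
    prev = 0.0
    history = []
    for qdelay in trace:
        p_ce = pie_step(p_ce, qdelay, prev, ce_ref)
        p_sce = pie_step(p_sce, qdelay, prev, sce_ref)
        prev = qdelay
        history.append((p_ce, p_sce))
    return history
```

Because both controllers see the same delay samples and differ only in their reference, the SCE probability in this sketch is never below the CE probability, although both may be clamped to the same value at 0 or 1.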
Some mechanism should be defined to feed back SCE signals to the sender explicitly. Details of this are left to future I-Ds covering TCP in detail; use could be made of the redundant NS bit in the TCP header, which was formerly associated with ECT(1) in the Nonce Sum specification.¶
Alternatively, SCE can potentially be handled entirely by the receiver, and thus be entirely independent of any of the dozens of [RFC3168] compliant congestion control algorithms on the sender side. This would be done by adjusting the Receive Window, which has been a standard part of TCP from its inception. This alternative therefore requires the fewest protocol changes on the wire.¶
The recommended response to each single segment marked with SCE is to reduce cwnd by an amortised 1/sqrt(cwnd) segments. Other responses, such as the 1/cwnd from DCTCP, are also acceptable but may perform less well.¶
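The recommended response can be sketched in Python by keeping cwnd as a fractional number of segments, so that the per-mark reductions accumulate (amortise) naturally. The name on_sce_mark and the floor of 2 segments are assumptions made for illustration and are not specified by this memo.

```python
import math


def on_sce_mark(cwnd: float, min_cwnd: float = 2.0) -> float:
    """Return the new cwnd (in segments) after one SCE-marked segment.

    Reduces cwnd by 1/sqrt(cwnd) segments, never going below min_cwnd
    (an illustrative floor, not taken from this memo).
    """
    return max(min_cwnd, cwnd - 1.0 / math.sqrt(cwnd))
```

For example, at a cwnd of 100 segments each SCE mark removes about 0.1 segment, so a full window of marked segments reduces cwnd by slightly more than 10 segments in total.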
There are no IANA considerations.¶
An adversary could inappropriately set SCE marks at middleboxes they control to slow down SCE-aware flows, eventually reaching a minimum congestion window. However, the same threat already exists with respect to inappropriately setting CE marks on normal ECN flows, and this would have a greater impact per mark. Therefore no new threat is exposed by SCE in practice.¶
An adversary could also simply ignore SCE marks at the receiver, or ignore SCE information fed back from the receiver to the sender, in an attempt to gain some advantage in throughput. Again, the same could be said about ignoring CE marks, so no truly new threat is exposed. Additionally, correctly implemented SCE detection may actually improve long-term goodput compared to ignoring SCE.¶
An adversary could erase congestion information by converting SCE marks to ECT or Not-ECT codepoints, thus hiding it from the receiver. This has equivalent effects to ignoring SCE signals at the receiver. An identical threat already exists for erasing congestion information from CE marked packets, and may be mitigated by AQMs switching to dropping packets from flows observed to be non-responsive to CE.¶
Thanks to Dave Taht for his contributions to the SCE effort, and for his work writing the original draft-morton-taht-sce-00, submitted for IETF 104, on which this draft is based.¶
Many thanks to John Gilmore, the members of the ecn-sane project and the cake@lists.bufferbloat.net mailing list, and the former IETF AQM working group.¶