Internet DRAFT - draft-even-tsvwg-datacenter-fast-congestion
draft-even-tsvwg-datacenter-fast-congestion
TSVWG R. Even
Internet-Draft M. Liu
Intended status: Informational Y. Zhang
Expires: August 7, 2020 Huawei
February 4, 2020
Data Center Fast Congestion Management
draft-even-tsvwg-datacenter-fast-congestion-00
Abstract
A good congestion control for data centers (DC) should provide low
latency, fast convergence and high link utilization. Since multiple
applications with different requirements may run on the DC network it
is important to provide fairness between different applications that
may use different congestion algorithms. An important issue from the
user perspective is to achieve short Flow Completion Time (FCT).
This document proposes data center congestion control direction
aiming to achieve high performance while proving fairness.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 7, 2020.
Copyright Notice
Copyright (c) 2020 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
Even, et al. Expires August 7, 2020 [Page 1]
Internet-Draft DC Fast Congestion February 2020
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 3
3. Congestion Handling Cases . . . . . . . . . . . . . . . . . . 3
3.1. Congestion only in leaf switch connected to receiver . . 3
3.2. Congestion in the Spine switch . . . . . . . . . . . . . 4
3.2.1. ECN case . . . . . . . . . . . . . . . . . . . . . . 4
3.2.2. Spine and leaf switches share information . . . . . . 4
3.2.3. FCR from spine and leaf switches . . . . . . . . . . 4
3.3. Congestion in leaf switch connected to data sender . . . 4
4. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
5. Rate Information . . . . . . . . . . . . . . . . . . . . . . 5
6. Requirements . . . . . . . . . . . . . . . . . . . . . . . . 5
7. Implementation Options . . . . . . . . . . . . . . . . . . . 6
8. Tests results . . . . . . . . . . . . . . . . . . . . . . . . 6
8.1. Many senders to one receiver . . . . . . . . . . . . . . 6
9. Security Considerations . . . . . . . . . . . . . . . . . . . 8
10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 8
11. References . . . . . . . . . . . . . . . . . . . . . . . . . 9
11.1. Normative References . . . . . . . . . . . . . . . . . . 9
11.2. Informative References . . . . . . . . . . . . . . . . . 9
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10
1. Introduction
The major use case that we are looking at is congestion control for
Data Centers, a controlled environment as specified in
RFC8085[RFC8085]. With the emerging Distributed Storage, AI/HPC
(High Performance Computing), Machine Learning, etc., modern
datacenter applications demand high throughput (40Gbps and above)
with ultra-low latency of less than 10 microsecond per hop from the
network, with low CPU overhead. The end to end latency should be
less than 50usec, this value is based on DCQCN [DCQCN]. The high
link speed (>40Gb/s) in Data Centers (DC) are making network
transfers complete faster and in fewer RTTs. Network traffic in a
data center is often a mix of short and long flows, where the short
flows require low latencies and the long flows require high
throughputs.
A good congestion control for data centers (DC) should provide low
latency, fast convergence and high link utilization. Since multiple
applications with different requirements may run on the DC network it
Even, et al. Expires August 7, 2020 [Page 2]
Internet-Draft DC Fast Congestion February 2020
is important to provide fairness between different applications that
may use different congestion algorithms. An important issue from the
user perspective is to achieve short Flow Completion Time (FCT).
A typical DC architecture is composed of a spine-leaf topology where
there are three hop switches at most for a flow. If we look from the
flow view then we can assume that for the first hop switch there is
low probability for congestion. The congestion will happen in higher
probability at the spine or the last hop. The figure bellow shows a
simple spine-leaf topology; in a typical DC there will be multiple
Spines and Leaves.
--------
|Spine1|
--------
| \
| \
| \
------ ------ ------- ------
| NIC|__|Leaf1| |Leaf2| ____ |NIC |
|Send| ------ ------- |Recv|
------ ------
2. Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
3. Congestion Handling Cases
3.1. Congestion only in leaf switch connected to receiver
The leaf switch is congested and does not receive any ECN CE marking
on incoming streams. The leaf switch sends FCR (Fast Congestion
Response) message to all sending NICs. The general case requires
that the leaf switch will know who are the senders and if they
support FCR. There is also a requirement to define how the congested
leaf connects and send the FCR message to the senders. If not all
senders whose streams are congesting the same egress port support FCR
the congested leaf switch will drop back to use ECN CE marking to the
receiver. Another option is to send FCR to the senders that support
it and use ECN CE marking on the flows from senders that do not
support FCR, in this case the switch should wait for at least one RTT
Even, et al. Expires August 7, 2020 [Page 3]
Internet-Draft DC Fast Congestion February 2020
before sending a second FCR to allow all senders to drop their
sending rate.
3.2. Congestion in the Spine switch
There are a couple of options for supporting this case. The Spine
and leaf switch will need to be aware of which option is in use.
3.2.1. ECN case
The leaf switch receives ECN CE marks from the spine. The leaf
switch does not know what rate information it can send regardless if
it is congested or not. The leaf switch will convey ECN marking to
the receiver.
3.2.2. Spine and leaf switches share information
The Spine switch provides rate/congestion information to the
downstream leaf switch. The leaf switch may be congested or not but
will be responsible to send the FCR message to the sending NICs. The
information from the spine may provide rate information using an FCR
like message.
3.2.3. FCR from spine and leaf switches
The Spine switch will send FCR to the sending NICs and will not send
ECN marking to the downstream leaf switch. In this option if there
is also congestion on the downstream leaf a second FCR message will
be sent from the leaf to the sending NIC who will have to use the
lower recommended rate information.
3.3. Congestion in leaf switch connected to data sender
This case has lower probability but in case of congestion the leaf
switch will send FCR message to all the contributing NICs of the flow
causing the congestion on the congested egress port. If FCR is not
supported by all the congesting NICs, the switch will CE mark these
flows, this will cause the FCR supporting NICs to respond faster and
the switch should allow the other streams to respond (wait little
over an RTT time) before sending another FCR.
4. Summary
If all NICs currently sending data to the leaf switch support FCR
messages it is safe to use FCR and if the congestion is in the Spine
switch the action will be according to the options in section
Section 3.2.
Even, et al. Expires August 7, 2020 [Page 4]
Internet-Draft DC Fast Congestion February 2020
If the leaf switch knows that not all NICs sending data through the
switch support FCR, the leaf switch may fall back to ECN marking.
Another option is to use mixed mode by sending FCR to supporting NICs
and ECN marks towards the receiver, the senders that support FCR
should use the received FCR and ignore the ECN message from the
receiver.
In the case where there are multiple congestion points, the NIC
should use the lowest rate information from all received FCRs.
5. Rate Information
The leaf switch needs to supply rate information using the FCR
message. The same rate information will be sent to all data senders
to the congested port regardless of the rate they needed. This may
cause underutilization of the available bandwidth if some of them
have no need for all the recommended rate; this will be addressed by
the leaf switch sending updated rate information based on the current
usage after a number of RTTs. The leaf switch may also send updated
FCR message when more bandwidth is available, for example when
senders stop sending. Note that sending such information may cause
congestion on upstream switches; another option is to use the sender
congestion control to raise the sending rate according to its CC
algorithm.
In the tests that were done so far, the solution was that all senders
received the same rate information. We need to specify what we would
like to send as the content of the rate information in the FCR
message (bits/sec, number of bytes to send similar to wnd in TCP).
6. Requirements
To support FCR based on the above use cases requires:
1. The congested leaf should be able to know which data sources
support FCR.
2. The congested leaf should be able to send the FCR message in-path
for example by using TCP/UDP options or in the UDP applications
back channel. Another option is to establish a connection to the
data senders and send FCR messages to them.
3. Sender should be able to start sending at maximum rate if the new
stream is the only stream sent by the sender.
Even, et al. Expires August 7, 2020 [Page 5]
Internet-Draft DC Fast Congestion February 2020
7. Implementation Options
The FCR message from the network to the data sender MUST only be
deployed in a controlled environment [RFC8085] such as Data Centers.
The FCR message should provide an identification of the stream for
example by providing the source and destination IP and Port number of
the flow.
FCR should only be deployed in an intra-data-center environment where
both endpoints and the switching fabric are under a single
administrative domain. FCR MUST NOT be deployed over the public
Internet
1. The tests are based on ROCEv2 [RoCEv2] using a revised CNP
message and assume all senders support FCR. To use this option
for ROCEv2 the data sender should mark support for the revised
CNP message, this will allow the leaf switch to know if it can
send back the revised CNP. This implementation mode is for
testing only, we do not propose this mode for a solution.
2. For the proposed solution for the general case there may be a
couple of options. The preference is to use a generic message at
the transport level (TCP/UDP) otherwise will need a different
message per application. The suggested proposal is to use new
TCP option [RFC0793] and [RFC2460] for IPv6 that can be piggy
backed on the ack message. There should be an FCR support option
sent by the data sender. For UDP where a back channel is usually
in the application layer we can use UDP options
[I-D.ietf-tsvwg-udp-options] for announcing FCR support and using
the application back channel in an application extension or in a
UDP option to send FCR (In the testing we used a revised CNP
message for ROCEv2). Another option is to use IOAM like
mechanism (the general IOAM specification is
[I-D.ietf-ippm-ioam-data], the loopback option is in
[I-D.ietf-ippm-ioam-flags], sending message from the leaf switch
can be based on https://tools.ietf.org/id/draft-ioamteam-ippm-
ioam-direct-export-00.txt)
8. Tests results
Note: this can be an appendix later if relevant
8.1. Many senders to one receiver
In this test scenario we had six senders and one receiver on a single
switch on a 25 Gbit/sec connection. Five senders were sending long
flows to create congestion and the sixth sender sent continuous 8
Bytes packets to test latency.
Even, et al. Expires August 7, 2020 [Page 6]
Internet-Draft DC Fast Congestion February 2020
+---------------------+--------+------------+-----------------------+
| Network Average | NIC CC | Network CC | Improvement |
| Load | | | percentage |
+---------------------+--------+------------+-----------------------+
| 30% | 1.61 | 1.61 | 0.00% |
| 50% | 2.68 | 2.68 | 0.00% |
| 80% | 4.23 | 4.24 | 0.24% |
| 100% | 4.36 | 4.51 | 3.44% |
+---------------------+--------+------------+-----------------------+
Sender NIC BW(Gbps)
The bandwidth of NIC CC and Network CC are almost the same in the
long flows case
+--------------------+---------+------------+-----------------------+
| Network Average | NIC CC | Network CC | Improvement |
| Load | | | percentage |
+--------------------+---------+------------+-----------------------+
| 30% | 5.89 | 5.79 | 1.70% |
| 50% | 6.04 | 6.04 | 0.00% |
| 80% | 7.33 | 6.67 | 9.00% |
| 100% | 7.45 | 6.78 | 8.99% |
+--------------------+---------+------------+-----------------------+
Latency flow result(us) - Average
+--------------------+---------+------------+-----------------------+
| Network Average | NIC CC | Network CC | Improvement |
| Load | | | percentage |
+--------------------+---------+------------+-----------------------+
| 30% | 8.89 | 7.65 | 13.95% |
| 50% | 9.14 | 8.46 | 7.44% |
| 80% | 12.94 | 8.74 | 32.46% |
| 100% | 11.6 | 8.74 | 24.66% |
+--------------------+---------+------------+-----------------------+
Latency flow result(us) - 99%
Even, et al. Expires August 7, 2020 [Page 7]
Internet-Draft DC Fast Congestion February 2020
+--------------------+---------+-------------+----------------------+
| Network Average | NIC CC | Network CC | Improvement |
| Load | | | percentage |
+--------------------+---------+-------------+----------------------+
| 30% | 21.77 | 8.79 | 59.62% |
| 50% | 24.89 | 11.8 | 52.59% |
| 80% | 23.45 | 9.36 | 60.09% |
| 100% | 22.91 | 9.19 | 59.89% |
+--------------------+---------+-------------+----------------------+
Latency flow result(us) - 99.9%
We can see that the average latency is reduced by maximum 9% and the
99.9% latency which indicates the maximum queue size is reduced by
maximum 60%.
The results show in that for the long flow many-to-one situation the
Network CC achieves the same bandwidth as the NIC CC and better
latency for mice flow.
9. Security Considerations
The FCR message is hard to secure, sending an FCR message from the
network to the source has security risks since it can be easily used
for DOS attack. This solution must only be used in a managed network
[RFC8085]. The FCR message must be terminated in the managed network
and should not cross the network domain.
Since this message is sent in a closed managed network it does not
have the same security concerns as ICMP source quench message
[RFC5927] defined on the general Internet.
An attacker can send an FCR message with lower or higher rate
information. This may cause an underutilization of the network or
congestion. The network entity closest to the receiver should
provide an alert if an unexpected rate is being used which may hint
that such an attack is taking place. A sender may also try to
identify if the FCR message has rate information in the expected
range.
10. IANA Considerations
TBD
Even, et al. Expires August 7, 2020 [Page 8]
Internet-Draft DC Fast Congestion February 2020
11. References
11.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
11.2. Informative References
[DCQCN] Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M.,
Liron, Y., Padhye, J., Raindel, S., Yahia, M. H., and M.
Zhang, "Congestion control for large-scale RDMA
deployments. In ACM SIGCOMM Computer Communication Review,
Vol. 45. ACM, 523-536.", 8 2015,
<https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/
p523.pdf>.
[I-D.ietf-ippm-ioam-data]
Brockners, F., Bhandari, S., Pignataro, C., Gredler, H.,
Leddy, J., Youell, S., Mizrahi, T., Mozes, D., Lapukhov,
P., Chang, R., daniel.bernier@bell.ca, d., and J. Lemon,
"Data Fields for In-situ OAM", draft-ietf-ippm-ioam-
data-07 (work in progress), September 2019.
[I-D.ietf-ippm-ioam-flags]
Mizrahi, T., Brockners, F., Bhandari, S., Sivakolundu, R.,
Pignataro, C., Kfir, A., Gafni, B., Spiegel, M., and J.
Lemon, "In-situ OAM Flags", draft-ietf-ippm-ioam-flags-00
(work in progress), October 2019.
[I-D.ietf-tsvwg-udp-options]
Touch, J., "Transport Options for UDP", draft-ietf-tsvwg-
udp-options-08 (work in progress), September 2019.
[RFC0793] Postel, J., "Transmission Control Protocol", STD 7,
RFC 793, DOI 10.17487/RFC0793, September 1981,
<https://www.rfc-editor.org/info/rfc793>.
[RFC2460] Deering, S. and R. Hinden, "Internet Protocol, Version 6
(IPv6) Specification", RFC 2460, DOI 10.17487/RFC2460,
December 1998, <https://www.rfc-editor.org/info/rfc2460>.
Even, et al. Expires August 7, 2020 [Page 9]
Internet-Draft DC Fast Congestion February 2020
[RFC5927] Gont, F., "ICMP Attacks against TCP", RFC 5927,
DOI 10.17487/RFC5927, July 2010,
<https://www.rfc-editor.org/info/rfc5927>.
[RFC8085] Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage
Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085,
March 2017, <https://www.rfc-editor.org/info/rfc8085>.
[RoCEv2] "Infiniband Trade Association. Supplement to InfiniBand
architecture specification volume 1 release 1.2.2 annex
A17: RoCEv2 (IP routable RoCE).",
<https://cw.infinibandta.org/document/dl/7781>.
Authors' Addresses
Roni Even
Huawei
Email: roni.even@huawei.com
Mengzhu Liu
Huawei
Email: liumengzhu@huawei.com
Yali Zhang
Huawei
Email: zhangyali369@huawei.com
Even, et al. Expires August 7, 2020 [Page 10]