Internet DRAFT - draft-chen-iccrg-rocev3-cm-requirements
draft-chen-iccrg-rocev3-cm-requirements
INTERNET-DRAFT F. Chen
Intended Status: Informational W. Sun
Expires: Sep 22, 2019 X. Yu
Huawei Technologies
Mar 21, 2019
Requirements for RoCEv3 Congestion Management
draft-chen-iccrg-rocev3-cm-requirements-00
Abstract
On IP-routed datacenter networks, RDMA is deployed using RoCEv2
protocol. RoCEv2 specification does not define the strong congestion
management mechanisms and load balancing methods. RoCEv2 relies on
the existing Link-Layer Flow-Control IEEE 802.1Qbb(Priority-based
Flow Control, PFC)to provide a lossless fabric. RoCEv2 Congestion
Management(RCM) use ECN(Explicit Congestion Notification, defined in
RFC3168) to signal the congestion to the destination and use the
congestion notification to reduce the rate of injection and increase
the injection rate when the extent of congestion decreases. More and
more practice of congestion management for RoCEv2 appear in the
industry, such as DCQCN(Data Center Quantized Congestion
Notification). There is a demanding for the new RoCEv3 protocol to
provide stronger congestion management and load balancing mechanisms
for RDMA deployment in modern datacenter.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress".
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
<Chen, et al.> Expires <Sep 22, 2019> [Page 1]
INTERNET DRAFT <RoCEv3 CM Requirements> <Mar 21, 2019>
Copyright and License Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3 Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . 3
4 RoCEv3 congestion management requirements . . . . . . . . . . . 4
5 Current Congestion Management for RoCEv2 . . . . . . . . . . . 4
5.1 PFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
5.2 ECN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
6. Congestion Management Practice . . . . . . . . . . . . . . . . 5
6.1 Packet Retransmission . . . . . . . . . . . . . . . . . . . 5
6.2 Congestion Control Mechanisms . . . . . . . . . . . . . . . 5
6.3 Re-ordering . . . . . . . . . . . . . . . . . . . . . . . . 6
6.4 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . 6
7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
8 Security Considerations . . . . . . . . . . . . . . . . . . . . 7
9 IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 7
10 References . . . . . . . . . . . . . . . . . . . . . . . . . . 7
[EVILBIT] Bellovin, S., "The Security Flag in the IPv4 Header",
RFC 3514, April 1 2003. . . . . . . . . . . . . . . . n
<Chen, et al.> Expires <Sep 22, 2019> [Page 2]
INTERNET DRAFT <RoCEv3 CM Requirements> <Mar 21, 2019>
1 Introduction
With the emerging Distributed Storage, AI/HPC, Machine Learning,
etc., modern datacenter applications demand high throughput(40Gbps
and above) with ultra-low latency of < 10 microsecond per hop from
the network, with low CPU overhead. Remote Direct Memory Access
(RDMA) can meet these needs on Ethernet.
On IP-routed datacenter networks, RDMA is deployed using RoCEv2
protocol. RoCEv2 is a straightforward extension of the RoCE protocol
that involves a simple modification of the RoCE packet format. RoCEv2
packets carry an IP header which allows traversal of IP L3 Routers
and a UDP header that serves as a stateless encapsulation layer for
the RDMA Transport Protocol Packets over IP[1].
RoCEv2 Congestion Management (RCM) provides the capability to avoid
congestion hot spots and optimize the throughput of the fabric. RCM
relies on the existing Link-Layer Flow-Control IEEE 802.1Qbb(PFC) to
provide a drop free network. RoCEv2 Congestion Management(RCM) also
use ECN(RFC3168) to signal the congestion to the destination and use
the congestion notification to reduce the rate of injection and
increase the injection rate when the extent of congestion decreases.
More and more practice of congestion management for RoCEv2 appear in
the industry, such as DCQCN, etc. Shall we consider to develop next
Generation RoCE protocol(alias RoCEv3) with stronger congestion
management and load balancing mechanisms for RDMA deployment in
modern datacenter?
2 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
3 Abbreviations
RCM - RoCEv2 Congestion Management
PFC - Priority-based Flow Control
ECN - Explicit Congestion Notification
DCQCN - Data Center Quantized Congestion Notification
AI/HPC - Artificial Intelligence/High-Performance computing
ECMP - Equal-Cost Multipath
<Chen, et al.> Expires <Sep 22, 2019> [Page 3]
INTERNET DRAFT <RoCEv3 CM Requirements> <Mar 21, 2019>
4 RoCEv3 congestion management requirements
Network congestion happens in the network switches when the incoming
traffic is larger than the bandwidth of the outgoing link on which it
has to be transmitted. Congestion is the primary source of loss and
in the network, congestion leads to dramatic performance degradation.
Generally, RoCEv2 relies on Link-Layer Flow-Control IEEE
802.1Qbb(PFC) to provide a lossless underlying networks. Lossless
networks implement a mechanism of flow control, which pauses the
traffic with priority granularity in the incoming link before the
buffer overfills, and by that prevents case of dropping packets[2].
However, PFC can lead to poor application performance due to problems
like head-of-line blocking and unfairness[3]. In order to avoid the
problems involved by PFC, there is another faction research on the
congestion control mechanisms over the lossy network.
We need a kind of protocol temporarily named RoCEv3 with stronger
capability of congestion management to achieve the high throughput
and low latency in the large-scale datacenter network with more
flexible requirement to the underlay network. The interoperability is
also required among the industry practice.
5 Current Congestion Management for RoCEv2
5.1 PFC
RDMA is deployed using the RoCEv2 protocol, which relies on IEEE
802.1Qbb Priority-based Flow Control (PFC) to enable a drop-free
network.
PFC is a link level protocol that allows a receiver to assert flow
control telling the transmitter to pause sending traffic for a
specified priority. However, because PFC will stop all traffic in a
particular traffic class at the ingress port, the flows destined to
other ports will also be blocked.
The known problems of PFC are head-of-line blocking, unfairness,
deadlock[4].
5.2 ECN
Explicit congestion notification (ECN) enables end-to-end congestion
notification between two endpoints on TCP/IP based networks. ECN
notifies networks about congestion with the goal of reducing packet
loss and delay by making the sending device decrease the transmission
rate until the congestion clears, without dropping packets. RFC 3168,
The Addition of Explicit Congestion Notification (ECN) to IP, defines
<Chen, et al.> Expires <Sep 22, 2019> [Page 4]
INTERNET DRAFT <RoCEv3 CM Requirements> <Mar 21, 2019>
ECN.
6. Congestion Management Practice
6.1 Packet Retransmission
NICs were not designed to deal with losses efficiently. Receiver
discards out-of-order packets. Sender does go-back-N on detecting
packet loss. RoCEv2 adopt Go-back-N loss recovery and needs lossless
layer 2 (by using PFC) for good performance[5].
If new RDMA protocol does not rely on the lossless layer 2 network,
an efficient method of Packet Retransmission is necessary.
6.2 Congestion Control Mechanisms
6.2.1 RTT-based Congestion Control
The typical practice of RTT based Congestion Control is TIMELY[6]. It
introduces the simple packet delay, measured as round-trip times at
hosts, is an effective congestion signal without the need for switch
feedback. TIMELY measures RTT with microsecond accuracy, and that
these RTTs are sufficient to estimate switch queueing. TIMELY can
adjust transmission rates using RTT gradients to keep packet latency
low while delivering high bandwidth. TIMELY is a delay-based
congestion control protocol for use in the datacenter.
Because the RDMA transport is in the NIC and sensitive to packet
drops, so PFC is necessary because drops hurt performance badly. That
is to say TIMELY needs PFC to provide lossless underlay network.
6.2.2 Credit-based Congestion Control
ExpressPass[7] is an end-to-end credit-scheduled, delay-bounded
congestion control for datacenters. ExpressPass uses credit packets
to control congestion even before sending data packets, which enables
to achieve bounded delay and fast convergence. It uses end-to-end
credit transfer for bandwidth allocation and fine-grained packet
scheduling.
6.2.3 ECN-based Congestion Control
Data Center Quantized Congestion Notification (DCQCN)[3] is an end-
to-end congestion control scheme for RoCEv2. DCQCN is a combination
of ECN and PFC to support end-to-end lossless Ethernet. The idea
behind DCQCN is to allow ECN to do flow control by decreasing the
transmission rate at the sender when congestion starts, thereby
minimizing the time PFC is triggered.
<Chen, et al.> Expires <Sep 22, 2019> [Page 5]
INTERNET DRAFT <RoCEv3 CM Requirements> <Mar 21, 2019>
Although RoCEv2 standard[1] does not list DCQCN as the RCM mechanism,
but it is widely used in the industry practice.
6.3 Re-ordering
When the packets arrive at the destination out-of-order, the
destination should store the packets to restore the order.
Destination should assign special buffer resource to perform re-
ordering. There are many methods to implement the re-ordering either
on switch or on NIC side. Here will not go into the details.
6.4 Load Balancing
6.4.1 ECMP
RoCEv2 packets use an opaque flow identifier in their UDP Source Port
field for ECMP method to implement path selection mechanisms for load
balancing and improve utilization of the fabric topology.
Traditional ECMP can not balance loads well in the data center
network because it splits loads at the granularity of flow.
The finer the granularity of load balancing, the more effective the
load balancing is and the higher the utilization of network bandwidth
can be achieved.
6.4.2 Flowlet
The typical Flowlet-based load balancing is CONGA[8]. CONGA is a
network-based distributed congestion-aware load balancing mechanism
for datacenters. It splits TCP flows into flowlets, estimates real-
time congestion on fabric paths, and allocates flowlets to paths
based on feedback from remote switches.
Flowlets are bursts of packets from a flow. The idle interval between
two bursts of packets is larger than the maximum difference in
latency among the paths. So the second burst can be sent along a
different path than the first without reordering packets.
6.4.3 Per-packet
The effect of packet-based load balancing is the best because the
corresponding granularity is the smallest. The consequence is that
packets belonging to the same flow will be allocated to different
paths. When the forwarding delays of paths are different, it is
possible that packets may arrive at the receiver out-of-order.
7 Summary
<Chen, et al.> Expires <Sep 22, 2019> [Page 6]
INTERNET DRAFT <RoCEv3 CM Requirements> <Mar 21, 2019>
The new emerging RoCE based applications urge the practice of
different congestion management mechanisms to be practiced in kinds
of modern large-scale datacenter network. In this problem statement,
not all the mainstream mechanisms are introduced. It is still needed
to extend when considering the future RoCE protocol temporary named
RoCEv3 with robust congestion management capability and more flexible
requirement on layer 2 network which might be the next direction.
8 Security Considerations
This document does not introduce any additional security constraints.
9 IANA Considerations TBD
10 References
[1] Infiniband Trade Association. Supplement to InfiniBand
architecture specification volume 1 release 1.2.2 annex A17: RoCEv2
(IP routable RoCE), 2014.
[2] Understanding RoCEv2 Congestion Management,
https://community.mellanox.com/docs/DOC-2321
[3] Zhu, Yibo, et al. "Congestion Control for Large-Scale RDMA
Deployments." Acm Sigcomm Computer Communication Review
45.5(2015):523-536.
[4] Hu, Shuihai, et al. "Deadlocks in Datacenter Networks: Why Do
They Form, and How to Avoid Them." The, ACM Workshop ACM, 2016:92-98.
[5] Mittal, Radhika, et al. "Revisiting Network Support for RDMA."
(2018).
[6] Mittal, Radhika, et al. "TIMELY: RTT-based Congestion Control for
the Datacenter." ACM Conference on Special Interest Group on Data
Communication ACM, 2015:537-550.
[7] Cho, Inho, D. Han, and K. Jang. "ExpressPass: End-to-End Credit-
based Congestion Control for Datacenters." (2016).
[8] Alizadeh, Mohammad, et al. "CONGA: distributed congestion-aware
load balancing for datacenters." ACM Conference on SIGCOMM ACM,
2014:503-514.
<Chen, et al.> Expires <Sep 22, 2019> [Page 7]
INTERNET DRAFT <RoCEv3 CM Requirements> <Mar 21, 2019>
[KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC1776] Crocker, S., "The Address is the Message", RFC 1776, April
1 1995.
[TRUTHS] Callon, R., "The Twelve Networking Truths", RFC 1925,
April 1 1996.
[EVILBIT] Bellovin, S., "The Security Flag in the IPv4 Header",
RFC 3514, April 1 2003.
[RFC5513] Farrel, A., "IANA Considerations for Three Letter
Acronyms", RFC 5513, April 1 2009.
[RFC5514] Vyncke, E., "IPv6 over Social Networks", RFC 5514, April 1
2009.
Authors' Addresses
Fei Chen
Huawei Technologies Co., Ltd.
Email: chenfei57@huawei.com
Wenhao Sun
Huawei Technologies Co., Ltd.
Email: sam.sunwenhao@huawei.com
Xiang Yu
Huawei Technologies Co., Ltd.
Email: yolanda.yu@huawei.com
<Chen, et al.> Expires <Sep 22, 2019> [Page 8]