INTERNET-DRAFT F. Chen
Intended Status: Informational W. Sun
Expires: Feb 09, 2019 Huawei Technologies
Aug 10, 2018
Problem Statement of RoCEv2 Congestion Management
draft-chen-nfsv4-rocev2-cm-problem-statement-00
Abstract
On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
protocol. The RoCEv2 specification does not define congestion
management and load balancing methods. RoCEv2 relies on the existing
link-layer flow control of IEEE 802.1Qbb (Priority-based Flow Control,
PFC) to provide a lossless network. RoCEv2 Congestion Management (RCM)
uses ECN (Explicit Congestion Notification, defined in RFC 3168) to
signal congestion to the destination. The resulting congestion
notification is used to reduce the injection rate at the sender, and
the injection rate is increased again when congestion eases. More and
more congestion management practices for RoCEv2 are appearing in the
industry, such as DCQCN (Data Center Quantized Congestion
Notification). There is demand for a new RoCE protocol (temporary
alias RoCEv3) that provides stronger congestion management and load
balancing mechanisms for RDMA deployment in modern datacenters.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress".
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
<Chen, et al.> Expires <Feb 9, 2019> [Page 1]
INTERNET DRAFT <Problem Statement of RCM> <Aug 10, 2019>
Copyright and License Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3 Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . 3
4 Problem statement & requirements . . . . . . . . . . . . . . . 4
5 Current Congestion Management for RoCEv2 . . . . . . . . . . . 4
5.1 PFC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
5.2 ECN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
6 Congestion Management Practice . . . . . . . . . . . . . . . . 5
6.1 Packet Retransmission . . . . . . . . . . . . . . . . . . . 5
6.2 Congestion Control Mechanisms . . . . . . . . . . . . . . . 5
6.3 Re-ordering . . . . . . . . . . . . . . . . . . . . . . . . 6
6.4 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . 6
7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
8 Security Considerations . . . . . . . . . . . . . . . . . . . . 7
9 IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 7
10 References . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1 Introduction
With the emergence of distributed storage, AI/HPC, machine learning,
and similar workloads, modern datacenter applications demand high
throughput (40 Gbps and above) and ultra-low latency (less than 10
microseconds per hop) from the network, with low CPU overhead. Remote
Direct Memory Access (RDMA) can meet these needs on Ethernet.
On IP-routed datacenter networks, RDMA is deployed using the RoCEv2
protocol. RoCEv2 is a straightforward extension of the RoCE protocol
that involves a simple modification of the RoCE packet format. RoCEv2
packets carry an IP header, which allows traversal of IP L3 routers,
and a UDP header, which serves as a stateless encapsulation layer for
the RDMA transport protocol packets over IP[1].
RoCEv2 Congestion Management (RCM) provides the capability to avoid
congestion hot spots and optimize the throughput of the fabric. RCM
relies on the existing link-layer flow control of IEEE 802.1Qbb (PFC)
to provide a drop-free network. RCM also uses ECN (RFC 3168) to signal
congestion to the destination. The resulting congestion notification
is used to reduce the injection rate at the sender, and the injection
rate is increased again when congestion eases.
More and more congestion management practices for RoCEv2 are
appearing in the industry, such as DCQCN. This raises the question of
whether a next-generation RoCE protocol (alias RoCEv3) with stronger
congestion management and load balancing mechanisms should be
developed for RDMA deployment in modern datacenters.
2 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
3 Abbreviations
RCM - RoCEv2 Congestion Management
PFC - Priority-based Flow Control
ECN - Explicit Congestion Notification
DCQCN - Data Center Quantized Congestion Notification
AI/HPC - Artificial Intelligence/High-Performance computing
ECMP - Equal-Cost Multipath
4 Problem statement & requirements
Network congestion occurs in a switch when the incoming traffic
exceeds the bandwidth of the outgoing link on which it has to be
transmitted. Congestion is the primary source of packet loss in the
network and leads to dramatic performance degradation.
Generally, RoCEv2 relies on the link-layer flow control of IEEE
802.1Qbb (PFC) to provide a lossless underlying network. Lossless
networks implement a flow control mechanism that pauses the traffic
on the incoming link before the buffer overflows, and thereby
prevents packets from being dropped[2]. However, PFC can lead to poor
application performance due to problems such as head-of-line blocking
and unfairness[3]. To avoid the problems introduced by PFC, another
line of research investigates congestion control mechanisms over
lossy networks.
A protocol with stronger congestion management capability is needed
to achieve high throughput and low latency in large-scale datacenter
networks while placing more flexible requirements on the underlay
network. Interoperability among the industry practices is also
required.
5 Current Congestion Management for RoCEv2
5.1 PFC
RDMA is deployed using the RoCEv2 protocol, which relies on IEEE
802.1Qbb Priority-based Flow Control (PFC) to enable a drop-free
network.
PFC is a link-level protocol that allows a receiver to assert flow
control by telling the transmitter to pause sending traffic of a
specified priority. However, because PFC stops all traffic of a
particular traffic class at the ingress port, flows in that class
that are destined to other, uncongested ports are blocked as well.
The known problems of PFC are head-of-line blocking, unfairness, and
deadlock[4].
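The pause behavior and the resulting head-of-line blocking can be
illustrated with a minimal sketch. The class IngressQueue, the
function upstream_may_send, and the XOFF/XON threshold values are
hypothetical names and numbers chosen for illustration; they are not
taken from IEEE 802.1Qbb or from any switch implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class IngressQueue:
    """One priority class on one ingress port (hypothetical model)."""
    xoff: int = 8          # assumed threshold: pause upstream at this depth
    xon: int = 4           # assumed threshold: resume upstream at this depth
    paused: bool = False
    packets: deque = field(default_factory=deque)

    def enqueue(self, pkt) -> None:
        self.packets.append(pkt)
        if len(self.packets) >= self.xoff:
            # A real switch would send a PFC pause frame for this priority.
            self.paused = True

    def dequeue(self):
        pkt = self.packets.popleft()
        if self.paused and len(self.packets) <= self.xon:
            # A real switch would send a pause frame with pause time zero.
            self.paused = False
        return pkt


def upstream_may_send(queue: IngressQueue, flow: str) -> bool:
    """The pause applies to the whole priority class, not to one flow.

    The flow argument is deliberately ignored: a flow heading to an idle
    egress port is blocked together with the flow that caused congestion.
    """
    return not queue.paused
```

In this model a flow that shares the priority class is stopped even
if its own egress port is idle, which is the head-of-line blocking
described above.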
5.2 ECN
Explicit Congestion Notification (ECN) enables end-to-end congestion
notification between two endpoints on TCP/IP-based networks. ECN
notifies the endpoints about congestion, with the goal of reducing
packet loss and delay: the sending device decreases its transmission
rate until the congestion clears, without packets being dropped. ECN
is defined in RFC 3168, "The Addition of Explicit Congestion
Notification (ECN) to IP".
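As a minimal sketch of the marking step, the following uses the four
ECN codepoints that RFC 3168 places in the two low-order bits of the
former IPv4 TOS byte. The function mark_on_congestion and its single
queue-depth threshold are illustrative assumptions; real switches
typically apply a probabilistic marking profile rather than one fixed
threshold.

```python
# ECN codepoints from RFC 3168 (two low-order bits of the TOS byte).
NOT_ECT = 0b00   # sender is not ECN-capable
ECT_1 = 0b01     # ECN-capable transport, codepoint ECT(1)
ECT_0 = 0b10     # ECN-capable transport, codepoint ECT(0)
CE = 0b11        # congestion experienced


def ecn_bits(tos: int) -> int:
    """Extract the ECN field from a TOS / traffic class byte."""
    return tos & 0b11


def mark_on_congestion(tos: int, queue_depth: int, threshold: int):
    """Return (new_tos, drop) for one packet (simplified, assumed policy).

    Below the threshold the packet passes unchanged. Above it, an
    ECN-capable packet is marked CE instead of being dropped, while a
    packet that is not ECN-capable can only be signalled by a drop.
    """
    if queue_depth <= threshold:
        return tos, False
    if ecn_bits(tos) == NOT_ECT:
        return tos, True
    return tos | CE, False
```

The DSCP bits in the upper six bits of the byte are left untouched by
the marking.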
6 Congestion Management Practice
6.1 Packet Retransmission
RDMA NICs were not designed to deal with losses efficiently. The
receiver discards out-of-order packets, and the sender performs
go-back-N retransmission on detecting packet loss. Because RoCEv2
adopts go-back-N loss recovery, it needs a lossless layer 2 (provided
by PFC) for good performance[5].
If a new RDMA protocol does not rely on a lossless layer 2 network,
an efficient method of packet retransmission is necessary.
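The cost of go-back-N can be made concrete with a small counting
sketch. It assumes a deliberately simple model: each listed packet is
lost only on its first transmission, and the sender learns of a loss
only after in_flight packets (the lost one included) have been sent.
The function names and this model are illustrative assumptions, not
taken from the RoCEv2 specification.

```python
def go_back_n_transmissions(total, lost_first_try, in_flight):
    """Count packets put on the wire under go-back-N (simplified model)."""
    lost = set(lost_first_try)
    sent = 0
    next_seq = 0
    while next_seq < total:
        if next_seq in lost:
            # The lost packet and everything sent behind it before the
            # loss is detected are discarded by the receiver.
            wasted = min(in_flight, total - next_seq)
            sent += wasted
            # Packets in the wasted window have now used their first try.
            lost -= set(range(next_seq, next_seq + wasted))
            # next_seq is unchanged: the sender goes back and resends it.
        else:
            sent += 1
            next_seq += 1
    return sent


def selective_repeat_transmissions(total, lost_first_try):
    """Count packets on the wire if only lost packets are resent."""
    return total + len(set(lost_first_try))
```

With 10 packets, one loss at sequence number 3, and 4 packets in
flight, the go-back-N model sends 14 packets where selective
retransmission sends 11. The gap grows with the amount of data in
flight, which is why loss is expensive at high bandwidth.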
6.2 Congestion Control Mechanisms
6.2.1 RTT-based Congestion Control
A typical example of RTT-based congestion control is TIMELY[6], a
delay-based congestion control protocol for use in the datacenter.
It shows that simple packet delay, measured as round-trip time at the
hosts, is an effective congestion signal that does not require switch
feedback. TIMELY measures RTT with microsecond accuracy, and these
RTT measurements are sufficient to estimate switch queueing. TIMELY
adjusts transmission rates using RTT gradients to keep packet latency
low while delivering high bandwidth.
Because the RDMA transport is implemented in the NIC and is sensitive
to packet drops, drops hurt performance badly and PFC remains
necessary. That is to say, TIMELY needs PFC to provide a lossless
underlay network.
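The gradient-based rate adjustment can be sketched as follows. This
is a simplified reading of the algorithm structure in [6]: it omits
the hyperactive increase mode, the parameter values are arbitrary
illustrations rather than recommended settings, and the lower bound
on the multiplicative decrease factor is an added assumption that
keeps the sketch numerically safe. RTT values are in microseconds.

```python
from dataclasses import dataclass


@dataclass
class TimelyState:
    rate: float             # current sending rate (arbitrary units)
    prev_rtt: float         # previous RTT sample
    rtt_diff: float = 0.0   # smoothed difference between RTT samples


def timely_update(s, new_rtt, *, min_rtt=20.0, t_low=50.0, t_high=500.0,
                  alpha=0.5, beta=0.8, delta=10.0):
    """Apply one RTT sample and return the new rate (illustrative values)."""
    new_diff = new_rtt - s.prev_rtt
    s.prev_rtt = new_rtt
    # Exponentially weighted moving average of the RTT difference.
    s.rtt_diff = (1 - alpha) * s.rtt_diff + alpha * new_diff
    gradient = s.rtt_diff / min_rtt        # normalized RTT gradient

    if new_rtt < t_low:
        s.rate += delta                    # queues are short: increase
    elif new_rtt > t_high:
        s.rate *= 1 - beta * (1 - t_high / new_rtt)   # RTT too high
    elif gradient <= 0:
        s.rate += delta                    # RTT falling: increase
    else:
        # RTT rising: decrease in proportion to the gradient. The 0.5
        # floor is an assumption of this sketch, not part of TIMELY.
        s.rate *= max(0.5, 1 - beta * gradient)
    return s.rate
```

A rising RTT produces a positive gradient and a multiplicative
decrease; a flat or falling RTT produces an additive increase.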
6.2.2 Credit-based Congestion Control
ExpressPass[7] is an end-to-end, credit-scheduled, delay-bounded
congestion control scheme for datacenters. ExpressPass uses credit
packets to control congestion before data packets are sent, which
enables it to achieve bounded delay and fast convergence. It uses
end-to-end credit transfer for bandwidth allocation and fine-grained
packet scheduling.
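The core idea, that data is sent only in response to credits and that
credits are limited to what the bottleneck can carry, can be sketched
as below. The function throttle_credits and its round-robin admission
are hypothetical simplifications for illustration; they do not
reproduce the credit rate limiting or feedback control described in
[7].

```python
from typing import Dict


def throttle_credits(requested: Dict[str, int], budget: int) -> Dict[str, int]:
    """Admit at most `budget` credits per interval, shared round-robin.

    `requested` maps a flow name to the credits its receiver sent in
    this interval. Each admitted credit lets the sender emit one data
    packet, so the data arriving at the bottleneck in the reverse
    direction never exceeds `budget` packets.
    """
    granted = {flow: 0 for flow in requested}
    remaining = budget
    while remaining > 0:
        progressed = False
        for flow in requested:
            if remaining > 0 and granted[flow] < requested[flow]:
                granted[flow] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break   # every flow already has all the credits it asked for
    return granted
```

For example, with requests of 10, 10, and 2 credits and a budget of
12, the flows are granted 5, 5, and 2 credits: the small flow is fully
served and the remainder is split evenly.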
6.2.3 ECN-based congestion control
Data Center Quantized Congestion Notification (DCQCN)[3] is an
end-to-end congestion control scheme for RoCEv2. DCQCN combines ECN
and PFC to support end-to-end lossless Ethernet. The idea behind
DCQCN is to let ECN perform flow control by decreasing the
transmission rate at the sender when congestion starts, thereby
minimizing how often PFC is triggered.
Although the RoCEv2 standard[1] does not list DCQCN as the RCM
mechanism, it is widely used in industry practice.
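The sender-side reaction can be sketched as follows. This is a
simplified reading of the rate decrease and fast recovery steps
described in [3]; the later additive and hyper increase stages are
omitted, and the class and method names as well as the timer
structure are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    rate: float            # current rate
    target: float          # target rate remembered from before the cut
    alpha: float = 1.0     # estimate of the congestion extent
    g: float = 1.0 / 256   # gain used to update alpha

    def on_cnp(self) -> None:
        """A Congestion Notification Packet arrived: cut the rate."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No notification during the timer period: decay alpha."""
        self.alpha *= 1 - self.g

    def on_increase_timer(self) -> None:
        """Fast recovery: move halfway back toward the target rate."""
        self.rate = (self.target + self.rate) / 2
```

With alpha at its initial value of 1, the first notification halves
the rate, and each subsequent recovery step closes half of the
remaining gap to the rate that was in use before the cut.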
6.3 Re-ordering
When packets arrive at the destination out of order, the destination
should store them until the original order can be restored. The
destination should assign dedicated buffer resources to perform
re-ordering. Re-ordering can be implemented in many ways, either on
the switch or on the NIC. The details are out of scope for this
document.
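One possible shape of such a buffer is sketched below. The class
ReorderBuffer, its capacity limit, and its policy of dropping an
out-of-order packet when the buffer is full are hypothetical choices
for illustration and do not describe any particular NIC or switch.

```python
class ReorderBuffer:
    """Hold out-of-order packets and release them in sequence order."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # max out-of-order packets held
        self.expected = 0          # next sequence number to deliver
        self.pending = {}          # seq -> payload, waiting for a gap

    def receive(self, seq: int, payload):
        """Accept one packet; return the payloads that become deliverable."""
        if seq < self.expected or seq in self.pending:
            return []              # duplicate, ignore
        if seq != self.expected and len(self.pending) >= self.capacity:
            return []              # buffer full: drop (assumed policy)
        self.pending[seq] = payload
        released = []
        # Drain every packet that is now contiguous with the delivered prefix.
        while self.expected in self.pending:
            released.append(self.pending.pop(self.expected))
            self.expected += 1
        return released
```

The required capacity grows with the bandwidth and with the latency
difference between paths, which is why the buffer resource has to be
provisioned explicitly.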
6.4 Load Balancing
6.4.1 ECMP
RoCEv2 packets carry an opaque flow identifier in the UDP Source Port
field, which ECMP uses to implement path selection for load balancing
and to improve utilization of the fabric topology.
Traditional ECMP cannot balance load well in the datacenter network
because it splits load at the granularity of a flow, so a few large
flows hashed onto the same path can congest it while other paths stay
idle. The finer the granularity of load balancing, the more effective
the load balancing is and the higher the achievable utilization of
network bandwidth.
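Flow-granularity path selection can be sketched with a hash over the
packet's addressing fields. UDP destination port 4791 is the
registered RoCEv2 port; the use of CRC32 as the hash, the key layout,
and the function names are assumptions of this sketch, since real
switches use vendor-specific hash functions.

```python
import zlib

ROCEV2_UDP_DPORT = 4791   # UDP destination port assigned to RoCEv2
UDP_PROTO = 17            # IP protocol number for UDP


def ecmp_path(src_ip, dst_ip, src_port, n_paths,
              dst_port=ROCEV2_UDP_DPORT, proto=UDP_PROTO):
    """Pick one of n_paths from the 5-tuple (illustrative hash only)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths


def path_loads(flows, n_paths):
    """Sum bytes per path for (src_ip, dst_ip, src_port, size) tuples."""
    loads = [0] * n_paths
    for src_ip, dst_ip, src_port, size in flows:
        loads[ecmp_path(src_ip, dst_ip, src_port, n_paths)] += size
    return loads
```

Every packet of a flow maps to the same path because the key does not
change during the flow's lifetime. The load per path therefore
depends on which flows happen to hash together, not on how much
traffic each path already carries.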
6.4.2 Flowlet
A typical flowlet-based load balancing mechanism is CONGA[8]. CONGA
is a network-based, distributed, congestion-aware load balancing
mechanism for datacenters. It splits TCP flows into flowlets,
estimates real-time congestion on fabric paths, and allocates
flowlets to paths based on feedback from remote switches.
Flowlets are bursts of packets from a flow that are separated by an
idle interval larger than the maximum difference in latency among the
paths. The second burst can therefore be sent along a different path
than the first without reordering packets.
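Flowlet detection by idle gap can be sketched as below. The class
FlowletTable and the pick_path callback are hypothetical: CONGA
chooses the least congested path from switch feedback, whereas this
sketch leaves the choice to whatever callable the caller supplies.

```python
class FlowletTable:
    """Track the last packet time and current path of each flow."""

    def __init__(self, gap, pick_path):
        self.gap = gap               # idle time that starts a new flowlet
        self.pick_path = pick_path   # callable returning a path for a new flowlet
        self.table = {}              # flow_id -> (last_seen, path)

    def route(self, flow_id, now):
        """Return the path for a packet of flow_id arriving at time now."""
        entry = self.table.get(flow_id)
        if entry is None or now - entry[0] > self.gap:
            # First packet, or the flow was idle longer than the gap:
            # a new flowlet starts and may take a different path.
            path = self.pick_path()
        else:
            path = entry[1]          # same flowlet, keep the path
        self.table[flow_id] = (now, path)
        return path
```

The gap must exceed the maximum latency difference among the paths;
otherwise a packet of the new flowlet could overtake the tail of the
previous one.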
6.4.3 Per-packet
Per-packet load balancing distributes load most evenly because it
operates at the smallest granularity. The consequence is that packets
belonging to the same flow are allocated to different paths. When the
forwarding delays of the paths differ, packets may arrive at the
receiver out of order.
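The reordering effect can be reproduced with a few lines. The
function arrival_order and its round-robin spraying over paths with
fixed one-way delays are illustrative assumptions, not a model of any
particular switch.

```python
def arrival_order(n_packets, path_delays, send_interval=1.0):
    """Return sequence numbers in the order they reach the receiver.

    Packet `seq` is sent at time seq * send_interval on path
    seq % len(path_delays) and arrives after that path's delay.
    """
    arrivals = []
    for seq in range(n_packets):
        path = seq % len(path_delays)
        arrivals.append((seq * send_interval + path_delays[path], seq))
    arrivals.sort()   # order by arrival time, ties broken by sequence number
    return [seq for _, seq in arrivals]
```

With two paths of delay 5.0 and 1.0, four packets arrive in the order
1, 3, 0, 2. With equal delays the order is preserved.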
7 Summary
Emerging RoCE-based applications are driving the deployment of
different congestion management mechanisms in modern large-scale
datacenter networks. This problem statement does not cover all
mainstream mechanisms and will need to be extended when considering a
future RoCE protocol (temporary alias RoCEv3) with robust congestion
management capability and more flexible requirements on the layer 2
network, which might be the next direction.
8 Security Considerations
This document does not introduce any additional security constraints.
9 IANA Considerations
TBD
10 References
[1] InfiniBand Trade Association. "Supplement to InfiniBand
Architecture Specification Volume 1 Release 1.2.2 Annex A17: RoCEv2
(IP Routable RoCE)." 2014.
[2] "Understanding RoCEv2 Congestion Management."
https://community.mellanox.com/docs/DOC-2321
[3] Zhu, Yibo, et al. "Congestion Control for Large-Scale RDMA
Deployments." ACM SIGCOMM Computer Communication Review
45.5 (2015): 523-536.
[4] Hu, Shuihai, et al. "Deadlocks in Datacenter Networks: Why Do
They Form, and How to Avoid Them." ACM Workshop on Hot Topics in
Networks, 2016: 92-98.
[5] Mittal, Radhika, et al. "Revisiting Network Support for RDMA."
(2018).
[6] Mittal, Radhika, et al. "TIMELY: RTT-based Congestion Control for
the Datacenter." ACM Conference on Special Interest Group on Data
Communication (SIGCOMM), 2015: 537-550.
[7] Cho, Inho, D. Han, and K. Jang. "ExpressPass: End-to-End
Credit-based Congestion Control for Datacenters." (2016).
[8] Alizadeh, Mohammad, et al. "CONGA: Distributed Congestion-Aware
Load Balancing for Datacenters." ACM Conference on SIGCOMM,
2014: 503-514.
Authors' Addresses
Fei Chen
Huawei Technologies Co., Ltd.
Email: chenfei57@huawei.com
Wenhao Sun
Huawei Technologies Co., Ltd.
Email: sam.sunwenhao@huawei.com