TSVWG                                                          Y. Zhuang
Internet-Draft                                                  R. Huang
Intended status: Informational             Huawei Technologies Co., Ltd.
Expires: January 5, 2020                                    July 4, 2019
An Open Congestion Control Architecture with network cooperation for
RDMA Fabric
draft-zhh-tsvwg-open-architecture-00
Abstract
This document describes an open congestion control architecture with
network cooperation (including network proactive and passive control)
for high performance RDMA fabrics, to provide low latency and high
throughput for datacenter applications such as AI computing.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 5, 2020.
Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Zhuang & Huang Expires January 5, 2020 [Page 1]
Internet-Draft Open architecture July 2019
Table of Contents
1. Introduction
2. Conventions
3. Abbreviations
4. Design Principle for high performance RDMA fabric
5. Architecture Overview
  5.1. Roles and Functionalities
    5.1.1. Sender NIC
    5.1.2. Switch
    5.1.3. Receiver NIC
  5.2. Interfaces
    5.2.1. NIC interfaces
    5.2.2. Network interface
6. Compatibility Consideration
  6.1. Negotiate the congestion control capability
  6.2. Co-exist with current NIC to NIC control channel
7. Security Considerations
8. Manageability Consideration
9. IANA Considerations
10. References
  10.1. Normative References
  10.2. Informative References
Authors' Addresses
1. Introduction
Traditionally, RDMA (Remote Direct Memory Access) has run over closed
and expensive InfiniBand (IB) [IB] networks. However, because of the
limited network scalability and high cost of IB, RDMA traffic is
moving to IP/Ethernet as its underlay network for better scale and
lower cost. Supporting RDMA over IP/Ethernet with lower-priced NICs
and switches while keeping latency low is important for low latency
and high throughput datacenter applications such as AI computing.
As such, datacenter networks (DCNs) nowadays not only carry traffic
for tenants using the TCP/IP protocol stack, but are also required to
carry RDMA traffic for High Performance Computing (HPC) and
distributed storage access applications, which require low latency
and high throughput. As a result, there are more stringent
requirements on the basic performance of the DCN.
[Requirement] discusses the major problems of current RDMA fabric
technologies and the requirements for better performance. [HPC]
presents the problems of current RDMA fabrics from a cloud operator's
perspective. Based on these, this document proposes an open
congestion control architecture of hosts and networks with network
cooperation (including network proactive and passive control) for
high performance RDMA fabrics to provide better congestion control.
The scalability and compatibility of congestion control under the
proposed architecture are also discussed, so that current RDMA
technologies can be upgraded incrementally.
Discussions of new congestion control algorithms and improved active
queue management (AQM) are out of scope for this document.
2. Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
3. Abbreviations
IB - InfiniBand
HPC - High Performance Computing
ECN - Explicit Congestion Notification
AI/HPC - Artificial Intelligence/High Performance Computing
RDMA - Remote Direct Memory Access
NIC - Network Interface Card
AQM - Active Queue Management
NP - Notification Point
CP - Congestion Point
RP - Reaction Point
4. Design Principle for high performance RDMA fabric
Based on [Requirement] and [HPC], the architecture design should
follow these principles:
o Can adopt enhancements to provide better performance than existing
  technologies, such as lower latency, faster convergence, and better
  handling of packet loss.
o Can support both RoCEv2 and iWARP [RFC5040] as RDMA transports.
o Can support a mixture of RDMA traffic and normal TCP traffic.
o Can provide better interoperability between vendors while keeping
  flexibility.
o Does not modify, or makes only limited modifications to, the RDMA
  data plane.
o Is compatible with legacy devices.
o Makes it easy to deploy new congestion control algorithms.
5. Architecture Overview
The architecture is shown in Figure 1. It is composed of hosts (i.e.,
sender/receiver NICs) and network nodes (i.e., switches).
        Sender(RP)                                       Receiver(NP)
''''''''''''''''''''''''''                        ''''''''''''''''''''''''''
'   +---+  +---+         '                        '    +---+  +---+        '
'   |CC1|  |CC1| ...     '                        '    |CC1|  |CC1| ...    '
'   +-*-+  +-*-+         '                        '    +-*-+  +-*-+        '
'     *      *           '                        '      *      *          '
'+----*------*---------+ '                        ' +----*------*---------+'
'| Congestion control  | '      Switch(CP and NP) ' | Congestion control  |'
'|       Engine        | '                        ' |       Engine        |'
'+---------------------+ '      ''''''''''''''''' ' +---------------------+'
'+--------++-----------+ '      ' +-----------+ ' ' +-----------++--------+'
'|rate-co ||net-control|<-------- |net-control| ' ' |net-control||rate-co |'
'|ntrol;  ||           | '      ' |           | ' ' |           ||ntrol;  |'
'|loss-re |+-----------+ '      ' +-----------+ ' ' +-----------+|loss-re |'
'|covery  |+-----------+ '      '               ' ' +-----------+|covery  |'
'|        ||nic-control|<........          <........|nic-control||        |'
'|        ||           | '      '               ' ' |           ||        |'
'+--------++-----------+ '      '               ' ' +-----------++--------+'
'+---------------------+ '      '               ' ' +---------------------+'
'|        data         |========>            ======>|        data         |'
'|                     | '      '               ' ' |                     |'
'+---------------------+ '      '               ' ' +---------------------+'
'                        '      '               ' '                        '
''''''''''''''''''''''''''      ''''''''''''''''' ''''''''''''''''''''''''''
<-------- Net2Nic control channel        ========> RDMA stream
<........ Nic2Nic control channel        ******** System APIs
Figure 1. The open congestion control architecture with network cooperation
The sender and the receiver are both NICs. In this architecture, each
NIC exposes two new interfaces: 1) an interface through which
operators install and manage congestion control algorithms. These
algorithms share local transmit function blocks, such as rate control
and loss recovery, which eases the deployment of new algorithms and
the management of different algorithms regardless of the detailed
hardware implementation; 2) an interface through which the
net-control module inside network nodes (e.g., switches) signals back
to senders, so that the collected information can be incorporated
into the local transmit control.
For the interface to network nodes, we introduce a new NET to NIC
control channel, in which the control message is initiated and sent
by the net-control module inside a switch instead of by the receiver.
Since most congestion happens on network nodes, the switch, noted as
the congestion point (CP) in Figure 1, is the point that is aware of
ongoing or expected congestion. The advantage is that the switch can
provide more accurate congestion information, and guidance on how to
prevent or resolve the congestion, based directly on the traffic
traversing it and the resources allocated on it.
The NIC to NIC control channel, shown as a dotted line, represents a
logical channel for legacy NIC to NIC control notification. It can
be, for example, the CNP message for RoCEv2 or flags/fields in TCP
headers for iWARP. The RDMA data stream is shown as a double line
and works as it does today. However, some extensions might be needed
to implement the new interfaces; these are out of scope for this
document.
5.1. Roles and Functionalities
5.1.1. Sender NIC
As the reaction point (RP) of the architecture, the sender NIC
deploys and manages congestion control algorithms based on system
configuration or on negotiation with remote NICs. When congestion
happens, it adjusts its sending rate according to the congestion
control algorithm in use and the feedback signaled by the network
nodes, the receiver NIC, or both.
5.1.2. Switch
The switch is the congestion point (CP). It detects network
congestion based on metrics such as queue length, measured latency on
the path, or traffic patterns it has learnt.
A legacy switch with ECN enabled can mark CE in the IP header of RDMA
traffic when congestion exists, to notify the receivers. When the
condition gets worse, it either uses Priority-based Flow Control (PFC)
or discards packets according to its AQM policy. A legacy switch
without ECN discards packets when congestion happens.
A switch with a net-control module, called a net-control switch here,
can act as the notification point (NP): it initiates the control
message and delivers it through the NET to NIC control channel back
to the sender, which adjusts its sending rate accordingly.
Net-control switches can be deployed anywhere in a DCN fabric, e.g.,
at the TOR or spine layer.
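
As a non-normative illustration, the switch behaviors described in
this section can be sketched as a single decision function. The
function name, the action names, and the two queue-length thresholds
are assumptions made for this sketch only; this document does not
define them, and a real switch may use other metrics (latency,
traffic patterns) to detect congestion.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"
    MARK_CE = "mark CE"
    PFC_OR_DROP = "PFC or drop per AQM policy"
    DROP = "drop"
    NOTIFY_SENDER = "send Net2Nic control message"


def switch_action(queue_len: int, *, net_control: bool, ecn: bool,
                  congested_at: int = 100, severe_at: int = 400) -> Action:
    """Pick the reaction of a switch to its current queue length.

    congested_at and severe_at are hypothetical thresholds (in packets).
    """
    if queue_len < congested_at:
        return Action.FORWARD          # no congestion detected
    if net_control:
        return Action.NOTIFY_SENDER    # net-control switch acts as NP
    if not ecn:
        return Action.DROP             # legacy switch without ECN
    if queue_len >= severe_at:
        return Action.PFC_OR_DROP      # legacy ECN switch, condition worse
    return Action.MARK_CE              # legacy ECN switch notifies receiver
```

The sketch treats a net-control switch as always signaling the sender
once it is congested; how such a switch combines this with legacy
marking is left open here.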
5.1.3. Receiver NIC
The receiver NIC might negotiate the congestion control capability
with the sender NIC. It is also a notification point (NP). It
discovers congestion from ECN marks or lost packets and sends
congestion information back to the sender through the NIC to NIC
control channel so that the sender can adjust its sending rate. In
RoCEv2, the CNP message is used for the NIC to NIC control.
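
As a non-normative illustration, the receiver's role as a
notification point can be sketched as follows. The Packet fields,
the class name, and the use of a per-flow sequence gap as the loss
signal are assumptions of this sketch; actual loss detection depends
on the RDMA transport in use.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Packet:
    flow_id: int
    seq: int
    ce_marked: bool = False


class ReceiverNP:
    """Decides whether a received packet should trigger a notification."""

    def __init__(self) -> None:
        self._next_seq: Dict[int, int] = {}   # next expected seq per flow

    def on_packet(self, pkt: Packet) -> Optional[str]:
        # The first packet of a flow defines the starting sequence number.
        expected = self._next_seq.get(pkt.flow_id, pkt.seq)
        gap = pkt.seq > expected              # packets were skipped
        self._next_seq[pkt.flow_id] = max(expected, pkt.seq + 1)
        if pkt.ce_marked:
            return "CE"                       # congestion marked by a switch
        if gap:
            return "LOSS"                     # congestion inferred from loss
        return None                           # nothing to report
```

A non-None return value stands for "send congestion information to
the sender over the NIC to NIC control channel", e.g., a CNP in
RoCEv2.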
5.2. Interfaces
The architecture introduces two interfaces on NICs and one interface
on the network node for the open control, as shown in Figure 2. On
the NIC, one interface is for deploying and managing different
congestion control algorithms, while the other communicates with the
network control module on switches. On the switch, the proposed
interface is dedicated to signaling network congestion control
information back to the senders.
        Sender(RP)
''''''''''''''''''''''''''
'   +---+  +---+         '
'   |CC1|  |CC1| ...     '
'   +-*-+  +-*-+         '
'     *      *           '
'+----*------*---------+ '
'| Congestion control  | '      Switch(CP and NP)
'|       Engine        | '
'+---------------------+ '      '''''''''''''''''
'+--------++-----------+ '      ' +-----------+ '
'|rate-co ||net-control|<-------- |net-control| '
'|ntrol;  ||           | '      ' |           | '
'|loss-re |+-----------+ '      ' +-----------+ '
'|covery  |+-----------+ '      '               '
'|        ||nic-control| '      '               '
'|        ||           | '      '               '
'+--------++-----------+ '      '               '
'+---------------------+ '      '               '
'|        data         | '      '               '
'|                     | '      '               '
'+---------------------+ '      '               '
'                        '      '               '
''''''''''''''''''''''''''      '''''''''''''''''
<-------- Net2Nic interface        ******** system CC interface
Figure 2. Imported NIC interfaces and network interface
5.2.1. NIC interfaces
To cope with various scenarios and facilitate the deployment of new
congestion control algorithms, NICs should be able to deploy
congestion control algorithms and to manage and configure them in a
common way. The idea behind the system CC interface is that cloud
operators can deploy and manage congestion control algorithms on NICs
based on traffic patterns and network resources. The NICs might then
negotiate the congestion control capability with each other.
The function blocks within the NIC are logical components and do not
indicate any specific implementation. A congestion control engine
acts as a platform: it provides the system CC interface for deploying
different congestion control algorithms (CCs), maps them to local
actions, and communicates with local function blocks to carry out
congestion control operations. Ideally, local functions related to
congestion control are implemented as function blocks that interact
with each other through internal interfaces to achieve the overall
congestion control. As such, CCs can share common local operations,
and developers can develop and deploy new CCs regardless of the
detailed local implementation. The design of the CC engine and the
local function blocks is out of scope for this document. An example
of the design and implementation can be found in [HotCocoa].
For now, the local function blocks can include rate-control and
loss-recovery, as well as two blocks that handle congestion control
information arriving from the NIC control interface and the NET
control interface, respectively.
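
As a non-normative illustration of the system CC interface, the
sketch below shows a CC engine that lets an operator install and
activate algorithms, all of which drive one shared rate-control
block. The class names, method names, the algorithm signature, and
the toy "halve or probe" algorithm are assumptions of this sketch
and are not defined by this document or taken from [HotCocoa].

```python
from typing import Callable, Dict, Optional

# A CC algorithm maps (current rate, congestion flag) to a new rate.
Algorithm = Callable[[float, bool], float]


class RateControl:
    """Shared local function block holding the sending rate (Gbit/s)."""

    def __init__(self, line_rate: float) -> None:
        self.line_rate = line_rate
        self.rate = line_rate

    def set_rate(self, rate: float) -> None:
        # Clamp to [1% of line rate, line rate].
        self.rate = max(0.01 * self.line_rate, min(rate, self.line_rate))


class CCEngine:
    """Platform behind the system CC interface (hypothetical API)."""

    def __init__(self, rate_control: RateControl) -> None:
        self._rate_control = rate_control
        self._installed: Dict[str, Algorithm] = {}
        self._active: Optional[str] = None

    def install(self, name: str, algorithm: Algorithm) -> None:
        self._installed[name] = algorithm

    def activate(self, name: str) -> None:
        if name not in self._installed:
            raise KeyError(f"algorithm {name!r} is not installed")
        self._active = name

    def on_feedback(self, congested: bool) -> float:
        """Run the active algorithm and apply its result to rate control."""
        if self._active is None:
            raise RuntimeError("no congestion control algorithm is active")
        algorithm = self._installed[self._active]
        self._rate_control.set_rate(
            algorithm(self._rate_control.rate, congested))
        return self._rate_control.rate


def halve_or_probe(rate: float, congested: bool) -> float:
    """Toy algorithm: halve on congestion, otherwise add 1 Gbit/s."""
    return rate / 2 if congested else rate + 1.0
```

Because an algorithm only sees the current rate and a congestion
flag, a new algorithm can be installed without touching the
rate-control block, which is the property the text above aims for.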
The other proposed NIC interface is the interface to the NET control
(Net2Nic control channel), which is used to collect congestion
information from the network nodes so that it can be incorporated
into the congestion control of sender NICs.
5.2.2. Network interface
To achieve more accurate congestion control, and to prevent or
resolve congestion based on the traffic traversing the switch, the
net-control switch provides a network interface (Net2Nic interface),
as indicated in Figure 2, through which the net-control module inside
the node can signal back to the senders.
The definition of Net2Nic control channel messages and processes is
out of scope for this document. It relies on the design of the
net-control module, which is responsible for dealing with network
congestion and for deciding what precise information to expose to the
sender.
6. Compatibility Consideration
6.1. Negotiate the congestion control capability
Hosts might negotiate their supported congestion control capabilities
during the session setup phase.
However, they should use the existing congestion control as the
default to provide compatibility with legacy devices.
The net-control switches should be capable of both legacy control and
NET to NIC control. The capability negotiation between NICs and
switches can be carried out either in-band, through ECN-like
negotiation, or out-of-band, through individual negotiation messages.
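
As a non-normative illustration of host capability negotiation with a
legacy default, the sketch below picks the first locally preferred
algorithm that the peer also supports and falls back to the existing
congestion control otherwise. The function name, the capability
names, and the "legacy" placeholder are assumptions of this sketch;
this document does not define a negotiation protocol.

```python
from typing import Iterable, Optional, Sequence

LEGACY = "legacy"  # placeholder name for the existing congestion control


def negotiate(local_preference: Sequence[str],
              remote_supported: Optional[Iterable[str]]) -> str:
    """Return the congestion control to use for a new session.

    remote_supported is None when the peer did not take part in the
    negotiation at all, e.g., because it is a legacy device.
    """
    if remote_supported is None:
        return LEGACY
    remote = set(remote_supported)
    for name in local_preference:      # most preferred first
        if name in remote:
            return name
    return LEGACY                      # no common capability
```

For example, negotiate(["cc-b", "cc-a"], ["cc-a"]) selects "cc-a",
while a peer that advertises nothing in common, or nothing at all,
results in the legacy default.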
6.2. Co-exist with current NIC to NIC control channel
In this architecture, the NET to NIC control channel can co-exist
with the NIC to NIC control channel. It can serve as an additional
control channel for better congestion control.
Once the NET to NIC channel of a sender is enabled on a switch, the
switch signals congestion information back to the sender through this
channel. For hosts without NET control, the switch works the same as
a legacy switch when congestion happens.
Receivers that detect congestion based on lost packets, packets
marked CE due to congestion on legacy network nodes, or the
exhaustion of local resources can still notify the senders according
to the congestion control algorithms. Each sender evaluates the
messages based on its local policies. For example, if it receives a
message from the net-control interface within a certain period before
the message from the receiver, it may decide based on the net-control
message; if no net-control message is received, it may react
according to the message from the receiver.
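
As a non-normative illustration, the example local policy above can
be sketched as follows. The class name, the "net"/"nic" source
labels, and the microsecond window are assumptions of this sketch; a
sender is free to apply a different local policy.

```python
from typing import Optional


class SenderPolicy:
    """Prefer net-control feedback over recent receiver feedback."""

    def __init__(self, window_us: int) -> None:
        self.window_us = window_us                 # hypothetical period
        self._last_net_us: Optional[int] = None    # last Net2Nic message

    def should_react(self, source: str, now_us: int) -> bool:
        """Return True if the sender should act on this message."""
        if source == "net":
            self._last_net_us = now_us
            return True
        if source != "nic":
            raise ValueError(f"unknown feedback source: {source!r}")
        # Skip the receiver's message if a net-control message arrived
        # within the window before it; otherwise react to it.
        recent_net = (self._last_net_us is not None
                      and 0 <= now_us - self._last_net_us <= self.window_us)
        return not recent_net
```

With a window of 1000 microseconds, a receiver message that arrives
500 microseconds after a net-control message is skipped, while one
that arrives with no recent net-control message is acted upon.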
Note that the NET to NIC control channel SHOULD be implemented as an
optional feature rather than a mandatory one.
7. Security Considerations
TBD
8. Manageability Consideration
TBD
9. IANA Considerations
No IANA action
10. References
10.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
10.2. Informative References
[HotCocoa]
Arashloo, M. T., Ghobadi, M., Rexford, J., and D. Walker,
"HotCocoa: Hardware Congestion Control Abstractions",
November 2017, <https://www.cs.princeton.edu/~jrex/papers/
hotcocoa17.pdf>.
[HPC] Cardona, O., "Towards Hyperscale High Performance
Computing with RDMA", June 2019,
<https://pc.nanog.org/static/published/meetings/NANOG76/19
99/20190612_Cardona_Towards_Hyperscale_High_v1.pdf>.
[IB] InfiniBand Trade Association, "InfiniBand(TM) Architecture
Specification Volume 1 and Volume 2",
<https://cw.infinibandta.org/document/dl/7781>.
[Requirement]
Chen, F., Sun, W., Yu, X., and R. Even, "Data Center
Congestion Management requirements", June 2019,
<https://datatracker.ietf.org/doc/html/
draft-yueven-tsvwg-dccm-requirements>.
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
of Explicit Congestion Notification (ECN) to IP",
RFC 3168, DOI 10.17487/RFC3168, September 2001,
<https://www.rfc-editor.org/info/rfc3168>.
[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
Garcia, "A Remote Direct Memory Access Protocol
Specification", RFC 5040, DOI 10.17487/RFC5040, October
2007, <https://www.rfc-editor.org/info/rfc5040>.
Authors' Addresses
Yan Zhuang
Huawei Technologies Co., Ltd.
Email: zhuangyan.zhuang@huawei.com
Rachel Huang
Huawei Technologies Co., Ltd.
Email: rachel.huang@huawei.com