RTGWG Y. Lyu
Internet-Draft Y. Zhang
Intended status: Standards Track M. Liu
Expires: 22 April 2024 Huawei
20 October 2023
Coordinated Congestion Management
draft-lyu-rtgwg-coordinated-cm-00
Abstract
AI fabrics are sensitive to bandwidth. Congestion management,
which includes congestion control and load balancing, is the main
means of fully utilizing network resources. However, current
congestion management mechanisms are not coordinated, which reduces
throughput. This document provides a scheme for coordinating
different congestion management mechanisms. It describes the design
principles and the behaviors of network switches and hosts in the
scheme, and gives an example that shows the end-to-end procedure.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 22 April 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
Lyu, et al. Expires 22 April 2024 [Page 1]
Internet-Draft CCM October 2023
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Terminology
3. Requirements Language
4. Existing congestion management
5. Design principle of coordinated congestion management
6. Coordinated congestion management scheme
6.1. Coordination tag
6.2. Notification message
6.3. Behavior of network switches
6.3.1. Identify congestion type
6.3.2. Notify CC congestion
6.3.3. Notify upstream point to perform AR
6.3.4. Perform congestion control
6.3.5. Perform adaptive routing
6.4. Behavior of source hosts
7. An example of end-to-end procedure
8. Security Considerations
9. IANA Considerations
10. References
10.1. Normative References
10.2. Informative References
Authors' Addresses
1. Introduction
ML/AI has progressed rapidly over the last decade. The compute
required by ML/AI models, measured in FLOPs, is constantly
increasing. Training such large models requires distributed parallel
training in an AI cluster.
Communication in an AI cluster is bandwidth sensitive. Data
parallelism and model parallelism, the two acceleration methods in AI
training, both produce an on-off burst traffic pattern with a large
traffic volume in each iteration.
Therefore, an AI fabric should provide high effective bandwidth in
order to shorten communication time and improve computation
efficiency. Effective bandwidth means fully utilizing link bandwidth
to achieve high throughput. Congestion management, which includes
congestion control mechanisms and load balancing mechanisms, is the
key technology.
This document discusses the lack of coordination among current
congestion management mechanisms, which leads to throughput issues
that are particularly harmful in an AI fabric. It provides a scheme
for coordinating different congestion management mechanisms that can
be deployed effectively and widely in AI fabrics.
2. Terminology
* ML: Machine Learning
* AI: Artificial Intelligence
* FLOPs: Floating-Point Operations
* ECN: Explicit Congestion Notification
* AR: Adaptive Routing
* DCQCN: Data Center Quantized Congestion Notification [DCQCN]
* CNP: Congestion Notification Packet
* PLB: Protective Load Balancing [PLB]
* CC: Congestion Control
* ECMP: Equal-cost multi-path routing
* Incast congestion: congestion caused by multiple sources sending
traffic to the same destination simultaneously
* Incast flow: a flow causing incast congestion
* Incast traffic: packets in incast flows
* CC congestion: congestion caused by incast or by a high-speed
port sending traffic to a low-speed port
* CC flow: the flow causing CC congestion
* CC traffic: packets in CC flows
3. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
4. Existing congestion management
Congestion is usually caused by incast traffic and/or imbalanced
network load. Incast traffic is traffic from multiple source hosts
towards the same destination host. Commonly used solutions include
congestion control algorithms that control sending rates and load
balancing algorithms that adjust the paths taken by traffic.
* A congestion control algorithm, such as DCQCN [DCQCN] or Timely
[Timely], identifies network congestion from network status, such
as the queue length of a switch port or the end-to-end round-trip
time (RTT), and then adjusts the sending rate at the sender to
alleviate congestion. How to reduce the rate quickly enough to
avoid packet loss, and how to recover the rate so as to minimize
throughput reduction, are essential to a congestion control
mechanism in an AI fabric.
* Adaptive routing is one way to balance load. According to
network status, a network switch dynamically changes the path of a
flow in order to fully utilize network resources. Network status
can be indicated by local link status, downstream link status,
etc. How to locate a suitable new path without switching paths
back and forth is critical in an AI fabric, because each path
switch may increase system complexity, for example through packet
re-ordering.
Currently, congestion control and adaptive routing work
independently, without coordination, which has a negative impact on
system performance. For example, when congestion caused by
imbalanced network load occurs on a switch, both DCQCN and adaptive
routing are activated. ECN is marked in data packets, which causes a
CNP to be sent back to the sender, and the sender slows down the
sending rate of the congested flow. Meanwhile, the switch changes
the path of the congested flow, steering newly arriving packets to a
light-loaded path. The result is that the congested flow is
forwarded on the light-loaded path at a low rate, and DCQCN then
needs some time to recover the sending rate on the new path. This
reduces effective bandwidth and seriously impacts computation
efficiency in AI training. As another example, if the congestion is
caused by incast traffic, congestion control should be enough.
Additional adaptive routing adjustments not only fail to mitigate
congestion but may also introduce more out-of-order packets.
This shows that current congestion management cannot efficiently
handle congestion in an AI fabric. Uncoordinated behaviors reduce
effective network bandwidth, which is essential for AI workloads.
5. Design principle of coordinated congestion management
Coordinated congestion management is designed to coordinate
congestion control and adaptive routing. The design principles are
as follows.
* Avoid unnecessary sending rate reduction
An AI fabric is bandwidth sensitive, and high throughput is
extremely important. Multipath is needed to make full use of
network bandwidth. Slowing down the sending rate while paths are
still available for the traffic wastes network resources, thereby
increasing communication time in the AI cluster and reducing AI
training performance.
* Fully use multipath while reducing invalid path switching
When searching for light-loaded paths for load balancing, new
paths should be located quickly and accurately. The search for a
new path should not be restricted to local paths but should extend
to available paths upstream. Invalid path switching should be
avoided. An example of invalid path switching is switching the
path of incast traffic: no matter which path is chosen, the traffic
will eventually be congested at the last hop.
* Reuse current CC algorithm and AR algorithm
A variety of CC algorithms and AR algorithms already exist. They
can still be used in the congestion management coordination
scheme. The scheme enables CC and AR to be triggered in a
coordinated manner, adjusting the sending rate or switching the
path depending on the cause of congestion.
* Applicable to various topologies
Most AI fabrics use Clos or fat-tree topologies, but new studies
also consider direct topologies such as torus, dragonfly, and
dragonfly+. Some existing solutions for CC and AR coordination,
e.g., PLB [PLB], rely on ECMP, which can only be used in
topologies with equal-cost paths, such as Clos. Such solutions do
not work for topologies without equal-cost paths, such as
dragonfly+. The coordination scheme should be applicable to
different topologies.
6. Coordinated congestion management scheme
The key to coordinated congestion management is to distinguish CC
traffic from non-CC traffic so that the two are treated differently
in the network when congestion occurs. When the network recognizes a
CC flow, it notifies the source host, and the source host tags the
subsequent packets of that flow. The tag tells network switches to
apply the CC mechanism to the flow instead of AR. For non-CC
traffic, the network switch first performs AR. Only when the AR
mechanism cannot find a light-loaded path to switch to does the
traffic become CC traffic, and CC is then run to alleviate
congestion.
Coordinated congestion management requires interaction between
network switches and source hosts, and adds a new tag to data packets
for the coordination. The following sections explain the details of
the scheme.
6.1. Coordination tag
A coordination tag is inserted into data packets. The tag contains a
CC indicator and an AR indicator.
* CC indicator: indicates whether the packet belongs to a flow that
needs congestion control, such as an incast flow.
* AR indicator: indicates the location of the upstream AR point
where adaptive routing can be performed. The AR point can be a
network switch or a source host. The AR indicator can be an ID, an
IP address, or other information that indicates how to send a
message to the AR point.
The tag can be carried in data packets using an in-band telemetry
scheme. CSIG [I-D.draft-ravi-ippm-csig], a new method, may provide
another option.
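As an illustration only, the two indicators could be modelled as shown below. The class name CoordinationTag, its field names, and the use of a string for the AR indicator are assumptions of this sketch; this document does not define an encoding or wire format for the tag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoordinationTag:
    """Hypothetical in-memory model of the coordination tag (not a wire format)."""
    cc: bool = False                # CC indicator: True if the flow needs congestion control
    ar_point: Optional[str] = None  # AR indicator: ID or IP address of the upstream AR point

    def has_upstream_ar(self) -> bool:
        # An unset AR indicator means no upstream point is known to offer AR.
        return self.ar_point is not None


tag = CoordinationTag()   # tag on a packet of a new, non-CC flow
tag.ar_point = "L1-1-1"   # a switch with several light-loaded paths records itself
```

Keeping the AR indicator optional mirrors the case in which a packet leaves the host without pointing to any AR point.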
6.2. Notification message
There are three types of notification.
* Type 1: congestion control required
Example: A type 1 message is sent from the switch experiencing
incast congestion to the source host of an incast flow, notifying
the source host to tag (set the CC indicator in) the packets of
the incast flow.
* Type 2: congestion control released
Example: When incast congestion is eliminated, the switch sends a
type 2 message to the corresponding source hosts, notifying them
to clear the CC indicator in the subsequent packets of the
corresponding flows.
* Type 3: upstream AR required
Example: If the switch determines that AR should be performed
upstream, a type 3 message is sent to the upstream AR point. The
upstream AR point can be a one-hop neighbour of the switch or a
point multiple hops away.
The notification message includes the source IP, destination IP,
notification type, and flow key. The source IP is the IP address of
the switch that sends the notification. The destination IP is the IP
address of the destination that will handle the notification
message. The notification type is one of the three types above. The
flow key is the information identifying the flow to be handled, such
as its 5-tuple.
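The four fields listed above can be sketched as a small record type. The names Notification, NotificationType, and FlowKey, as well as the addresses and ports in the example, are hypothetical and chosen only for illustration; the message format itself is not specified here.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple

# Flow key modelled as a 5-tuple: (src IP, dst IP, protocol, src port, dst port).
FlowKey = Tuple[str, str, int, int, int]


class NotificationType(IntEnum):
    CC_REQUIRED = 1           # Type 1: congestion control required
    CC_RELEASED = 2           # Type 2: congestion control released
    UPSTREAM_AR_REQUIRED = 3  # Type 3: upstream AR required


@dataclass(frozen=True)
class Notification:
    source_ip: str        # address of the switch sending the notification
    destination_ip: str   # address of the host or AR point that handles it
    ntype: NotificationType
    flow_key: FlowKey


# Example with made-up values: a last-hop switch asks the source host of an
# incast flow to set the CC indicator.
flow = ("10.0.1.1", "10.0.9.1", 17, 49152, 4791)
msg = Notification("10.0.9.254", flow[0], NotificationType.CC_REQUIRED, flow)
```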
6.3. Behavior of network switches
6.3.1. Identify congestion type
When congestion is detected, the network switch determines whether it
is CC congestion or non-CC congestion. CC congestion includes incast
congestion and congestion caused by a high-speed port sending traffic
to a low-speed port.
If congestion occurs at a switch egress port and the switch is the
last-hop switch to the destination host, the congestion is
determined to be incast congestion. The flows causing the incast
congestion are identified as incast flows.
There may be other methods to identify the congestion type. This
document does not place any restriction on them.
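The classification rule can be sketched as a small function. The last-hop rule comes from the text above; detecting a speed mismatch by comparing ingress and egress port rates is an assumption of this sketch, as are all names used.

```python
from enum import Enum


class CongestionType(Enum):
    CC = "cc"          # incast, or a high-speed port feeding a low-speed port
    NON_CC = "non-cc"  # e.g. imbalanced load, a candidate for adaptive routing


def classify_congestion(is_last_hop: bool,
                        ingress_gbps: float,
                        egress_gbps: float) -> CongestionType:
    """Classify congestion detected on an egress port."""
    if is_last_hop:
        # Egress congestion on the last-hop switch is treated as incast.
        return CongestionType.CC
    if ingress_gbps > egress_gbps:
        # Assumed test for a high-speed port sending into a low-speed port.
        return CongestionType.CC
    return CongestionType.NON_CC
```

An implementation may use other signals; the point is only that the result selects between the CC path and the AR path in the following sections.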
6.3.2. Notify CC congestion
When the network switch determines that CC congestion has occurred,
it generates a type 1 notification message for each identified CC
flow and sends each message to the source host of that flow. When
the CC congestion is eliminated, the switch sends type 2
notification messages to the source hosts.
6.3.3. Notify upstream point to perform AR
When the network switch decides to perform AR but cannot do so
locally, and the AR indicator in the data packet shows that AR is
available upstream, it sends a type 3 notification message to the
upstream point identified by the AR indicator.
6.3.4. Perform congestion control
The network switch performs congestion control in the following
cases.
* The congestion is identified as CC congestion.
* The congestion is identified as non-CC congestion, but adaptive
routing cannot be used because no new path is available for
traffic switching, either locally or upstream.
This document does not limit which CC mechanism is performed.
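The two cases can be written as a single predicate. This is a minimal sketch; the function and parameter names are hypothetical.

```python
def should_perform_cc(is_cc_congestion: bool,
                      local_ar_available: bool,
                      upstream_ar_available: bool) -> bool:
    """Return True when the switch should trigger congestion control."""
    if is_cc_congestion:
        return True
    # Non-CC congestion: fall back to CC only when no new path exists,
    # neither locally nor upstream.
    return not (local_ar_available or upstream_ar_available)
```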
6.3.5. Perform adaptive routing
The network switch performs adaptive routing in the following cases.
* The flow is non-CC traffic. The CC indicator in the data packet
is used to determine whether it is CC traffic or non-CC traffic.
* A type 3 notification message is received. According to the flow
information in the notification, a new path is selected for the
subsequent packets of the flow.
To enable upstream AR, the AR indicator in data packets must be
updated hop by hop. When a data packet arrives at a network switch:
* If there are several local light-loaded paths available for AR on
the switch, the switch updates the AR indicator in the data packet
to identify itself, for example with its own ID. The switch then
selects an appropriate local path on which to send the data
packet. This document does not define the local path selection
algorithm; it depends on the routing strategy of the network
switch.
* If there is only one local light-loaded path available for AR,
the network switch can only select that path for the traffic. The
AR indicator in the data packet is not updated.
* If there is no local light-loaded path, the network switch
determines upstream AR availability by reading the AR indicator in
the data packet. If the AR indicator indicates that an upstream
point can perform AR,
the network switch generates a type 3 notification message and
sends it directly to the corresponding upstream point. Otherwise,
the network switch triggers the congestion control mechanism, for
example by setting ECN in the data packet.
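The three per-hop cases can be sketched as follows for a packet of a non-CC flow. The Tag class, the function name, and the returned action strings are hypothetical; picking the first light-loaded path is only a placeholder, since local path selection is left to the routing strategy of the switch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tag:
    cc: bool = False
    ar_point: Optional[str] = None


def handle_non_cc_packet(switch_id: str, tag: Tag,
                         light_paths: List[str]) -> Tuple[str, Optional[str]]:
    """Return (action, argument) for a packet of a non-CC flow.

    action is "forward" (argument: chosen path), "notify_upstream"
    (argument: AR point that receives a type 3 message) or "cc".
    """
    if len(light_paths) > 1:
        tag.ar_point = switch_id          # this switch can re-route later
        return "forward", light_paths[0]  # placeholder path selection
    if len(light_paths) == 1:
        return "forward", light_paths[0]  # AR indicator left unchanged
    if tag.ar_point is not None:
        return "notify_upstream", tag.ar_point
    return "cc", None                     # e.g. set ECN in the packet
```

Because a switch with a single light-loaded path leaves the indicator untouched, the tag always names the nearest upstream point that still has a choice of paths.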
6.4. Behavior of source hosts
When receiving a type 1 notification message, the source host sets
the CC indicator in the subsequent packets of the corresponding flow.
When receiving a type 2 notification message, the source host clears
the CC indicator in the subsequent packets of the corresponding flow.
When receiving a type 3 notification message, the source host
performs AR on the subsequent packets of the corresponding flow.
When receiving congestion control signals while the CC indicator is
set, the source host performs CC on the flow.
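A minimal sketch of the host-side state machine is given below. The SourceHost class and its attributes are hypothetical, and the AR action is reduced to a counter because the path selection method at the host is not specified in this document.

```python
from typing import Dict, Hashable


class SourceHost:
    """Hypothetical per-flow state kept by a source host."""

    def __init__(self) -> None:
        self.cc_indicator: Dict[Hashable, bool] = {}  # flow key -> CC indicator for new packets
        self.reroutes: Dict[Hashable, int] = {}       # flow key -> number of AR decisions made

    def on_notification(self, ntype: int, flow: Hashable) -> None:
        if ntype == 1:      # congestion control required
            self.cc_indicator[flow] = True
        elif ntype == 2:    # congestion control released
            self.cc_indicator[flow] = False
        elif ntype == 3:    # upstream AR required: pick another path for the flow
            self.reroutes[flow] = self.reroutes.get(flow, 0) + 1
        else:
            raise ValueError(f"unknown notification type {ntype}")

    def should_run_cc(self, flow: Hashable, congestion_signal: bool) -> bool:
        # CC runs only when a congestion signal arrives and the CC indicator is set.
        return congestion_signal and self.cc_indicator.get(flow, False)
```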
7. An example of end-to-end procedure
The network topology is shown in Figure 1. It is a 4-layer fat-tree
topology with n computing racks and m switching racks. Computing
racks contain source hosts, layer 1 switches, and layer 2 switches.
Switching racks contain layer 3 and layer 4 switches.
    Switching Rack 1    Switching Rack m
    +---------------+   +---------------+
    |L4-1-1...L4-1-e|   |L4-m-1...L4-m-e|
    |  | \    / |   |   |  | \    / |   |
    |  |  \  /  |   |   |  |  \  /  |   |
    |  |   \/   |   |   |  |   \/   |   |
    |  |   /\   |   |...|  |   /\   |   |
    |  |  /  \  |   |   |  |  /  \  |   |
    |  | /    \ |   |   |  | /    \ |   |
    |L3-1-1...L3-1-d|   |L3-m-1...L3-m-d|
    +--+-----------\    +-/----------+--+
       |            \    /           |
       |             \  /            |
       |  ......      \/     ......  |
       |              /\             |
       |             /  \            |
       |            /    \           |
    +--+-----------/    +-\----------+--+
    |L2-1-1...L2-1-c|   |L2-n-1...L2-n-c|
    |  | \    / |   |   |  | \    / |   |
    |  |  \  /  |   |   |  |  \  /  |   |
    |  |   \/   |   |   |  |   \/   |   |
    |  |   /\   |   |...|  |   /\   |   |
    |  |  /  \  |   |   |  |  /  \  |   |
    |  | /    \ |   |   |  | /    \ |   |
    |L1-1-1...L1-1-b|   |L1-n-1...L1-n-b|
    |  +        +   |   |  +        +   |
    | H-1-1... H-1-a|   | H-n-1... H-n-a|
    +---------------+   +---------------+
    Computing Rack 1    Computing Rack n
Figure 1: Network Topology
* Host H-1-1 in computing rack 1 sends a data packet P1 belonging
to flow F1 to H-n-1 in computing rack n. The CC indicator in the
packet tag is not set, indicating that the packet is in a
non-incast flow. The AR indicator in the packet tag does not point
to any available AR point.
* P1 arrives at switch L1-1-1 in computing rack 1. L1-1-1 has
multiple light-loaded paths for AR. The path from L1-1-1 to L2-1-1
is selected for P1. The AR indicator in the P1 tag is updated to
L1-1-1.
* P1 arrives at switch L2-1-1. L2-1-1 also has multiple
light-loaded paths for AR. The path from L2-1-1 to L3-1-1 is
selected for P1. The AR indicator in the P1 tag is updated to
L2-1-1.
* P1 arrives at switch L3-1-1. L3-1-1 has only one light-loaded
path. The only path, from L3-1-1 to L4-1-1, is selected for P1.
The AR indicator in the P1 tag remains L2-1-1.
* P1 arrives at switch L4-1-1. L4-1-1 is congested and has no
local path available for performing AR. By reading the AR
indicator in P1, L4-1-1 sends a type 3 notification to L2-1-1.
* After receiving the AR notification, L2-1-1 switches the path
from L2-1-1->L3-1-1 to L2-1-1->L3-m-1 for newly arriving packets
of flow F1.
* After a while, L1-n-1 becomes congested due to incast. Flow F1
is identified as an incast flow. L1-n-1 sends a type 1
notification to H-1-1.
* On receiving the type 1 notification, H-1-1 sets the CC
indicator in the subsequent packets of F1, indicating that the
packets are in an incast flow. AR will therefore not be performed
on those packets. The sending rate of F1 will also be reduced
according to the congestion control algorithm.
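The AR part of this walkthrough can be replayed with a short trace. The HOPS table encodes the example, using 2 as a stand-in for "multiple" light-loaded paths; the table, the trace function, and its return convention are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

# (switch, number of light-loaded local paths) along the path taken by P1.
HOPS: List[Tuple[str, int]] = [
    ("L1-1-1", 2), ("L2-1-1", 2), ("L3-1-1", 1), ("L4-1-1", 0),
]


def trace(hops: List[Tuple[str, int]]) -> Tuple[Optional[str], Optional[str]]:
    """Return (switch that sends a type 3 message, AR point it is sent to)."""
    ar_point: Optional[str] = None   # the host sends P1 with no AR point set
    for switch, light_paths in hops:
        if light_paths > 1:
            ar_point = switch        # AR indicator updated to this switch
        elif light_paths == 0:
            if ar_point is None:
                return None, None    # no upstream AR point: fall back to CC
            return switch, ar_point  # type 3 notification to the recorded point
    return None, None                # no hop lacked a light-loaded path


sender, target = trace(HOPS)
```

With the hops of the example, the trace ends at L4-1-1, which addresses its type 3 notification to L2-1-1, the last switch that recorded itself in the AR indicator.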
8. Security Considerations
TBD.
9. IANA Considerations
TBD.
10. References
10.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
10.2. Informative References
[I-D.draft-ravi-ippm-csig]
Ravi, A., Dukkipati, N., Mehta, N., and J. Kumar,
"Congestion Signaling (CSIG)", Work in Progress, Internet-
Draft, draft-ravi-ippm-csig-00, 31 August 2023,
<https://datatracker.ietf.org/doc/html/draft-ravi-ippm-
csig-00>.
[DCQCN] "Congestion Control for Large-Scale RDMA Deployments",
August 2015,
<https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/
p523.pdf>.
[Timely] "TIMELY: RTT-based Congestion Control for the Datacenter",
August 2015,
<https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/
p537.pdf>.
[PLB] "PLB: Congestion Signals are Simple and Effective for
Network Load Balancing", August 2022,
<https://dl.acm.org/doi/pdf/10.1145/3544216.3544226>.
Authors' Addresses
Yunping(Lily) Lyu
Huawei
Email: lvyunping@huawei.com
Yuhan Zhang
Huawei
Email: zhangyuhan6@huawei.com
Mengzhu Liu
Huawei
Email: liumengzhu@huawei.com