Network Working Group X. Xu
Internet-Draft China Mobile
Intended status: Standards Track Z. He
Expires: 28 August 2024 Broadcom
J. Wang
Centec
H. Huang
Huawei
Q. Zhang
H3C
H. Wu
Ruijie Networks
Y. Liu
Y. Xia
Tencent
P. Wang
Baidu
S. Hegde
Juniper
25 February 2024
Fully Adaptive Routing Ethernet
draft-xu-lsr-fare-02
Abstract
Large language models (LLMs) like ChatGPT have become increasingly
popular in recent years due to their impressive performance in
various natural language processing tasks. These models, which often
consist of billions or even trillions of parameters, are built by
training deep neural networks on massive amounts of text data.
However, the
training process for these models can be extremely resource-
intensive, requiring the deployment of thousands or even tens of
thousands of GPUs in a single AI training cluster. Therefore, three-
stage or even five-stage CLOS networks are commonly adopted for AI
networks. The non-blocking nature of the network becomes increasingly
critical for large-scale AI models. Therefore, adaptive routing is
necessary to dynamically load balance traffic to the same destination
over multiple ECMP paths, based on network capacity and even
congestion information along those paths.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Xu, et al. Expires 28 August 2024 [Page 1]
Internet-Draft FARE February 2024
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 28 August 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Terminology
3. Solution Description
  3.1. Adaptive Routing in 3-stage CLOS
  3.2. Adaptive Routing in 5-stage CLOS
4. Modifications to OSPF and ISIS Behavior
5. Acknowledgements
6. IANA Considerations
7. Security Considerations
8. References
  8.1. Normative References
  8.2. Informative References
Authors' Addresses
1. Introduction
Large language models (LLMs) like ChatGPT have become increasingly
popular in recent years due to their impressive performance in
various natural language processing tasks. These models, which often
consist of billions or even trillions of parameters, are built by
training deep neural networks on massive amounts of text data.
However, the
training process for these models can be extremely resource-
intensive, requiring the deployment of thousands or even tens of
thousands of GPUs in a single AI training cluster. Therefore, three-
stage or even five-stage CLOS networks are commonly adopted for AI
networks. Furthermore, in rail-optimized CLOS topologies with
standard GPU servers (each forming a high-bandwidth (HB) domain of
eight GPUs), the Nth GPU of each server in a group of servers is
connected to the Nth leaf switch, which provides higher bandwidth and
non-blocking connectivity between the GPUs in the same rail. In a
rail-optimized topology, most traffic between GPU servers traverses
the intra-rail networks rather than the inter-rail networks.
The non-blocking nature of the network, especially the network for
intra-rail communication, becomes increasingly critical for large-
scale AI models. AI workloads tend to be extremely bandwidth-hungry
and usually generate a few elephant flows simultaneously. If
traditional hash-based ECMP load-balancing is used without any
optimization, serious congestion and high latency are likely once
multiple elephant flows are hashed to the same link. Since the job
completion time depends on worst-case performance, serious congestion
will make model training take longer than expected. Therefore,
adaptive routing is necessary
to dynamically load balance traffic to the same destination over
multiple ECMP paths, based on network capacity and even congestion
information along those paths. In other words, adaptive routing is a
capacity-aware and even congestion-aware path selection algorithm.
Furthermore, to minimize the risk of congestion, the routing should
be as fine-grained as possible. Flow-granular adaptive routing still
carries a statistical risk of congestion. Therefore, packet-granular
adaptive routing is more desirable, although packet spraying causes
out-of-order delivery. A flexible reordering mechanism must therefore
be put in place (e.g., at the egress ToRs or the receiving servers).
Recent optimizations for RoCE, and newly invented transport protocols
offered as alternatives to RoCE, no longer require out-of-order
delivery to be handled at the network layer. Instead, the message
processing layer addresses it.
To enable adaptive routing, no matter whether flow-granular or
packet-granular adaptive routing, it is necessary to propagate
network topology information, including link capacity and/or even
available link capacity (i.e., link capacity minus link load), across
the CLOS network. Therefore, it seems straightforward to use a link-
state protocol such as OSPF or ISIS, instead of BGP, as the underlay
routing protocol in the CLOS network, and to propagate link capacity
information and/or available link capacity information by using the
OSPF or ISIS TE Metric or Extended TE Metric [RFC3630] [RFC7471]
[RFC5305] [RFC7810]. More specifically, the Maximum Link Bandwidth
sub-TLV and the Unidirectional Utilized Bandwidth sub-TLV could be
used for advertising the link capacity and available link capacity
information.
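As a rough illustration of the definition above (available link
capacity is link capacity minus link load), the following Python
sketch derives the available capacity of one link from an advertised
maximum bandwidth and an advertised utilized bandwidth. The LinkState
class, its field names, and the Gbit/s unit are assumptions made for
this example only; they are not defined by this document or by the
referenced RFCs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    """Hypothetical per-link record built from advertised TE attributes."""

    max_link_bandwidth: float  # link capacity (illustrative unit: Gbit/s)
    utilized_bandwidth: float  # current link load, same unit

    def available_capacity(self) -> float:
        # Available link capacity = link capacity minus link load.
        # Clamp at zero in case the measured load exceeds the capacity.
        return max(self.max_link_bandwidth - self.utilized_bandwidth, 0.0)


link = LinkState(max_link_bandwidth=400.0, utilized_bandwidth=150.0)
print(link.available_capacity())  # 250.0
```

Clamping at zero is a design choice of this sketch so that a
momentarily over-reported load never produces a negative weight.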
For information on resolving the flooding issues caused by link-state
protocols in large CLOS networks, please refer to
[I-D.xu-lsr-flooding-reduction-in-clos].
Note that while adaptive routing, especially at the packet-granular
level, can help reduce congestion between switches in the network,
thereby achieving a non-blocking fabric, it does not address the
incast congestion that is commonly experienced at the last-hop
switches connected to the receivers in many-to-one communication
patterns. Therefore, a congestion control mechanism is always
necessary between the sending and receiving servers to mitigate such
congestion.
2. Terminology
This memo makes use of the terms defined in [RFC2328] and [RFC1195].
3. Solution Description
3.1. Adaptive Routing in 3-stage CLOS
+----+ +----+ +----+ +----+
| S1 | | S2 | | S3 | | S4 | (Spine)
+----+ +----+ +----+ +----+
+----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
| L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 | (Leaf)
+----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
Figure 1
(Note that the diagram above does not include the connections between
nodes. However, it can be assumed that leaf nodes are connected to
every spine node in their CLOS topology.)
In a three-stage CLOS network, also known as a leaf-spine network, as
shown in Figure 1, all nodes MAY be in OSPF area zero or ISIS Level-2.
Leaf nodes are enabled for adaptive routing in OSPF area zero or
ISIS Level-2.
When a leaf node, such as L1, calculates the shortest path to a
specific IP prefix originated by another leaf node in the same OSPF
area or ISIS Level-2 area, say L2, four equal-cost multi-path (ECMP)
routes will be created via the four spine nodes: S1, S2, S3, and S4.
To enable adaptive routing, weight values based on the link capacity,
or even the available link capacity, of the upstream and downstream
links SHOULD be considered for global load-balancing. In particular,
the smaller of the capacity of the upstream link (e.g., L1->S1) and
the capacity of the downstream link (e.g., S1->L2) of a given path
(e.g., L1->S1->L2) is used as the weight value for that path when
performing weighted ECMP load-balancing.
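The weight rule above can be sketched in Python as follows. This is a
minimal illustration under stated assumptions: the function names, the
dictionary layout keyed by (from, to) node pairs, and the capacity
figures are hypothetical and chosen only for this example.

```python
def path_weights(src, dst, spines, capacity):
    """Weight of each two-hop path src->spine->dst.

    capacity maps a directed link (from_node, to_node) to its capacity
    (or available capacity). A missing link counts as zero capacity.
    """
    weights = {}
    for spine in spines:
        upstream = capacity.get((src, spine), 0.0)
        downstream = capacity.get((spine, dst), 0.0)
        # The path is limited by its narrower link.
        weights[spine] = min(upstream, downstream)
    return weights


def traffic_shares(weights):
    """Fraction of traffic per next hop for weighted ECMP."""
    total = sum(weights.values())
    if total == 0:
        return {hop: 0.0 for hop in weights}
    return {hop: w / total for hop, w in weights.items()}


# Illustrative capacities (Gbit/s): two links are degraded.
capacity = {
    ("L1", "S1"): 400.0, ("S1", "L2"): 400.0,
    ("L1", "S2"): 200.0, ("S2", "L2"): 400.0,
    ("L1", "S3"): 400.0, ("S3", "L2"): 100.0,
    ("L1", "S4"): 400.0, ("S4", "L2"): 400.0,
}
weights = path_weights("L1", "L2", ["S1", "S2", "S3", "S4"], capacity)
shares = traffic_shares(weights)
```

In this example the paths via S2 and S3 receive weights of 200 and
100, so L1 steers proportionally less traffic toward them even though
all four paths have equal IGP cost.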
3.2. Adaptive Routing in 5-stage CLOS
=========================================
# +----+ +----+ +----+ +----+ #
# | L1 | | L2 | | L3 | | L4 | (Leaf) #
# +----+ +----+ +----+ +----+ #
# PoD-1 #
# +----+ +----+ +----+ +----+ #
# | S1 | | S2 | | S3 | | S4 | (Spine) #
# +----+ +----+ +----+ +----+ #
=========================================
=============================== ===============================
# +----+ +----+ +----+ +----+ # # +----+ +----+ +----+ +----+ #
# |SS1 | |SS2 | |SS3 | |SS4 | # # |SS1 | |SS2 | |SS3 | |SS4 | #
# +----+ +----+ +----+ +----+ # # +----+ +----+ +----+ +----+ #
# (Super-Spine@Plane-1) # # (Super-Spine@Plane-4) #
#============================== ... ===============================
=========================================
# +----+ +----+ +----+ +----+ #
# | S1 | | S2 | | S3 | | S4 | (Spine) #
# +----+ +----+ +----+ +----+ #
# PoD-8 #
# +----+ +----+ +----+ +----+ #
# | L1 | | L2 | | L3 | | L4 | (Leaf) #
# +----+ +----+ +----+ +----+ #
=========================================
Figure 2
(Note that the diagram above does not include the connections between
nodes. However, it can be assumed that the leaf nodes in a given PoD
are connected to every spine node in that PoD. Similarly, each spine
node (e.g., S1) is connected to all super-spine nodes in the
corresponding PoD-interconnect plane (e.g., Plane-1).)
For a five-stage CLOS network as illustrated in Figure 2, each PoD
consisting of leaf and spine nodes is configured as an OSPF non-zero
area or an ISIS Level-1 area. The PoD-interconnect plane consisting
of spine and super-spine nodes is configured as an OSPF area zero or
an ISIS Level-2 area. Therefore, spine nodes play the role of OSPF
area border routers or ISIS Level-1-2 routers.
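The area and level assignment described above can be summarized with
a small Python sketch. The igp_scope function, the role strings, and
the use of the PoD number as the OSPF non-zero area ID are assumptions
made for illustration; this document does not prescribe any
particular numbering.

```python
def igp_scope(role, pod=None):
    """Return the OSPF areas and ISIS level for a node in a 5-stage CLOS.

    role is "leaf", "spine", or "super-spine" (hypothetical labels).
    pod is the PoD number, reused here as the OSPF non-zero area ID.
    """
    if role in ("leaf", "spine"):
        if not isinstance(pod, int) or pod <= 0:
            raise ValueError("leaf and spine nodes need a positive PoD number")
    if role == "leaf":
        # Leaf nodes live only inside their PoD.
        return {"ospf_areas": [pod], "isis_level": "Level-1"}
    if role == "spine":
        # Spine nodes border the PoD and the PoD-interconnect plane.
        return {"ospf_areas": [0, pod], "isis_level": "Level-1-2"}
    if role == "super-spine":
        # Super-spine nodes live only in the PoD-interconnect plane.
        return {"ospf_areas": [0], "isis_level": "Level-2"}
    raise ValueError(f"unknown role: {role}")


print(igp_scope("spine", pod=1))
```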
In a rail-optimized topology, intra-rail communication with high
bandwidth requirements would be restricted to a single PoD. Inter-
rail communication with lower bandwidth requirements can traverse
PoDs through the PoD-interconnect planes. Therefore, enabling
adaptive routing only in PoD networks is sufficient. In particular,
only leaf nodes are enabled for adaptive routing in their associated
OSPF non-zero area or ISIS Level-1 area.
When a leaf node within a given PoD (i.e., in a given OSPF non-zero
area or ISIS Level-1 area), such as L1 in PoD-1, calculates the
shortest path to a specific IP prefix originated by another leaf node
in the same PoD, say L2 in PoD-1, four equal-cost multi-path (ECMP)
routes will be created via the four spine nodes in the same PoD: S1,
S2, S3, and S4. To enable adaptive routing, weight values based on
the link capacity, or even the available link capacity, of the
upstream and downstream links SHOULD be considered for global load-
balancing. In particular, the smaller of the capacity of the
upstream link (e.g., L1->S1) and the capacity of the downstream link
(e.g., S1->L2) of a given path (e.g., L1->S1->L2) is used as the
weight value of that path.
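Once per-path weights are known, a leaf node still needs some way to
turn them into per-packet next-hop choices. This document does not
specify a scheduler; the Python sketch below uses smooth weighted
round-robin, a well-known technique, purely as one possible
illustration of packet-granular spraying. The function name and the
integer weights (e.g., capacity in units of 100 Gbit/s) are
assumptions for this example.

```python
def smooth_wrr(weights, packets):
    """Pick a next hop for each packet using smooth weighted round-robin.

    weights maps next hop -> positive integer weight.
    Returns the list of chosen next hops, one per packet.
    """
    total = sum(weights.values())
    current = {hop: 0 for hop in weights}
    picks = []
    for _ in range(packets):
        # Every hop earns credit proportional to its weight.
        for hop, weight in weights.items():
            current[hop] += weight
        # The hop with the most credit carries this packet ...
        best = max(current, key=current.get)
        # ... and pays back one full round of credit.
        current[best] -= total
        picks.append(best)
    return picks


# Path weights from L1 to L2 in PoD-1 (illustrative values).
weights = {"S1": 4, "S2": 2, "S3": 1, "S4": 4}
picks = smooth_wrr(weights, packets=11)
```

Over one full round (as many packets as the sum of the weights), each
spine is selected in proportion to its weight, and selections of the
same spine are interleaved rather than sent back to back.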
4. Modifications to OSPF and ISIS Behavior
Once an OSPF or ISIS router is enabled for adaptive routing, the
capacity or even the available capacity of the SPF path SHOULD be
calculated as a weight value for global load-balancing purposes.
When advertising the available link capacity metric alongside the
link capacity metric, it is important to keep adaptive routing
sufficiently stable. To achieve this, a threshold SHOULD be set for
available link capacity fluctuation to avoid frequent LSA or LSP
advertisements. That is, any update that would otherwise be
triggered by an available link capacity fluctuation below that
threshold is suppressed. More specifically, the announcement
suppression mechanisms defined in Sections 5, 6, and 7 of [RFC7810]
can be applied here.
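The threshold idea can be sketched as follows. This is a deliberately
simplified stand-in and not the full set of mechanisms in [RFC7810];
the class name, the absolute (rather than percentage) threshold, and
the sample values are assumptions made for this example.

```python
class AvailableCapacityAdvertiser:
    """Suppress advertisements for small available-capacity changes."""

    def __init__(self, threshold):
        self.threshold = threshold    # minimum change worth advertising
        self.last_advertised = None   # value carried in the last LSA/LSP

    def update(self, measured):
        """Return True if a new LSA/LSP should be originated."""
        first = self.last_advertised is None
        if first or abs(measured - self.last_advertised) >= self.threshold:
            self.last_advertised = measured
            return True
        # Fluctuation below the threshold: keep the old advertisement.
        return False


advertiser = AvailableCapacityAdvertiser(threshold=20.0)
samples = [250.0, 245.0, 260.0, 225.0, 230.0]
decisions = [advertiser.update(value) for value in samples]
```

Note that the comparison is made against the last advertised value,
not the previous sample, so a slow drift is eventually advertised once
it accumulates past the threshold.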
5. Acknowledgements
TBD.
6. IANA Considerations
TBD.
7. Security Considerations
TBD.
8. References
8.1. Normative References
[RFC1195] Callon, R., "Use of OSI IS-IS for routing in TCP/IP and
dual environments", RFC 1195, DOI 10.17487/RFC1195,
December 1990, <https://www.rfc-editor.org/info/rfc1195>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC2328] Moy, J., "OSPF Version 2", STD 54, RFC 2328,
DOI 10.17487/RFC2328, April 1998,
<https://www.rfc-editor.org/info/rfc2328>.
[RFC5340] Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF
for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008,
<https://www.rfc-editor.org/info/rfc5340>.
8.2. Informative References
[I-D.xu-lsr-flooding-reduction-in-clos]
Xu, X., "Flooding Reduction in CLOS Networks", Work in
Progress, Internet-Draft, draft-xu-lsr-flooding-reduction-
in-clos-01, 21 November 2023,
<https://datatracker.ietf.org/doc/html/draft-xu-lsr-
flooding-reduction-in-clos-01>.
[RFC3630] Katz, D., Kompella, K., and D. Yeung, "Traffic Engineering
(TE) Extensions to OSPF Version 2", RFC 3630,
DOI 10.17487/RFC3630, September 2003,
<https://www.rfc-editor.org/info/rfc3630>.
[RFC5305] Li, T. and H. Smit, "IS-IS Extensions for Traffic
Engineering", RFC 5305, DOI 10.17487/RFC5305, October
2008, <https://www.rfc-editor.org/info/rfc5305>.
[RFC7471] Giacalone, S., Ward, D., Drake, J., Atlas, A., and S.
Previdi, "OSPF Traffic Engineering (TE) Metric
Extensions", RFC 7471, DOI 10.17487/RFC7471, March 2015,
<https://www.rfc-editor.org/info/rfc7471>.
[RFC7810] Previdi, S., Ed., Giacalone, S., Ward, D., Drake, J., and
Q. Wu, "IS-IS Traffic Engineering (TE) Metric Extensions",
RFC 7810, DOI 10.17487/RFC7810, May 2016,
<https://www.rfc-editor.org/info/rfc7810>.
Authors' Addresses
Xiaohu Xu
China Mobile
Email: xuxiaohu_ietf@hotmail.com
Zongying He
Broadcom
Email: zongying.he@broadcom.com
Junjie Wang
Centec
Email: wangjj@centec.com
Hongyi Huang
Huawei
Email: hongyi.huang@huawei.com
Qingliang Zhang
H3C
Email: zhangqingliang@h3c.com
Hang Wu
Ruijie Networks
Email: wuhang@ruijie.com.cn
Yadong Liu
Tencent
Email: zeepliu@tencent.com
Yinben Xia
Tencent
Email: forestxia@tencent.com
Peilong Wang
Baidu
Email: wangpeilong01@baidu.com
Shraddha Hegde
Juniper
Email: shraddha@juniper.net