Network Working Group                                              X. Xu
Internet-Draft                                              China Mobile
Intended status: Standards Track                                   Z. He
Expires: 28 August 2024                                         Broadcom
                                                                 J. Wang
                                                                  Centec
                                                                H. Huang
                                                                  Huawei
                                                                Q. Zhang
                                                                     H3C
                                                                   H. Wu
                                                         Ruijie Networks
                                                                  Y. Liu
                                                                  Y. Xia
                                                                 Tencent
                                                                 P. Wang
                                                                   Baidu
                                                                S. Hegde
                                                                 Juniper
                                                        25 February 2024


                    Fully Adaptive Routing Ethernet
                          draft-xu-lsr-fare-02

Abstract

   Large language models (LLMs) like ChatGPT have become increasingly
   popular in recent years due to their impressive performance in
   various natural language processing tasks.  These models, which often
   have billions or even trillions of parameters, are built by training
   deep neural networks on massive amounts of text data.  However, the
   training process for these models can be extremely resource-
   intensive, requiring the deployment of thousands or even tens of
   thousands of GPUs in a single AI training cluster.  Therefore, three-
   stage or even five-stage CLOS networks are commonly adopted for AI
   networks.  The non-blocking nature of the network becomes
   increasingly critical for large-scale AI models.  Therefore, adaptive
   routing is necessary to dynamically load balance traffic to the same
   destination over multiple ECMP paths, based on network capacity and
   even congestion information along those paths.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].




Xu, et al.               Expires 28 August 2024                 [Page 1]

Internet-Draft                    FARE                     February 2024


Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 28 August 2024.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Solution Description
     3.1.  Adaptive Routing in 3-stage CLOS
     3.2.  Adaptive Routing in 5-stage CLOS
   4.  Modifications to OSPF and ISIS Behavior
   5.  Acknowledgements
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

   Large language models (LLMs) like ChatGPT have become increasingly
   popular in recent years due to their impressive performance in
   various natural language processing tasks.  These models, which often
   have billions or even trillions of parameters, are built by training
   deep neural networks on massive amounts of text data.  However, the
   training process for these models can be extremely resource-
   intensive, requiring the deployment of thousands or even tens of
   thousands of GPUs in a single AI training cluster.  Therefore, three-
   stage or even five-stage CLOS networks are commonly adopted for AI
   networks.  Furthermore, in rail-optimized CLOS topologies with
   standard GPU servers (each forming a high-bandwidth (HB) domain of
   eight GPUs), the Nth GPU of each server in a group of servers is
   connected to the Nth leaf switch, which provides higher bandwidth and
   non-blocking connectivity between the GPUs in the same rail.  In a
   rail-optimized topology, most traffic between GPU servers would
   traverse the intra-rail networks rather than the inter-rail networks.

   The non-blocking nature of the network, especially the network for
   intra-rail communication, becomes increasingly critical for large-
   scale AI models.  AI workloads tend to be extremely bandwidth-hungry,
   and they usually generate a few elephant flows simultaneously.  If
   traditional hash-based ECMP load-balancing is used without any
   optimization, serious congestion and high latency are highly likely
   once multiple elephant flows are routed to the same link.  Since the
   job completion time depends on worst-case performance, serious
   congestion will make model training take longer than expected.
   Therefore, adaptive routing is necessary to dynamically load balance
   traffic to the same destination over multiple ECMP paths, based on
   network capacity and even congestion information along those paths.
   In other words, adaptive routing is a capacity-aware and even
   congestion-aware path selection algorithm.
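
   As a rough illustration of the collision risk described above, the
   following Python sketch computes the probability that at least two
   of a handful of elephant flows are hashed onto the same link.  It
   assumes an idealized hash that places each flow independently and
   uniformly at random over the equal-cost links; the function name and
   the example values are hypothetical and are not taken from this
   document.

```python
from fractions import Fraction


def collision_probability(num_flows: int, num_links: int) -> Fraction:
    """Probability that at least two flows share one of num_links links.

    Assumes each flow is hashed independently and uniformly at random,
    which is an idealization of hash-based ECMP.
    """
    if num_links <= 0:
        raise ValueError("num_links must be positive")
    if num_flows > num_links:
        return Fraction(1)  # pigeonhole: some link must carry two flows
    no_collision = Fraction(1)
    for i in range(num_flows):
        # The (i+1)-th flow must avoid the i links already occupied.
        no_collision *= Fraction(num_links - i, num_links)
    return 1 - no_collision


if __name__ == "__main__":
    # Hypothetical example: 4 elephant flows over 8 equal-cost uplinks.
    p = collision_probability(4, 8)
    print(f"{float(p):.3f}")  # prints 0.590
```

   Under these assumptions, even four flows over eight links collide
   more often than not, which is the motivation for capacity-aware path
   selection.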

   Furthermore, to reduce the congestion risk to the maximum extent, the
   routing should be as granular as possible.  Flow-granular adaptive
   routing still has a certain statistical possibility of congestion.
   Therefore, packet-granular adaptive routing is more desirable,
   although packet spraying causes out-of-order delivery.  A flexible
   reordering mechanism must therefore be put in place (e.g., at the
   egress ToRs or the receiving servers).  Recent optimizations for RoCE
   and newly invented transport protocols as alternatives to RoCE no
   longer require handling out-of-order delivery at the network layer.
   Instead, the message processing layer is used to address it.

   To enable adaptive routing, whether flow-granular or packet-granular,
   it is necessary to propagate network topology information, including
   link capacity and possibly available link capacity (i.e., link
   capacity minus link load), across the CLOS network.  Therefore, it
   seems straightforward to use a link-state protocol such as OSPF or
   ISIS, instead of BGP, as the underlay routing protocol in the CLOS
   network, and to propagate link capacity information and possibly
   available link capacity information by using the OSPF or ISIS TE
   Metric or Extended TE Metric [RFC3630] [RFC7471] [RFC5305] [RFC7810].
   More specifically, the Maximum Link Bandwidth sub-TLV and the
   Unidirectional Utilized Bandwidth sub-TLV could be used for
   advertising the link capacity and available link capacity
   information.
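
   The definition of available link capacity given above (link capacity
   minus link load) can be written as a one-line computation.  The
   sketch below is only an illustration: the function name is
   hypothetical, both inputs are assumed to be expressed in the same
   unit (for example, bytes per second), and flooring the result at
   zero is an assumption made here to tolerate a measured load that
   briefly exceeds the advertised capacity.

```python
def available_link_capacity(max_link_bw: float, utilized_bw: float) -> float:
    """Available capacity = link capacity minus link load, floored at zero."""
    if max_link_bw < 0 or utilized_bw < 0:
        raise ValueError("bandwidth values must be non-negative")
    return max(max_link_bw - utilized_bw, 0.0)


if __name__ == "__main__":
    # Hypothetical 400 Gbit/s link carrying 150 Gbit/s, in bytes per second.
    capacity = 400e9 / 8
    load = 150e9 / 8
    print(available_link_capacity(capacity, load))
```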

   For information on resolving flooding issues caused by link-state
   protocols in large CLOS networks, please refer to the following draft
   [I-D.xu-lsr-flooding-reduction-in-clos].

   Note that while adaptive routing, especially at the packet-granular
   level, can help reduce congestion between switches in the network,
   thereby achieving a non-blocking fabric, it does not address the
   incast congestion issue, which is commonly experienced in last-hop
   switches that are connected to the receivers in many-to-one
   communication patterns.  Therefore, a congestion control mechanism is
   always necessary between the sending and receiving servers to
   mitigate such congestion.

2.  Terminology

   This memo makes use of the terms defined in [RFC2328] and [RFC1195].

3.  Solution Description


3.1.  Adaptive Routing in 3-stage CLOS


       +----+ +----+ +----+ +----+
       | S1 | | S2 | | S3 | | S4 |  (Spine)
       +----+ +----+ +----+ +----+

       +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+
       | L1 | | L2 | | L3 | | L4 | | L5 | | L6 | | L7 | | L8 |  (Leaf)
       +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+


                                  Figure 1

   (Note that the diagram above does not include the connections between
   nodes.  However, it can be assumed that leaf nodes are connected to
   every spine node in their CLOS topology.)

   In a three-stage CLOS network as shown in Figure 1, also known as a
   leaf-spine network, all nodes MAY be in OSPF area zero or ISIS Level-
   2.

   Leaf nodes are enabled for adaptive routing in OSPF area zero or ISIS
   Level-2.

   When a leaf node, such as L1, calculates the shortest path to a
   specific IP prefix originated by another leaf node, say L2, in the
   same OSPF area or ISIS Level-2 area, four equal-cost multi-path
   (ECMP) routes will be created via the four spine nodes: S1, S2, S3,
   and S4.  To enable adaptive routing, weight values based on the link
   capacity or even the available link capacity of the upstream and
   downstream links SHOULD be considered for global load-balancing.  In
   particular, the minimum of the capacity of the upstream link (e.g.,
   L1->S1) and the capacity of the downstream link (e.g., S1->L2) of a
   given path (e.g., L1->S1->L2) is used as the weight value for that
   path when performing weighted ECMP load-balancing.
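
   The weight computation described above can be sketched in Python as
   follows.  This is a minimal illustration rather than a normative
   procedure: the function names, the dictionary layout keyed by
   directed links, and the example capacities (in Gbit/s, with one
   degraded downstream link) are all hypothetical.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # directed link: (from_node, to_node)


def path_weights(
    src_leaf: str,
    dst_leaf: str,
    spines: List[str],
    capacity: Dict[Link, float],
) -> Dict[str, float]:
    """Weight of each leaf->spine->leaf path = min(upstream, downstream)."""
    weights = {}
    for spine in spines:
        up = capacity[(src_leaf, spine)]    # e.g., L1->S1
        down = capacity[(spine, dst_leaf)]  # e.g., S1->L2
        weights[spine] = min(up, down)
    return weights


def traffic_shares(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize path weights into the fraction of traffic per path."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no usable path")
    return {spine: w / total for spine, w in weights.items()}


if __name__ == "__main__":
    spines = ["S1", "S2", "S3", "S4"]
    # Hypothetical capacities in Gbit/s; the S3->L2 link is degraded.
    capacity = {("L1", s): 400.0 for s in spines}
    capacity.update({(s, "L2"): 400.0 for s in spines})
    capacity[("S3", "L2")] = 200.0
    weights = path_weights("L1", "L2", spines, capacity)
    print(weights)
    print(traffic_shares(weights))
```

   In this example the path via S3 receives half the weight of the
   other three paths, because its downstream link is the bottleneck.
   The same function applies to available link capacity if those values
   are supplied instead of the nominal capacities.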

3.2.  Adaptive Routing in 5-stage CLOS

     =========================================
     # +----+ +----+ +----+ +----+           #
     # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
     # +----+ +----+ +----+ +----+           #
     #                                PoD-1  #
     # +----+ +----+ +----+ +----+           #
     # | S1 | | S2 | | S3 | | S4 | (Spine)   #
     # +----+ +----+ +----+ +----+           #
     =========================================

     ===============================     ===============================
     # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
     # |SS1 | |SS2 | |SS3 | |SS4 | #     # |SS1 | |SS2 | |SS3 | |SS4 | #
     # +----+ +----+ +----+ +----+ #     # +----+ +----+ +----+ +----+ #
     #   (Super-Spine@Plane-1)     #     #   (Super-Spine@Plane-4)     #
     =============================== ... ===============================

     =========================================
     # +----+ +----+ +----+ +----+           #
     # | S1 | | S2 | | S3 | | S4 | (Spine)   #
     # +----+ +----+ +----+ +----+           #
     #                                PoD-8  #
     # +----+ +----+ +----+ +----+           #
     # | L1 | | L2 | | L3 | | L4 | (Leaf)    #
     # +----+ +----+ +----+ +----+           #
     =========================================

                                Figure 2

   (Note that the diagram above does not include the connections between
   nodes.  However, it can be assumed that the leaf nodes in a given PoD
   are connected to every spine node in that PoD.  Similarly, each spine
   node (e.g., S1) is connected to all super-spine nodes in the
   corresponding PoD-interconnect plane (e.g., Plane-1).)

   For a five-stage CLOS network as illustrated in Figure 2, each PoD
   consisting of leaf and spine nodes is configured as an OSPF non-zero
   area or an ISIS Level-1 area.  The PoD-interconnect plane consisting
   of spine and super-spine nodes is configured as an OSPF area zero or
   an ISIS Level-2 area.  Therefore, spine nodes play the role of OSPF
   area border routers or ISIS Level-1-2 routers.

   In a rail-optimized topology, intra-rail communication with high
   bandwidth requirements would be restricted to a single PoD.  Inter-
   rail communication with lower bandwidth requirements can traverse
   PoDs through the PoD-interconnect planes.  Therefore, enabling
   adaptive routing only in PoD networks is sufficient.  In particular,
   only leaf nodes are enabled for adaptive routing in their associated
   OSPF non-zero area or ISIS Level-1 area.

   When a leaf node within a given PoD (i.e., in a given OSPF non-zero
   area or ISIS Level-1 area), such as L1 in PoD-1, calculates the
   shortest path to a specific IP prefix originated by another leaf node
   in the same PoD, say L2 in PoD-1, four equal-cost multi-path (ECMP)
   routes will be created via the four spine nodes in the same PoD: S1,
   S2, S3, and S4.  To enable adaptive routing, weight values based on
   the link capacity or even the available link capacity of the upstream
   and downstream links SHOULD be considered for global load-balancing.
   In particular, the minimum of the capacity of the upstream link
   (e.g., L1->S1) and the capacity of the downstream link (e.g., S1->L2)
   of a given path (e.g., L1->S1->L2) is used as the weight value for
   that path when performing weighted ECMP load-balancing.
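
   This document does not specify how a forwarding plane turns such
   per-path weights into forwarding decisions.  Purely as an
   illustration, the Python sketch below shows one possible flow-
   granular approach: a fixed-size bucket table is filled in proportion
   to the weights, and each flow is hashed to a bucket.  The table
   size, the use of SHA-256 as the hash, the 5-tuple layout, and all
   names are assumptions made for this example only.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical flow key: src IP, dst IP, protocol, src port, dst port.
FlowKey = Tuple[str, str, int, int, int]


def build_buckets(weights: Dict[str, float], table_size: int = 64) -> List[str]:
    """Fill a fixed-size table with next hops in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one path needs a positive weight")
    buckets: List[str] = []
    for next_hop, weight in sorted(weights.items()):
        buckets.extend([next_hop] * int(weight * table_size // total))
    # Rounding down may leave a few slots unassigned; give them to the
    # heaviest path (the first one encountered if several are tied).
    heaviest = max(weights, key=weights.get)
    buckets.extend([heaviest] * (table_size - len(buckets)))
    return buckets


def select_next_hop(flow: FlowKey, buckets: List[str]) -> str:
    """Map a flow to a bucket with a deterministic hash of its key."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(buckets)
    return buckets[index]


if __name__ == "__main__":
    # Hypothetical weights for the paths from L1 to L2 via S1..S4 in PoD-1.
    weights = {"S1": 400.0, "S2": 400.0, "S3": 200.0, "S4": 400.0}
    buckets = build_buckets(weights)
    print({hop: buckets.count(hop) for hop in weights})
    flow = ("10.0.0.1", "10.0.1.1", 17, 49152, 4791)
    print(select_next_hop(flow, buckets))
```

   A packet-granular variant would choose a bucket per packet rather
   than per flow, at the cost of the out-of-order delivery discussed in
   Section 1.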

4.  Modifications to OSPF and ISIS Behavior

   Once an OSPF or ISIS router is enabled for adaptive routing, the
   capacity or even the available capacity of each SPF path SHOULD be
   calculated as a weight value for global load-balancing purposes.

   When advertising the available link capacity metric alongside the
   link capacity metric, it is important to keep adaptive routing
   sufficiently stable.  To achieve this, a threshold SHOULD be set for
   available link capacity fluctuations to avoid frequent LSA or LSP
   advertisements.  That is, any update that would otherwise be
   triggered by an available link capacity fluctuation below that
   threshold is suppressed.  More specifically, the announcement
   suppression mechanisms defined in Sections 5, 6, and 7 of [RFC7810]
   can be applied here.
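
   The threshold behavior described above can be sketched as follows.
   This is a simplified illustration and not the procedures of
   [RFC7810]: the class name, the choice to express the threshold as a
   fraction of the link capacity, and the default value are all
   assumptions made for this example.

```python
from typing import Optional


class AvailableCapacityAdvertiser:
    """Suppress re-advertisement of small available-capacity changes."""

    def __init__(self, threshold_fraction: float = 0.1) -> None:
        if not 0.0 <= threshold_fraction <= 1.0:
            raise ValueError("threshold_fraction must be within [0, 1]")
        self.threshold_fraction = threshold_fraction
        self.last_advertised: Optional[float] = None

    def should_advertise(self, link_capacity: float, available: float) -> bool:
        """Return True if a new LSA/LSP should carry the updated value."""
        if self.last_advertised is None:
            self.last_advertised = available  # first value is always sent
            return True
        # Compare against the last advertised value, not the last sample,
        # so that a slow drift eventually crosses the threshold.
        change = abs(available - self.last_advertised)
        if change >= self.threshold_fraction * link_capacity:
            self.last_advertised = available
            return True
        return False


if __name__ == "__main__":
    advertiser = AvailableCapacityAdvertiser(threshold_fraction=0.1)
    # Hypothetical samples of available capacity on a 400 Gbit/s link.
    for sample in [400.0, 390.0, 370.0, 355.0, 350.0]:
        print(sample, advertiser.should_advertise(400.0, sample))
```

   With a 10 percent threshold on a 400 Gbit/s link, only the first
   sample and the sample at 355 Gbit/s trigger an advertisement in this
   run; the other fluctuations stay below 40 Gbit/s relative to the
   last advertised value and are suppressed.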

5.  Acknowledgements

   TBD.

6.  IANA Considerations

   TBD.

7.  Security Considerations

   TBD.

8.  References

8.1.  Normative References

   [RFC1195]  Callon, R., "Use of OSI IS-IS for routing in TCP/IP and
              dual environments", RFC 1195, DOI 10.17487/RFC1195,
              December 1990, <https://www.rfc-editor.org/info/rfc1195>.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC2328]  Moy, J., "OSPF Version 2", STD 54, RFC 2328,
              DOI 10.17487/RFC2328, April 1998,
              <https://www.rfc-editor.org/info/rfc2328>.

   [RFC5340]  Coltun, R., Ferguson, D., Moy, J., and A. Lindem, "OSPF
              for IPv6", RFC 5340, DOI 10.17487/RFC5340, July 2008,
              <https://www.rfc-editor.org/info/rfc5340>.

8.2.  Informative References

   [I-D.xu-lsr-flooding-reduction-in-clos]
              Xu, X., "Flooding Reduction in CLOS Networks", Work in
              Progress, Internet-Draft, draft-xu-lsr-flooding-reduction-
              in-clos-01, 21 November 2023,
              <https://datatracker.ietf.org/doc/html/draft-xu-lsr-
              flooding-reduction-in-clos-01>.

   [RFC3630]  Katz, D., Kompella, K., and D. Yeung, "Traffic Engineering
              (TE) Extensions to OSPF Version 2", RFC 3630,
              DOI 10.17487/RFC3630, September 2003,
              <https://www.rfc-editor.org/info/rfc3630>.

   [RFC5305]  Li, T. and H. Smit, "IS-IS Extensions for Traffic
              Engineering", RFC 5305, DOI 10.17487/RFC5305, October
              2008, <https://www.rfc-editor.org/info/rfc5305>.

   [RFC7471]  Giacalone, S., Ward, D., Drake, J., Atlas, A., and S.
              Previdi, "OSPF Traffic Engineering (TE) Metric
              Extensions", RFC 7471, DOI 10.17487/RFC7471, March 2015,
              <https://www.rfc-editor.org/info/rfc7471>.

   [RFC7810]  Previdi, S., Ed., Giacalone, S., Ward, D., Drake, J., and
              Q. Wu, "IS-IS Traffic Engineering (TE) Metric Extensions",
              RFC 7810, DOI 10.17487/RFC7810, May 2016,
              <https://www.rfc-editor.org/info/rfc7810>.

Authors' Addresses

   Xiaohu Xu
   China Mobile
   Email: xuxiaohu_ietf@hotmail.com


   Zongying He
   Broadcom
   Email: zongying.he@broadcom.com


   Junjie Wang
   Centec
   Email: wangjj@centec.com


   Hongyi Huang
   Huawei
   Email: hongyi.huang@huawei.com


   Qingliang Zhang
   H3C
   Email: zhangqingliang@h3c.com


   Hang Wu
   Ruijie Networks
   Email: wuhang@ruijie.com.cn


   Yadong Liu
   Tencent
   Email: zeepliu@tencent.com


   Yinben Xia
   Tencent
   Email: forestxia@tencent.com


   Peilong Wang
   Baidu
   Email: wangpeilong01@baidu.com


   Shraddha Hegde
   Juniper
   Email: shraddha@juniper.net