Internet DRAFT - draft-agt-rtgwg-dragonfly-routing


Routing Area Working Group                                  D. Afanasiev
Intended status: Informational                                 R. Glebov
Expires: 5 September 2024                                         Yandex
                                                             J. Tantsura
                                                            4 March 2024

                    Routing in Dragonfly+ Topologies


   This document provides an overview of the Dragonfly+ network topology
   and describes a routing implementation for IP networks with
   Dragonfly+ topology that supports non-minimal routing.

Afanasiev, et al.       Expires 5 September 2024
Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Network Design Requirements
   4.  Dragonfly Topology
     4.1.  Dragonfly Topology Overview
     4.2.  Routing and Paths in Dragonfly+
     4.3.  Topology Construction and Graph Wiring
     4.4.  Adaptive Load Balancing
   5.  Routing and Forwarding
     5.1.  Forwarding
     5.2.  Routing
     5.3.  Scalability and Optimizations
     5.4.  Failure handling and convergence
     5.5.  Asymmetry and traffic engineering
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Authors' Addresses

1.  Introduction

   Dragonfly [KIM2008] is a highly scalable, low-diameter,
   cost-efficient network topology that provides high bandwidth and
   large path diversity.  Dragonfly topology was originally designed for
   HPC and supercomputing systems and is now adopted in a growing number
   of supercomputing networks.  Its properties also make it an
   interesting candidate for data center network topology, especially
   the Dragonfly+ variant [SPHINER2017] with leaf-spine intra-group
   topology.  However, building IP networks with Dragonfly+ topology is
   a non-trivial problem because IP networks lack many mechanisms
   traditionally available in HPC interconnection networks.
   Specifically, Dragonfly+ relies heavily on non-minimal routing and
   adaptive load balancing for efficient use of available network
   capacity.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Terminology

   This section introduces the terminology used in this document.

   Group
      Building block of a Dragonfly network: a collection of nodes
      connected by local links.  In practical deployments, routers and
      associated end-points belonging to a group are assumed to be
      compactly

   Local (L) / intra-group link
      Link between routers in the same group.  In Dragonfly+ a group is
      a leaf-spine network (bipartite graph), so local links are always
      between a leaf and a spine.

   Global (G) / inter-group link
      Links between routers from different groups.  Usually long and
      more expensive so it is desirable to minimize the number of global

   Path signature
      Sequence of letters corresponding to types of links in the path,
      e.g.  LGLLGL.

   Local / intra-group network

   Global / inter-group network

   MIN
      Minimal routing.

   VAL
      Randomized non-minimal routing (Valiant load-balanced).

   AR
      Adaptive routing.  The name is misleading because it has nothing
      to do with disseminating reachability information; it is a
      mechanism that maps traffic to already known paths.

   UGAL
      Universal Globally-Adaptive Load-balanced routing.

   UGAL-L
      UGAL using local queue information at the current router node.

   UGAL-G
      UGAL using global information.

   ARN
      Adaptive Routing Notification.

3.  Network Design Requirements

   Network design requirements are largely the same as in [RFC7938].
   The most notable difference is the extensive use of non-minimal

4.  Dragonfly Topology

4.1.  Dragonfly Topology Overview

   Dragonfly topology was introduced by Kim et al. [KIM2008].  It aims
   to decrease the cost and diameter of the network while providing good
   scalability.  Dragonfly is a hierarchical topology that divides
   routers into groups connected by long (inter-group) links in a
   fully-connected global network.  Each group essentially implements a
   high-radix virtual router.  Dragonfly is a direct topology, in which
   every router has a set of terminal connections leading to endpoints
   and a set of topological connections leading to other routers, some
   in the same group and some in other groups.  While the original
   Dragonfly uses a fully-connected intra-group topology, it does not
   prevent the use of other intra-group topologies.  Different
   intra-group topologies produce different Dragonfly "flavors".  The
   inter-group topology is always fully connected.  Dragonfly+, as
   proposed in [SPHINER2017], relies on an extended group topology in
   which intra-group routers are connected as a bipartite graph
   (leaf-spine or Clos-like topology).  Dragonfly+ improves on
   conventional Dragonfly because it can support a significantly larger
   number of hosts.  In addition, Dragonfly+ supports similar or better
   bisection bandwidth for various traffic patterns and requires a
   smaller number of buffers to avoid credit loop deadlocks in lossless
   networks.  Dragonfly+ is an indirect topology in which only leaf
   nodes are connected to endpoints.  TODO: spine sizing.

4.2.  Routing and Paths in Dragonfly+

   In Dragonfly and Dragonfly+ topologies there exists at least one
   direct global link between every pair of groups.  Minimal inter-group
   routes traverse a single global link.  The capacity of minimal routes
   between each pair of groups is lower than the aggregate link capacity
   of hosts in a group.  Therefore, conventional minimal routing is not
   enough to obtain maximal throughput and efficiently support various
   traffic patterns.  [KIM2008] introduces the concept of non-minimal
   adaptive routing.  For Dragonfly+ we can define three priority levels
   of inter-group routes.  The notations "L" and "G" below indicate
   whether the route traverses a local or a global link, respectively.

   1.  High priority: Minimal route (LGL) - a shortest distance route
       which passes through two spine routers using a single global
       link.

   2.  Medium priority: Intermediate spine route (LGGL) - a route which
       traverses an intermediate group, using its spine router, passing
       exactly three spine routers using two global links.

   3.  Low priority: Intermediate leaf route (LGLLGL) - a route which
       traverses an intermediate group using its two spine routers and a
       leaf router, passing exactly four spine routers using two global
       links.

   LGLLGL routes normally appear only when some spines are not connected
   to at least one spine in every other group.  In this case non-minimal
   routes through an intermediate group might need to use different
   ingress and egress spines in the intermediate group.  TODO: discuss
   imbalance, density and LGLLGL routes [WILKE2017].
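
   The three priority levels above can be expressed as a lookup keyed by
   path signature.  The following is a minimal Python sketch, not part
   of the routing design itself; the names RoutePriority,
   path_signature, classify and transit_groups are hypothetical and
   introduced only for illustration.

```python
from enum import IntEnum


class RoutePriority(IntEnum):
    """Priority levels of inter-group routes (lower value = preferred)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


# Signatures of the inter-group routes listed in this section.
SIGNATURE_PRIORITY = {
    "LGL": RoutePriority.HIGH,       # minimal route
    "LGGL": RoutePriority.MEDIUM,    # intermediate spine route
    "LGLLGL": RoutePriority.LOW,     # intermediate leaf route
}


def path_signature(link_types):
    """Build a path signature from an iterable of 'L' / 'G' link types."""
    signature = "".join(link_types)
    if set(signature) - {"L", "G"}:
        raise ValueError(f"unknown link type in {signature!r}")
    return signature


def classify(signature):
    """Return the priority of a route, or None if the signature is not
    one of the three inter-group route types listed above."""
    return SIGNATURE_PRIORITY.get(signature)


def transit_groups(signature):
    """Number of intermediate groups: one less than the number of global
    links, and zero for a path without global links."""
    return max(signature.count("G") - 1, 0)
```

   Any signature outside the three listed types, for example LGLGL,
   yields None here, so a caller can treat it as ineligible.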

4.3.  Topology Construction and Graph Wiring

   One possible implementation is described in [WILKE2017].  TODO:
   describe wiring scheme invariant under group rotation (consistent
   renumbering of all groups by the same offset mod number of groups).

4.4.  Adaptive Load Balancing

   While the routing and forwarding setup described in this document
   makes it possible to propagate reachability information and install
   the forwarding state required for Dragonfly+ topologies, including
   non-minimal paths, it is not enough to use Dragonfly network capacity
   efficiently, especially in the presence of LGLLGL paths.  Efficient
   mapping of traffic to paths in a Dragonfly network cannot be
   described by static mechanisms because ideally we would like to

   *  fill paths starting from high priority

   *  try to move flows from congested paths as a possible reaction to

   This requires dynamic adaptive load balancing and coupling between
   adaptive load balancing and congestion control.  Adaptive load
   balancing MUST be able to work without complete knowledge of network
   link utilization and queue state, since such state can change
   significantly over the period of several RTTs, and collecting and
   distributing global network utilization information often enough in
   any network of practically interesting size is infeasible.  Adaptive
   routing can also work as a complementary failure handling mechanism
   with much faster reaction time than routing convergence.  TODO:
   separate document describing possible adaptive load balancing
   implementation using existing mechanisms.
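
   As an illustration only, the following Python sketch shows one way a
   router could fill paths starting from high priority while using
   nothing but locally observed queue state.  It is not the mechanism
   this document defers to a separate document.  The Candidate type, the
   queue_depth field, the threshold parameter and the fallback rule are
   all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A known path towards a destination (hypothetical structure)."""
    signature: str     # path signature, e.g. "LGL"
    priority: int      # 1 = high, 2 = medium, 3 = low
    queue_depth: int   # locally observed egress queue occupancy (assumed)


def select_path(candidates, threshold):
    """Pick the highest-priority path whose local queue is below the
    threshold; if every path is congested, pick the least loaded one."""
    if not candidates:
        raise ValueError("no candidate paths")
    uncongested = [c for c in candidates if c.queue_depth < threshold]
    if uncongested:
        # Fill paths starting from high priority; break ties on load.
        return min(uncongested, key=lambda c: (c.priority, c.queue_depth))
    # Assumed fallback: all paths congested, so minimize local queue depth.
    return min(candidates, key=lambda c: (c.queue_depth, c.priority))


paths = [
    Candidate("LGL", priority=1, queue_depth=90),
    Candidate("LGGL", priority=2, queue_depth=10),
    Candidate("LGLLGL", priority=3, queue_depth=0),
]
chosen = select_path(paths, threshold=50)
```

   With the sample values above the minimal path exceeds the threshold,
   so the sketch selects the LGGL path.  A real implementation would
   also need the coupling with congestion control described in this
   section, which this sketch does not model.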

5.  Routing and Forwarding

   This section describes a routing design supporting non-minimal paths.
   It uses only existing mechanisms: VRFs, route leaking, and EBGP as
   the routing protocol.  EBGP is chosen for scalability and
   flexibility: routing policies and communities make it possible to
   implement additional logic and precisely control propagation of
   routing updates.  The routing design is based on the following
   principles:

   *  intra-group traffic MUST use minimal routing because a group in
      Dragonfly+ is just a leaf-spine network

   *  a path can contain at most one transit group

   *  transit spine(s) MUST use shortest path forwarding to avoid
      forwarding loops

   *  LGLLGL paths require traffic reflection via leaves in the transit
      group but only appear if the number of uplinks per spine is less
      than the number of remote groups

5.1.  Forwarding

   To achieve the desired forwarding behavior, several VRFs are
   configured on every spine:

   *  local VRF in each group containing local links

   *  core VRF containing all global links

   An additional VRF serving as a virtual link is configured if the
   network uses LGLLGL paths: a "reflect" VRF in each group containing
   local links.  Since both the local VRF and the reflect VRF include
   leaf-spine links, some form of VRF multiplexing over leaf-spine links
   is required when LGLLGL paths are used.

   Local VRF
      *  imports minimal and non-minimal paths from the core VRF and
         installs them

   Core VRF
      *  imports locally originated paths from the local VRF in each
         group

      *  imports transit paths from the reflect VRF

   Reflect VRF
      *  imports minimal paths from the core VRF

5.2.  Routing

   Each group is in a separate AS.  Communities, routing policies and
   update propagation:

   *  When announcing a route originated in the local group towards
      other groups, add community C1

   *  When propagating an announcement with community C1, add community
      C2

   *  Do not propagate updates with community C2

   *  Import routes with C1 and C2 into local VRFs

   *  Import routes with C1 only into reflect VRFs, add community C3

   *  Import routes with C3 from reflect VRFs into core VRF

   During import into local VRFs, prepend the AS_PATH:

   *  2 times for routes with C1 only

   *  1 time for routes with C2

   *  do not prepend for routes with C3

   As a result, paths with C1, C2 and C3 will all have the same AS_PATH
   length in local VRFs and will be eligible for ECMP.
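
   The prepend rules above can be checked with a short Python sketch.
   It is illustrative only.  The function names are hypothetical, the
   order in which communities are tested (C3, then C2, then C1) is an
   assumption, and the received AS_PATH lengths of 1, 2 and 3 are
   assumed values chosen to be consistent with the equal-length result
   stated above; this document does not specify how the reflected path
   acquires its length.

```python
C1, C2, C3 = "C1", "C2", "C3"


def prepend_count(communities):
    """Number of AS_PATH prepends applied on import into a local VRF.

    Assumption: C3 is tested first and C2 second, because a route may
    carry several of the communities at once.
    """
    if C3 in communities:
        return 0
    if C2 in communities:
        return 1
    if C1 in communities:
        return 2
    raise ValueError("route carries none of C1, C2, C3")


def local_vrf_as_path_len(received_len, communities):
    """AS_PATH length of a route after import into a local VRF."""
    return received_len + prepend_count(communities)


# Assumed received AS_PATH lengths (not specified in this document):
# 1 for a direct announcement, 2 via one transit group, 3 for a
# reflected path.
routes = [
    (1, {C1}),
    (2, {C1, C2}),
    (3, {C1, C3}),
]
lengths = [local_vrf_as_path_len(n, comms) for n, comms in routes]
```

   Under these assumptions all three routes end up with the same AS_PATH
   length in the local VRF, which is the condition for ECMP eligibility
   stated above.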

5.3.  Scalability and Optimizations


5.4.  Failure handling and convergence


5.5.  Asymmetry and traffic engineering

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   This document should not affect the security of the Internet.

8.  References

8.1.  Normative References

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

8.2.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [KIM2008]  Kim, J., Dally, W. J., Scott, S., and D. Abts,
              "Technology-Driven, Highly-Scalable Dragonfly Topology",
              2008.

   [SPHINER2017]
              Shpiner, A., Haramaty, Z., Eliad, S., Zdornov, V., Gafni,
              B., and E. Zahavi, "Dragonfly+: Low Cost Topology for
              Scaling Datacenters", February 2017.

              Flajslik, M., Borch, E., and M. A. Parker, "Megafly: A
              Topology for Exascale Systems", May 2018.

   [WILKE2017]
              Wilke, J. J., Rumley, S., and M. Y. Teh, "Design space
              exploration of the Dragonfly topology", 2017.

              Singh, A., "Load-balanced routing in interconnection
              networks", 2005.

Authors' Addresses

   Dmitry Afanasiev

   Roman Glebov

   Jeff Tantsura

