Topology-aware Collective Communication in In-Network Computing Enabled
              Network: Problem Statement and Requirements


   This document analyses the mapping between the logical and physical
   topologies of collective communication in In-Network Computing (INC)
   enabled networks, as well as the impact of topology-aware collective
   communication algorithms on INC enabled large-scale computing
   clusters.  It also proposes requirements for designing an efficient
   mapping mechanism between logical and physical topologies and for
   designing topology-aware collective communication algorithms.

1.  Introduction

   Large-scale supercomputing systems have grown significantly in recent
   years.  At the heart of these systems are compute nodes based on
   modern multi-core architectures and high-speed networks.  These
   systems offer vast amounts of computing power and resources to
   application developers and allow scientific applications to scale
   out to tens of thousands of processes.

   These processes rely on the Message Passing Interface (MPI) to
   exchange information and carry out parallel computation.  The
   hardware network is a physical network, while the communication
   between processes, which is independent of the hardware devices, is
   abstracted as a logical network.  An important aspect of
   communication in parallel computing is a sound mapping between the
   logical network and the physical network.  When INC is introduced,
   the network hardware can also take part in collective communication,
   which in turn changes the overall communication model.  Therefore,
   in INC enabled large-scale clusters, the mapping rules need to be
   adjusted accordingly.

   In large-scale clusters, network contention can significantly impact
   application performance when the processor allocation is scattered
   across different racks in the cluster.  It is critical to discover
   the topology of such clusters and to design collective message
   exchange algorithms that are aware of the topology in order to
   improve the overall performance of real-world applications.  After
   INC is introduced, the topology discovery algorithm should not be
   limited to factors such as network structure and bandwidth, but
   should also consider factors such as INC capacities and computational
   load.

2.  Conventions Used in This Document

2.1.  Terminology

   INC In-Network Computing

   MPI Message Passing Interface

2.2.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Problem Statement

   In traditional mode, computing tasks are completed by the computers
   and servers in the cluster.  After INC is enabled, some of these
   tasks are offloaded to network devices.  As a result, for the same
   MPI primitive, the communication subjects in the logical topology
   can be mapped not only to computers but also to network devices.  In
   addition, an INC-based implementation of certain MPI primitives may
   use a logical topology that differs from the traditional one.
   Current topology mapping mechanisms take neither of these changes
   into account.
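
   As an illustration only, the following Python sketch shows one way a
   mapping step could treat network devices as valid targets.  The
   PhysicalNode type, the "worker" and "aggregator" role names, and the
   map_logical_to_physical function are hypothetical names introduced
   for this example; they are not defined by MPI or by this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalNode:
    name: str
    kind: str                  # "host" or "switch"
    inc_capable: bool = False  # True if the device can run INC operations


def map_logical_to_physical(logical_roles, physical_nodes):
    """Map each logical subject to a physical node (illustrative only).

    logical_roles maps a subject name to "worker" or "aggregator".
    Workers are placed on hosts.  An aggregator is placed on the first
    INC-capable switch if one exists, otherwise on a host, which
    corresponds to the traditional mode.
    """
    hosts = [n for n in physical_nodes if n.kind == "host"]
    inc_switches = [n for n in physical_nodes
                    if n.kind == "switch" and n.inc_capable]
    mapping = {}
    next_host = 0
    for subject, role in logical_roles.items():
        if role == "aggregator" and inc_switches:
            mapping[subject] = inc_switches[0]
            continue
        if next_host >= len(hosts):
            raise ValueError("not enough hosts for subject " + subject)
        mapping[subject] = hosts[next_host]
        next_host += 1
    return mapping
```

   With an INC-capable switch present, the aggregator subject lands on
   the switch; without one, the same logical topology is mapped onto
   hosts only.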

   How to use topology-aware algorithms to improve the communication
   performance of MPI primitives and reduce communication costs in
   large-scale clusters has been an active research direction.
   [TopoIB] presents efficient topology-aware algorithms for two
   collective communication primitives, Scatter and Gather, and
   proposes a communication model to analyze the communication overhead
   in large-scale clusters.  [Themis] proposes a new scheduling
   mechanism and topology-aware algorithm aimed at improving network
   bandwidth utilization, and verifies that the network bandwidth
   utilization of a single AllReduce operation can be increased by 1.72
   times.  However, when INC is enabled, topology detection algorithms
   should not be limited to network characteristics such as bandwidth
   and communication overhead; they should also consider the computing
   and processing capabilities of network devices.
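
   The following Python sketch illustrates, under stated assumptions,
   how a selection step might weigh INC capability and load alongside a
   network factor.  The SwitchState fields, their units, and the
   selection rule are hypothetical and chosen only for illustration;
   this document does not define a metric.

```python
from dataclasses import dataclass


@dataclass
class SwitchState:
    name: str
    link_bandwidth_gbps: float  # network factor
    inc_capacity: float         # hypothetical unit: reductions per second
    inc_load: float             # fraction of INC capacity in use, 0.0 to 1.0


def available_inc(sw):
    """INC capacity still free on the switch."""
    return sw.inc_capacity * max(0.0, 1.0 - sw.inc_load)


def pick_aggregation_switch(switches):
    """Return the switch with the most free INC capacity.

    Ties are broken by link bandwidth.  Returns None when no switch has
    free INC capacity, in which case the caller stays in traditional mode.
    """
    candidates = [sw for sw in switches if available_inc(sw) > 0.0]
    if not candidates:
        return None
    return max(candidates,
               key=lambda sw: (available_inc(sw), sw.link_bandwidth_gbps))
```

   In this sketch a switch with high bandwidth but no INC capacity is
   never selected, and a heavily loaded INC switch loses to a lightly
   loaded one even if its nominal capacity is higher.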

   Hence, several problems are raised:

   * How should the subjects of the logical communication topology be
   mapped onto the subjects of the INC enabled physical network?

   * How will enabling INC change the logical network topologies of MPI
   primitives and what challenges will it bring?

   * How can the topology of an INC enabled large-scale cluster be
   discovered efficiently?

   * What are the challenges involved in designing efficient collective
   algorithms that are aware of the INC enabled network topology?

4.  Requirements

   The topology mapping algorithm between logical and physical networks
   in large-scale clusters enabled by INC, as well as the topology-aware
   collective communication algorithms used to enhance cluster
   communication, need to meet the following requirements:

   * Communication entities in INC enabled large-scale clusters MUST
   support mapping both to computing nodes and to network devices in
   the physical network.

   * After INC is introduced, the logical communication pattern may
   change.  An MPI primitive, for example AllReduce, may correspond to
   one or more logical topologies that support INC.  However, any
   implementation based on an INC-supporting logical topology MUST
   produce computation results equivalent to those of the traditional
   implementation.

   * Topology detection algorithms in INC enabled large-scale clusters
   need to consider not only network factors such as communication
   overhead and path bandwidth, but also the INC capability and
   computational load of network devices, such as SINC [SINC].

   * The topology-aware collective communication algorithm SHOULD
   consider the network path load as well as the impact of background
   traffic on cluster communication performance in INC enabled large-
   scale clusters.

   * A reasonable evaluation model for INC enabled large-scale clusters
   is REQUIRED, taking into account factors such as the connectivity
   status and computing capabilities of network devices.

   * The topology mapping algorithm and topology detection algorithm
   SHOULD support a fallback mechanism that, after an INC failure,
   remaps the logical network to the traditional mode and performs path
   detection.
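
   The result-equivalence and fallback requirements above can be
   sketched as follows.  This is a minimal Python model, not an MPI
   implementation: the function names, the "groups" argument (the ranks
   attached to each leaf switch), and the inc_available flag are
   assumptions made for this example.  Integer values are used so that
   both summation orders give identical results; with floating-point
   data the reduction order can change rounding.

```python
def allreduce_host_only(rank_values):
    """Traditional mode: the sum is computed by the hosts themselves."""
    total = 0
    for value in rank_values:
        total += value
    # Every rank receives the same reduced value.
    return [total] * len(rank_values)


def allreduce_inc(rank_values, groups, inc_available=True):
    """INC mode: leaf switches reduce their ranks, a root switch combines.

    groups lists, per leaf switch, the rank indices attached to it.
    When INC is unavailable, fall back to the traditional mode.
    """
    if not inc_available:
        return allreduce_host_only(rank_values)
    # Each leaf switch produces a partial sum for its attached ranks.
    partials = [sum(rank_values[r] for r in group) for group in groups]
    # The root switch combines the partial sums and broadcasts the result.
    total = sum(partials)
    return [total] * len(rank_values)
```

   A deployment could run such a comparison as a conformance check:
   the INC path, the fallback path, and the traditional path are
   expected to return the same values for the same input.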

7.  Informative References

   [SINC]     Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
              "Signaling In-Network Computing operations (SINC)", Work
              in Progress, Internet-Draft, draft-lou-rtgwg-sinc-00, 7
              June 2023.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [Themis]   Rashidi, S., Won, W., Srinivasan, S., et al., "Themis: A
              Network Bandwidth-Aware Collective Scheduling Policy for
              Distributed Training of DL Models", May 2021.

   [TopoIB]   Kandalla, K. C., Subramoni, H., Vishnu, A., et al.,
              "Designing topology-aware collective communication
              algorithms for large scale InfiniBand clusters: Case
              studies with Scatter and Gather", December 2010.

Authors' Addresses

   Shiping Xu
   China Mobile

   Kehan Yao
   China Mobile

