Transport Area Working Group                                      K. Yao
Internet-Draft                                                     S. Xu
Intended status: Informational                              China Mobile
Expires: 25 April 2024                                             Y. Li
                                                                H. Huang
                                                     Huawei Technologies
                                                             D. KUTSCHER
                                                       HKUST (Guangzhou)
                                                         23 October 2023


 Collective Communication Optimization: Problem Statement and Use cases
         draft-yao-tsvwg-cco-problem-statement-and-usecases-00

Abstract

   Collective communication is the basic logical communication model
   for distributed applications.  As distributed systems scale,
   communication overhead becomes the bottleneck of the entire system
   and prevents overall performance from increasing.  This draft
   describes the performance challenges that arise when collective
   communication is employed with more nodes or processes
   participating, or with a larger number of communication rounds
   required to complete a single job.  The document also presents
   several use cases in which different aspects of collective
   communication optimization are needed.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."




Yao, et al.               Expires 25 April 2024                 [Page 1]

Internet-Draft  Collective Communication Optimization: P    October 2023


   This Internet-Draft will expire on 25 April 2024.

Copyright Notice

   Copyright (c) 2023 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . .   4
     2.1.  Distributed Training  . . . . . . . . . . . . . . . . . .   4
     2.2.  High-Performance Computing  . . . . . . . . . . . . . . .   5
     2.3.  Distributed Storage Systems . . . . . . . . . . . . . . .   6
     2.4.  Big Data Analysis(MapReduce)  . . . . . . . . . . . . . .   7
   3.  Problem Statement . . . . . . . . . . . . . . . . . . . . . .   8
     3.1.  Collective Message Transport Issues . . . . . . . . . . .  10
       3.1.1.  Reliability . . . . . . . . . . . . . . . . . . . . .  10
       3.1.2.  Semantic Gap Between Message and Packet . . . . . . .  12
       3.1.3.  Blocking and Non-blocking Communications  . . . . . .  14
     3.2.  Control and Management Plane Issues . . . . . . . . . . .  14
       3.2.1.  In-Network Computing Primitives . . . . . . . . . . .  14
       3.2.2.  Topology Awareness  . . . . . . . . . . . . . . . . .  14
     3.3.  One to Group Transmission Issues  . . . . . . . . . . . .  15
   4.  Security Considerations . . . . . . . . . . . . . . . . . . .  15
   5.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  16
   6.  References  . . . . . . . . . . . . . . . . . . . . . . . . .  16
     6.1.  Normative References  . . . . . . . . . . . . . . . . . .  16
     6.2.  Informative References  . . . . . . . . . . . . . . . . .  17
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  17

1.  Introduction

   Collective communication is the basic logical communication model
   for distributed applications such as AI model training and
   inference, high-performance computing, big data analysis, and
   distributed storage.  It defines several inter-process
   communication modes upon which modern high-performance distributed
   systems can be built.  Several existing open standards and
   programming models support




   collective communication, such as the Message Passing
   Interface (MPI) and Partitioned Global Address Space (PGAS)
   languages.  However, these programming models operate only at the
   application level and do not take the capabilities of the
   underlying network into account.  Existing implementations of
   collective communication employ point-to-point (P2P) transmission
   at the network layer in most cases.  Since collective communication
   by nature involves one or more senders and one or more receivers,
   building it on plain point-to-point transmission inevitably
   introduces significant communication overhead in these distributed
   systems.  A more efficient communication mechanism, designed
   specifically for collective communication, is needed.

   One common implementation of collective communication is based on
   the Remote Direct Memory Access (RDMA) mechanism.  InfiniBand (IB)
   networks natively support RDMA and are the state-of-the-art
   networking solution for collective communication.  To improve
   collective communication performance, IB provides enhancements
   such as SHArP (Scalable Hierarchical Aggregation and Reduction
   Protocol) [SHArP] and hardware multicast [Hardware_Multicast],
   which support collective operation offloading, saving bandwidth
   and reducing communication latency.

   However, Ethernet-based RDMA does not support such capabilities.
   Since Ethernet is the most widely used link-layer protocol,
   Ethernet-based RDMA should evolve to support collective
   communication offloading.  Optimizations in transport protocols
   and in application-network co-design are also needed to make
   Ethernet-based RDMA more suitable for collective communication.
   This work should be considered in the IETF.

   This draft first presents four different use cases to illustrate
   where collective communication is used and how optimized collective
   communication can improve distributed application performance.  The
   draft then points out the fundamental reason that existing
   protocols cannot meet the high-performance requirements of
   collective communication: these distributed applications are not
   co-designed with the underlying networking protocols.  There is a
   semantic gap between inter-process message transport and packet
   forwarding, which should be bridged by efficient mapping and
   optimization.  The draft further analyzes the problem from three
   perspectives.  First, current end-to-end transport protocols do
   not support efficient offloading of collective operations.
   Second, the control and management plane does not support specific
   extensions required by collective operations, such as topology
   awareness for network offloading.  Third, current implementations
   of collective communication cannot make use of existing IP
   multicast protocols, which could reduce communication overhead and
   save bandwidth.





2.  Use Cases

2.1.  Distributed Training

   Large Language Models (LLMs) such as ChatGPT have had a phenomenal
   impact on the industry, driving the digital and intelligent
   transformation of the whole society.  Foundation models usually
   have trillions of parameters and inevitably need to be deployed in
   a distributed manner for model training and model inference.  In
   transformer-based foundation models, a commonly used training
   architecture is Mixture of Experts (MoE).  MoE relies on two basic
   collective operations, AlltoAll and Allreduce.

   In the AlltoAll phase, the Gate deployed on each device passes
   intermediate results to the Feed-Forward Networks (FFNs) deployed
   on other devices, and the FFNs then pass their computation results
   to the "Add & Normalization" block for the next computation step.
   AlltoAll transmission follows a group-to-group communication
   pattern, which incurs heavy bandwidth contention and can be
   optimized.

   In the Allreduce phase, the gradients generated on each device need
   to be sent to a central node for aggregation.  An incast problem
   may occur if the number of distributed nodes is large and the
   gradient messages take up too much bandwidth.  Offloading the
   Allreduce operation to network devices can save a large amount of
   transmission bandwidth, halve the transmission distance, and reduce
   latency.
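   The semantics of the two collectives used by MoE can be illustrated
   with a small host-side sketch.  This is a minimal simulation of the
   communication pattern only (no real network is involved; the
   function names alltoall and allreduce below are illustrative, not
   part of any standard API):

```python
# Minimal simulation of the two MoE collectives over N "devices".
# Each device's local data is modeled as a plain Python list.

def alltoall(blocks):
    """blocks[i][j] is the block device i sends to device j.
    Returns received, where received[j][i] = blocks[i][j]."""
    n = len(blocks)
    return [[blocks[i][j] for i in range(n)] for j in range(n)]

def allreduce(vectors):
    """Element-wise SUM across devices; every device gets the result."""
    summed = [sum(col) for col in zip(*vectors)]
    return [list(summed) for _ in vectors]

# Example: 3 devices exchange per-expert blocks, then reduce gradients.
sent = [[f"d{i}->d{j}" for j in range(3)] for i in range(3)]
received = alltoall(sent)                    # received[1][0] == "d0->d1"
grads = allreduce([[1, 2], [3, 4], [5, 6]])  # every device gets [9, 12]
```

   Note that AlltoAll is inherently group-to-group (every device both
   sends and receives N blocks), which is why it creates the bandwidth
   contention discussed above.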


























              DeVice0               ...               DeVice1
      +--------------------+                  +--------------------+
      |                    |      DeVice0     |                    |
      |  +---------------  |        ...       |  +---------------  |
      |  |self attention|  |      DeViceN     |  |self attention|  |
      |  +--------------+  |                  |  +--------------+  |
      |  +--------------+  |                  |  +--------------+  |
      |  |  Add & Norm  |  |                  |  |  Add & Norm  |  |
      |  +--------------+  |                  |  +--------------+  |
      |  +--------------+  |                  |  +--------------+  |
      |  |   +------+   |  |                  |  |   +------+   |  |
      |  |   | Gate | <------+                +----->+ Gate |   |  |
      |  |   +------+   |  | |   +--------+   |  |   +------+   |  |
      |  |   +------+   |  | +-->+AlltoAll+<--+  |   +------+   |  |
      |  |   | FFN0 | <------+   +--------+   +----->+ FFN0 |   |  |
      |  |   +------+   |  | |                |  |   +------+   |  |
      |  +--------------+  | |   +--------+   |  +--------------+  |
      |  +--------------+  | +-->+AlltoAll+<--+  +--------------+  |
      |  |  Add & Norm  +<---+   +--------+   +->+  Add & Norm  |  |
      |  +--------------+  |                  |  +--------------+  |
      |         |          |                  |         |          |
      +---------|----------+                  +---------|----------+
                |                                       |
                |                +---------+            |
                +--------------> |Allreduce| <----------+
                                 +---------+

                                 Distributed

                 Figure 1: Collective communication in MoE

2.2.  High-Performance Computing

   The basis for HPC is parallel computing.  In modern HPC clusters,
   parallel computing is usually implemented by having many CPU cores
   work on a task concurrently, which leads to higher performance.
   During parallel computing, messages are passed between processes
   running on different CPU cores, and this is where several
   collective operations are needed.  In main-worker mode, Allreduce
   happens when the main node gathers messages from several workers
   and computes the aggregation result.












                                          +-------+
                                +-------> |worker0|
                                |         +-------+
                                |
                                |
                                | Allreduce
                                |
                                |
                       +----+   |         +-------+
                       |Main| <---------> |worker1|
                       +----+   |         +-------+
                                |
                                |
                                | Allreduce
                                |
                                |         +-------+
                                +-------> |worker2|
                                          +-------+

              Figure 2: Main-worker Allreduce in HPC clusters
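   The main-worker pattern above can be sketched as a gather, a
   reduction at the main node, and a broadcast of the result.  This is
   a host-side model only, under the assumption of a SUM reduction:

```python
# Main-worker Allreduce: the main node gathers one vector per worker,
# reduces them element-wise with SUM, and sends the result back to
# every worker.

def main_worker_allreduce(worker_vectors):
    # Gather + reduce at the main node.
    result = [sum(vals) for vals in zip(*worker_vectors)]
    # Broadcast: every worker receives a copy of the reduced vector.
    return {w: list(result) for w in range(len(worker_vectors))}

out = main_worker_allreduce([[1, 1], [2, 2], [3, 3]])
# out[0] == out[1] == out[2] == [6, 6]
```

   In this pattern every worker's vector crosses the network twice
   (gather, then broadcast), which is the traffic that in-network
   aggregation aims to reduce.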

2.3.  Distributed Storage Systems

   Collective operations such as broadcast are also used in
   distributed storage systems.  Primary servers perform data
   operations such as replication and modification and broadcast the
   corresponding messages to backup servers.  Currently, the broadcast
   operation is implemented using commodity RDMA Network Interface
   Cards (RNICs) with unicast operations, and the performance of
   RDMA-based data replication is limited by two drawbacks of unicast
   traffic: data redundancy and independent replication states.  On
   the one hand, data redundancy across multiple replications wastes a
   large amount of bandwidth on the network links to replica servers.
   On the other hand, the independent replication states burden the
   CPU, which must independently post the request, poll the
   completion, and track the delivery status of each unicast
   replication, incurring considerable overhead during transmission.
















                                 +-------+
                                 |Primary|
                                 +---+---+
                                     |
                                     |    Replica
                                     |   broadcast
                                     v
                                  +--+---+
                      +-----------+Switch+----------+
                      |           +--+---+          |
                      |              |              |
                      |              |              |
                      |              |              |
                      v              v              v
                   +--+---+       +--+---+      +---+--+
                   |Backup|       |Backup|      |Backup|
                   +------+       +------+      +------+

             Figure 3: Broadcast in distributed storage systems
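   The bandwidth redundancy of unicast replication can be quantified
   with a trivial sketch: with unicast, the primary's uplink carries
   one full copy of the block per backup, whereas with a
   switch-assisted broadcast it carries only one copy.  The numbers
   below are illustrative:

```python
# Bytes carried on the primary-to-switch uplink when replicating one
# data block to num_backups backup servers.

def uplink_bytes(block_size, num_backups, multicast):
    # Unicast: one copy per backup crosses the shared uplink.
    # Switch-assisted broadcast: one copy; the switch fans out.
    return block_size if multicast else block_size * num_backups

unicast_cost = uplink_bytes(4096, 3, multicast=False)   # 12288 bytes
multicast_cost = uplink_bytes(4096, 3, multicast=True)  # 4096 bytes
```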

2.4.  Big Data Analysis(MapReduce)

   Mainstream distributed big data analysis systems such as Spark also
   use collective operations.  The most significant communication
   bottleneck occurs in the Shuffle phase.  The Shuffle phase involves
   data movement across multiple workers; it was originally
   implemented via TCP/IP sockets for distributed communication among
   processes, which yields low performance.  The Shuffle phase can be
   accelerated via MPI Java bindings, replacing TCP with an MPI-based
   transport for high-performance data movement.























                                 before
                                shuffle
                                  +--+ +--+
                                  |c1| |c1|
                        +-------+ |c1| |a1| after shuffle
                        |worker2| |c1| |b1|
                        +-------+ +--+ +--+

                            ^                     +--+ +--+
                            |                     |a1| |a1|
                            |           +-------+ |a1| |c1|
               +------+     +---------> |worker1| |a1| |b1|
               |Master|                 +-------+ +--+ +--+
               +------+
                                            ^
               +------+                     |
               |Driver|                     |
               +------+                     |
                            +---------------^
                            |
                            |
                            |
                            v     +--+ +--+
                                  |b1| |b1|
                        +-------+ |b1| |a1|
                        |worker3| |b1| |c1|
                        +-------+ +--+ +--+

    Figure 4: Collective operations in Shuffle phase of big data systems
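   The shuffle in Figure 4 is logically an AlltoAll: every worker
   sends each record to the worker that owns the record's key.  A
   minimal sketch of that regrouping (keys a1/b1/c1 and the owner
   mapping are taken from the figure; the helper names are
   illustrative):

```python
from collections import defaultdict

# Shuffle: each worker holds records with mixed keys; after the
# shuffle, all records for a key sit on the worker that owns that key.

def shuffle(partitions, owner):
    """partitions: worker -> list of keys held before the shuffle.
    owner: key -> worker responsible for that key after the shuffle."""
    out = defaultdict(list)
    for records in partitions.values():
        for key in records:
            out[owner(key)].append(key)   # "send" record to its owner
    return dict(out)

before = {"w1": ["a1", "c1", "b1"], "w2": ["c1", "a1"],
          "w3": ["b1", "b1"]}
owner = lambda k: {"a1": "w1", "b1": "w3", "c1": "w2"}[k]
after = shuffle(before, owner)   # e.g. all "b1" records end up on w3
```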

3.  Problem Statement

   The demand for computing resources in AI/HPC applications is
   growing rapidly, and single-node computing power cannot meet it, so
   parallel computing has become a trend.  Parallel computing is a
   type of computing architecture in which several processors
   simultaneously execute multiple smaller calculations broken down
   from an overall larger, complex problem.  Collective communication
   plays a role in data aggregation, data distribution, and
   synchronization in distributed computing systems.  The collective
   communication primitives include collective operations such as
   Reduce, All-Reduce, Bcast, Alltoall, Scatter, and Gather, as well
   as synchronization operations such as Barrier.










    +---------------+--------------+---------------------------------+
    |     Type      |  Function    |         Description             |
    +---------------+--------------+---------------------------------+
    |               |  Bcast       | One to group.                   |
    |               |              | One process sends (broadcasts)  |
    |               |              | some data to all the processes  |
    |               |              | in a group.                     |
    |               +--------------+---------------------------------+
    |               |  Gather      | Group to one.                   |
    |               |              | An array is scattered across    |
    |               |              | all processes in the group, and |
    |               |              | one process (root) collects each|
    |               |              | piece of the array into a       |
    |               |              | specified array.                |
    |               +--------------+---------------------------------+
    |      Data     |  Allgather   | All processes, not just the     |
    |    Movement   |              | root, receive the result of     |
    |               |              | Gather.                         |
    |               +--------------+---------------------------------+
    |               |  Scatter     | One-To-Group.                   |
    |               |              | One process distributes the data|
    |               |              | into n segments, where the i-th |
    |               |              | segment is sent to the i-th     |
    |               |              | process in the group which has  |
    |               |              | n processes.                    |
    |               +--------------+---------------------------------+
    |               |  Alltoall    | This is an extension to         |
    |               |              | Allgather.  Each process sends  |
    |               |              | distinct data to each receiver. |
    |               |              | The j-th block from process i is|
    |               |              | received by process j and stored|
    |               |              | in the i-th block.              |
    +---------------+--------------+---------------------------------+
    |               |  Reduce      | Group to one.                   |
    |               |              | Used to collect data or partial |
    |               |              | results from multiple processing|
    |               |              | units and to combine them into a|
    |               |              | global result by a chosen       |
    |               |              | operator.                       |
    |               +--------------+---------------------------------+
    |               |  All-Reduce  | Distribute the result of a      |
    |     Data      |              | Reduce operation to all         |
    |               |              | processes in the group.         |
    |  Aggregation  +--------------+---------------------------------+
    |               |Reduce-Scatter| Scatter the result of a         |
    |               |              | reduction to all processes.     |
    |               +--------------+---------------------------------+
    |               |  Scan        | A Scan operation performs       |





    |               |              | partial reductions on           |
    |               |              | distributed data.               |
    +---------------+--------------+---------------------------------+
    |Synchronization|  Barrier     | A synchronous operation to      |
    |               |              | synchronize all processes       |
    |               |              | within a communicator.          |
    +---------------+--------------+---------------------------------+

          Figure 5: Collective Communication of Parallel Computing

3.1.  Collective Message Transport Issues

3.1.1.  Reliability

   Traditional transport-layer protocols such as TCP and QUIC provide
   reliable transmission only for point-to-point communication;
   multicast and broadcast communication receive only best-effort
   service.  Therefore, to meet the requirement of reliable
   transmission, implementations of collective operations such as
   Bcast and AlltoAll are often built on point-to-point operations at
   the hosts.

   This wastes host CPU resources on repeated packetization and
   sending, while multiple identical data packets on the same path
   constitute redundant traffic and exacerbate network load.  To solve
   these problems, parallel computing networks need a reliable
   transmission mechanism suitable for collective operations, working
   together with appropriate traffic engineering and congestion
   control mechanisms.

   Collective communication optimization generally uses a network node
   such as a switch to perform a collective operation like Reduce.
   The network node receives data from multiple senders and computes
   the result with the provided operator, e.g., the SUM operator.
   There are two approaches to reliability that ensure the correctness
   of the collective operations.

   The first approach keeps end-to-end reliability.  The intermediate
   node that performs the Reduce is not considered an end point that
   participates in the traditional reliability guarantee.  Referring
   to the figure below, Hosts 1 to 3 are workers for a Reduce
   operation, and Host 4 is the parameter server that collects the
   data from all the workers to perform a SUM computation.  A
   transport session is established between each worker and the
   parameter server.  The switch in the middle performs the in-network
   aggregation.  If any packet is lost between a worker and the switch
   or between the switch and the parameter server, an end-to-end
   reliability mechanism such as retransmission is triggered by the
   end points.  The





   intermediate switch is thus relieved of the burden of maintaining
   transport sessions and most of the associated state.  Consider
   scenarios where the data packets are small, as in some
   high-performance computing workloads: the switch simply keeps a
   single packet from every worker until it can compute the summation
   over all workers, so the state maintained is minimal.  How
   reliability can be ensured efficiently, in terms of fast loss
   detection and signaling, when an intermediate node participates
   should be considered.  In addition to the reliability
   considerations, how to ensure that the data is aggregated at the
   correct intermediate node or nodes should be considered.  In
   encrypted scenarios, ubiquitous connection encryption (e.g., QUIC)
   makes it hard to employ such performance-enhancing functions.

   The second approach is to treat the intermediate switch as an end
   point.  Reliability is guaranteed between each worker and the
   switch and between the switch and the parameter server as two
   independent sessions.  This requires the switch to support the full
   transport function, including session establishment and all state
   maintenance.  Lighter transport session maintenance is expected to
   bring more benefit.  In addition, there should be fallback
   signaling to tolerate faults.  In encrypted scenarios, if network
   devices are required to encrypt and decrypt data, they will need
   substantial resources, maintain a large number of sessions, and
   face issues such as secret key management.

             +-----+
             |Pkt_1|
             +-----+
    +------+
    |Host_1+---------+
    +------+         |
                     |
             +-----+ |                              +-------+
             |Pkt_2| |  +-------------------------+ |Pkt_Agg|
             +-----+ |  |                         | +-------+
    +------+         |  |         Switch          |            +------+
    |Host_2+---------+--> (In-Network Aggregation)+----------->|Host_4|
    +------+         |  |                         |            +------+
                     |  +-------------------------+
             +-----+ |
             |Pkt_3| |
             +-----+ |
    +------+         |
    |Host_3+---------+
    +------+

            Figure 6: In-Network Aggregation Packet Termination
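   The aggregation behavior in Figure 6 under the first reliability
   approach can be sketched as follows.  This is a simplified model,
   assuming a SUM operator and per-sequence-number aggregation; loss
   recovery remains end to end, so a duplicate packet caused by a
   retransmission must be handled idempotently:

```python
# Sketch of in-network aggregation with end-to-end reliability: the
# switch keeps one pending value per worker per sequence number and
# emits a single aggregated packet only once every worker's
# contribution for that sequence number has arrived.

class AggregatingSwitch:
    def __init__(self, workers):
        self.workers = set(workers)
        self.pending = {}            # seq -> {worker: value}

    def receive(self, seq, worker, value):
        slot = self.pending.setdefault(seq, {})
        slot[worker] = value         # retransmission overwrites: idempotent
        if set(slot) == self.workers:
            del self.pending[seq]    # per-seq state freed once aggregated
            return ("Pkt_Agg", seq, sum(slot.values()))
        return None                  # still waiting; nothing forwarded

sw = AggregatingSwitch({"Host_1", "Host_2", "Host_3"})
sw.receive(0, "Host_1", 10)          # None: waiting for the others
sw.receive(0, "Host_2", 20)          # None
sw.receive(0, "Host_2", 20)          # e2e retransmission: harmless
out = sw.receive(0, "Host_3", 12)    # all arrived -> single Pkt_Agg
```

   The per-packet state kept here is exactly the "minimal" state
   discussed above; the model deliberately omits loss detection and
   signaling, which the text identifies as open issues.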





3.1.2.  Semantic Gap Between Message and Packet

   Collective operation offloading utilizes network devices to perform
   computing tasks with low precision requirements and high I/O
   demands, thereby optimizing collective operations.  Devices that
   optimize collective communication not only perform packet routing
   and forwarding but also need to process transport-layer messages.
   Therefore, the transport layer needs to provide a mapping mechanism
   between packets and messages and to carry out message transmission
   at the transport layer.  Existing transport layers cannot provide
   this function.

   Message size impacts the performance of collective operation
   implementations.  Packet size usually has an upper bound; for
   example, an Ethernet jumbo frame is about 9.6 KB, while message
   size has no such limit and is determined by the application.  For
   network devices, processing large messages incurs significant
   overhead, reflected in the need for deep buffers and in pressure on
   the Serialization and Deserialization (SerDes) logic.  In addition,
   large messages reduce the message sending rate at the host and thus
   lower end-to-end system performance.  How to choose an appropriate
   message size and optimize its processing is still an open problem.
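   The message-to-packet mapping described above can be sketched as
   simple fragmentation and reassembly.  This is an illustrative model
   only (the 9.6 KB bound is taken from the jumbo-frame example; the
   (message id, offset) header layout is an assumption):

```python
# A collective-aware transport must map application messages of
# arbitrary size onto packets bounded by the MTU, and reassemble them
# before any message-level (e.g. aggregation) processing can happen.

MTU = 9600  # illustrative per-packet upper bound (jumbo frame)

def fragment(msg_id, payload, mtu=MTU):
    """Split a message into (msg_id, offset, chunk) packets."""
    return [(msg_id, off, payload[off:off + mtu])
            for off in range(0, len(payload), mtu)]

def reassemble(packets):
    """Rebuild the message; packets may arrive reordered."""
    ordered = sorted(packets, key=lambda p: p[1])
    return b"".join(chunk for _, _, chunk in ordered)

msg = bytes(25000)               # message size is set by the app
pkts = fragment(7, msg)          # 3 packets: 9600 + 9600 + 5800 bytes
restored = reassemble(pkts[::-1])  # survives reordering
```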































          Host         Traditional        In-Network         Host
                     Network Devices   Network Devices
    +-------------+                                    +--------------+
    |             |                                    |              |
    | +---------+ |                                    | +----------+ |
    | |   App   | |                  +--------------+  | |   App    | |
    | +---------+ |                  |              |  | +----------+ |
    |             |                  |              |  |              |
   ++-------------+------------------+--------------+--+--------------++
   ||             |             Messages            |  |              ||
   || +---------+ |                  | +----------+ |  | +----------+ ||
   || |Transport| |                  | | Message  | |  | | Transport| ||
   || |  Layer  | |                  | |Processing| |  | |  Layer   | ||
   || +---------+ |                  | +----------+ |  | +----------+ ||
   ||             |                  |              |  |              ||
   ++-------------+------------------+--------------+--+--------------++
    |             |                  |              |  |              |
    |             | +--------------+ |              |  |              |
   ++-------------+-+--------------+-+--------------+--+--------------++
   ||             | |              | |              |  |              ||
   || +---------+ | |              | |              |  | +----------+ ||
   || | Network | | |            Packets            |  | |  Network | ||
   || |  Layer  | | |              | |              |  | |   Layer  | ||
   || |  (IB,   | | |              | |              |  | |   (IB,   | ||
   || | TCP/IP) | | | +----------+ | | +----------+ |  | |  TCP/IP) | ||
   || +---------+ | | |          | | | |          | |  | +----------+ ||
   ||             | | |          | | | |          | |  |              ||
   || +---------+ | | |  Packet  | | | |  Packet  | |  | +----------+ ||
   || |  Link   | | | |Forwarding| | | |Forwarding| |  | |   Link   | ||
   || |  Layer  | | | |          | | | |          | |  | |   Layer  | ||
   || |(IB Link,| | | |          | | | |          | |  | | (IB Link,| ||
   || |   Eth)  | | | +----------+ | | +----------+ |  | |    Eth)  | ||
   || +---------+ | |              | |              |  | +----------+ ||
   ||             | |              | |              |  |              ||
   ++-------------+-+--------------+-+--------------+--+--------------++
    +---------+---+ +-----+----+---+ +----+----+----+  +-----+--------+
              |           ^    |          ^    |             ^
              +-----------+    +----------+    +-------------+

      Figure 7: In-Network computing devices need to process transport
                               layer messages












3.1.3.  Blocking and Non-blocking Communications

   Parallel computing communication operations can be divided into two
   categories, depending on whether a process waits for the
   communication result before proceeding with the next computation
   step: blocking and non-blocking.  A blocking collective operation
   blocks the calling process until the communication is completed.  A
   non-blocking operation allows the communication to be handled by a
   backend, so that communication and computation can overlap.
   Parallel computing networks need to support both communication
   modes simultaneously, and for blocking communication it is all the
   more necessary to use techniques such as in-network computing to
   optimize collective communication performance.
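
   The difference between the two modes can be sketched conceptually
   in Python.  This is not a real MPI binding: fake_allreduce and the
   thread-pool backend are hypothetical stand-ins for a non-blocking
   communication backend in the style of MPI_Iallreduce/MPI_Wait.

```python
# Conceptual sketch (not real MPI): a thread pool stands in for a
# non-blocking communication backend so that communication and
# computation can overlap.  fake_allreduce is a hypothetical name.
from concurrent.futures import ThreadPoolExecutor
import time

def fake_allreduce(data):
    # Pretend the network needs some time to reduce the data.
    time.sleep(0.05)
    return sum(data)

def blocking_step(data):
    # Blocking: the process waits for the result, then computes.
    result = fake_allreduce(data)
    local = sum(x * x for x in data)  # next computation step
    return result, local

def non_blocking_step(data):
    # Non-blocking: communication runs in the backend while the
    # process keeps computing; it waits only at the end.
    with ThreadPoolExecutor(max_workers=1) as backend:
        request = backend.submit(fake_allreduce, data)  # ~MPI_Iallreduce
        local = sum(x * x for x in data)                # overlapped work
        result = request.result()                       # ~MPI_Wait
    return result, local
```

   Both variants return the same values; only the non-blocking variant
   lets the local computation run while the reduction is in flight.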

3.2.  Control and Management Plane Issues

3.2.1.  In-Network Computing Primitives

   After supporting collective operations offloading, the collective
   communication library needs to be extended to support not only the
   end-to-end mode but also end-to-network communication modes, which
   means supporting in-network processing/computing functions.

   The implementation of offloaded collective operations relies on the
   capabilities provided by physical network devices, known as in-
   network computing primitives.  Currently, most in-network computing
   deployments use programmable network devices, programmed with
   languages such as P4 (Programming Protocol-independent Packet
   Processors) and NPL (Network Programming Language).  These
   programmable switches are not powerful enough to express non-
   trivial application-layer behavior, and their platforms are too
   heterogeneous to manage uniformly.  The in-network computing
   primitives that network devices can provide, including data types,
   operations, etc., need to be standardized.  Based on these standard
   primitives, parallel in-network collective communication operators
   can then be implemented.
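
   As an illustration only, a standardized primitive set could be
   modeled as a small, fixed catalog of data types and reduction
   operations that a compliant device advertises, on top of which
   collective operators are composed.  The names and the set below are
   hypothetical, not a proposal:

```python
# Hypothetical model of standardized in-network computing primitives:
# a fixed catalog of operations and data types a switch advertises.
import operator
from functools import reduce

# Illustrative (non-normative) primitive catalog.
PRIMITIVE_OPS = {
    "sum": operator.add,
    "min": min,
    "max": max,
}
PRIMITIVE_TYPES = {"int32", "float32"}

def switch_reduce(op, dtype, contributions):
    """Model of a switch applying one standard reduction primitive,
    element-wise, to the per-port contributions of a collective round."""
    if op not in PRIMITIVE_OPS or dtype not in PRIMITIVE_TYPES:
        raise ValueError("primitive not supported by this device")
    fn = PRIMITIVE_OPS[op]
    return [reduce(fn, column) for column in zip(*contributions)]
```

   A host-side library would then build operators such as Allreduce by
   targeting only this catalog, independent of the switch vendor.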

3.2.2.  Topology Awareness

   In data centers, fat-tree or Clos topologies are widely used; some
   fully meshed network topologies exist as well.













   As the scale of parallel computing clusters continues to increase,
   it is difficult to achieve optimal transmission performance solely
   through the automatic convergence of traditional routing protocols.
   Topology-aware algorithms are needed to discover complex topologies
   and to perform path planning, traffic optimization, and related
   tasks.  However, existing topology-aware algorithms do not consider
   collective operations offloading scenarios and need to be modified
   to realize end-network cooperation, meeting the requirements of
   end-to-end path planning and traffic optimization.
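
   One simple form of topology awareness can be sketched as follows:
   given a (hypothetical) mapping from ranks to racks in a fat-tree
   cluster, plan a two-level reduction that aggregates inside each
   rack first, so that only one flow per rack has to cross the
   oversubscribed core links.  The function name and mapping are
   illustrative assumptions:

```python
# Illustrative topology-aware planning: reduce inside each rack at the
# ToR first, then reduce across rack representatives at the core.
from collections import defaultdict

def plan_two_level_reduce(rank_to_rack):
    """Return (intra_rack_groups, rack_representatives): each rack
    reduces locally, then one representative per rack joins the
    core-level reduction."""
    racks = defaultdict(list)
    for rank, rack in sorted(rank_to_rack.items()):
        racks[rack].append(rank)
    intra = dict(racks)
    roots = [members[0] for members in intra.values()]
    return intra, roots
```

   A topology-unaware plan might instead pair ranks across racks,
   sending every contribution over the core.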

3.3.  One to Group Transmission Issues

   IP multicast was designed to support broadcast-style applications
   such as live streaming and video conferencing.  Several IP
   multicast protocols have been standardized, such as PIM [RFC7761]
   and BIER [RFC8279].  However, existing IP multicast protocols have
   not been fully exploited by distributed application systems that
   require collective communication.  Most collective operations still
   use unicast transmission, which incurs considerable overhead in the
   form of information redundancy, bandwidth occupancy, and host CPU
   consumption.  What is needed is low-latency multi-destination
   delivery, potentially with reliability or at least failure
   detection.  IP multicast is one potential solution (maybe not a
   perfect one), but it is still worth trying.  Even though existing
   IP multicast protocols have their best-suited application scenarios
   and differ in state maintenance and multicast tree construction,
   they are well worth promoting for collective communication
   scenarios.  Collective operations like Bcast and AlltoAll can be
   augmented by extending IP multicast protocols, as can composite
   operations like reduce-scatter and all-gather.
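
   The overhead gap can be seen with a back-of-the-envelope count,
   under the simplifying assumption of one fixed-size message per
   transmission: a unicast-emulated broadcast makes the root send n-1
   copies, while a k-ary multicast tree reaches n nodes in roughly
   log_k(n) forwarding steps, with no node sending more than k copies.

```python
# Back-of-the-envelope comparison of broadcast cost (assumption: one
# fixed-size message per transmission, uniform link capacity).
import math

def unicast_bcast_sends(n):
    # Unicast emulation: the root transmits a separate copy to every
    # other process, so root-side sends grow linearly with n.
    return n - 1

def multicast_bcast_depth(n, k):
    # A k-ary multicast tree covers n nodes in ceil(log_k(n)) hops;
    # each node forwards to at most k children.
    return math.ceil(math.log(n, k))
```

   For a 1000-process job, the root sends 999 unicast copies, whereas
   a binary multicast tree completes in about 10 forwarding steps.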

4.  Security Considerations

   Collective communication optimization may introduce some security and
   privacy concerns, especially when collective operations offloading is
   needed.
















   On one hand, the distributed nature of the computation and the
   involvement of network devices raise issues of data
   confidentiality, integrity, and authentication.  Potential
   vulnerabilities arise when data processed in the network is exposed
   to unauthorized access.  It is suggested to support both security-
   enabled and security-less deployments, so that limited domains
   [RFC8799] do not have to pay the penalty of expensive cryptographic
   or authorization operations.  Within a limited domain, the
   application and the network can mutually trust each other, since
   they may belong to the same administrator.  Extending the
   technology to the Internet should be designed together with
   intrinsic protective measures.

   On the other hand, encrypted data brings challenges and performance
   issues for processing on network devices.  Decrypting and
   encrypting data on network devices is not only inefficient but also
   involves issues such as key management and authorization.
   Processing encrypted data may not be applicable to all encryption
   algorithms and is not suitable for all scenarios.

5.  IANA Considerations

   TBD.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC7761]  Fenner, B., Handley, M., Holbrook, H., Kouvelas, I.,
              Parekh, R., Zhang, Z., and L. Zheng, "Protocol Independent
              Multicast - Sparse Mode (PIM-SM): Protocol Specification
              (Revised)", STD 83, RFC 7761, DOI 10.17487/RFC7761, March
              2016, <https://www.rfc-editor.org/info/rfc7761>.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017,
              <https://www.rfc-editor.org/info/rfc8279>.

   [RFC8799]  Carpenter, B. and B. Liu, "Limited Domains and Internet
              Protocols", RFC 8799, DOI 10.17487/RFC8799, July 2020,
              <https://www.rfc-editor.org/info/rfc8799>.





6.2.  Informative References

   [Hardware_Multicast]
              Liu, J., "Fast and Scalable MPI-Level Broadcast using
              InfiniBand's Hardware Multicast Support",
              DOI 10.1109/ipdps.2004.1302912, 2004,
              <https://doi.org/10.1109/ipdps.2004.1302912>.

   [SHArP]    Graham, R. L., "Scalable Hierarchical Aggregation
              Protocol (SHArP): A Hardware Architecture for Efficient
              Data Reduction", DOI 10.1109/COMHPC.2016.006, 2016,
              <https://doi.org/10.1109/COMHPC.2016.006>.

Authors' Addresses

   Kehan Yao
   China Mobile
   Beijing
   100053
   China
   Email: yaokehan@chinamobile.com


   Shiping Xu
   China Mobile
   Beijing
   100053
   China
   Email: xushiping@chinamobile.com


   Yizhou Li
   Huawei Technologies
   Nanjing, Jiangsu
   China
   Email: liyizhou@huawei.com


   Hongyi Huang
   Huawei Technologies
   Beijing
   China
   Email: hongyi.huang@huawei.com










   Dirk KUTSCHER
   HKUST (Guangzhou)
   Guangzhou
   China
   Email: dku@hkust-gz.edu.cn













































