Internet DRAFT - draft-liu-opsawg-cco-cm-requirement

draft-liu-opsawg-cco-cm-requirement







Operations and Management Area Working Group                      C. Liu
Internet-Draft                                                     S. Xu
Intended status: Informational                              China Mobile
Expires: 1 September 2024                               29 February 2024


   Requirements from Control and Management Viewpoint for Collective
                       Communication Optimization
                 draft-liu-opsawg-cco-cm-requirement-01

Abstract

   Collective communication optimization is crucial to improve the
   performance of distributed applications, due to that communication
   has become bottleneck to degrade applications with the growth of
   scale of distributed systems.  The industry and academy has worked on
   proposing solutions to upgrade collective communication operations.
   However, there has been a problem of lacking for unified guidelines.

   This draft provide requirements on collective communication
   optimization from the control and management viewpoint.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 1 September 2024.







Liu & Xu                Expires 1 September 2024                [Page 1]

Internet-Draft  Operations and Management Area Working G   February 2024


Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Requirements  . . . . . . . . . . . . . . . . . . . . . . . .   3
     2.1.  Memory Management . . . . . . . . . . . . . . . . . . . .   3
     2.2.  Topology Management . . . . . . . . . . . . . . . . . . .   4
     2.3.  Interfaces Management . . . . . . . . . . . . . . . . . .   5
     2.4.  Data Management . . . . . . . . . . . . . . . . . . . . .   6
   3.  Security Considerations . . . . . . . . . . . . . . . . . . .   6
   4.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   6
   5.  References  . . . . . . . . . . . . . . . . . . . . . . . . .   6
     5.1.  Normative References  . . . . . . . . . . . . . . . . . .   6
     5.2.  Informative References  . . . . . . . . . . . . . . . . .   6
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .   7

1.  Introduction

   In recent years, with the development and evolution of various
   applications and business, especially the rapid growth of AI
   applications, distributed computing performance has become more and
   more important and has gradually become a key factor which limits the
   progress of these applications.  Due to the large amount of
   collective communication involved in distributed computing,
   collective communication performance is crucial.  However, there
   exists many problems to be solved for collective communication.  On
   one hand, many collective communication operations implemented by
   message-level communication libraries like MPI and NCCL mainly
   utilize the unicast point-to-point communication mechanism, leading
   to the redundancy of network traffic, the underutilization of network
   resources and waste of network capabilities.  On the other hand,
   since the underlying network protocols and collective communication
   are not co-designed, there is a semantic gap between inter-process
   message transportation and packet forwarding.  Therefore, there is
   huge space for the optimization of collective communication.  At



Liu & Xu                Expires 1 September 2024                [Page 2]

Internet-Draft  Operations and Management Area Working G   February 2024


   present, the industry and academy are also actively promoting the
   development, implementation and deployment of collective
   communication optimization.

   The research group Computing in the Network, COIN for short, also
   focus on this topic.  The goal is mainly to investigate how network
   data plane programmability can improve Internet architecture, with a
   too broad focus scope including network functions offloading, machine
   learning acceleration, in-network caching and in-network control,
   etc.  In addition to the solution of collective operation offloading
   COIN talk about, multicast substituting for single point unicast,
   scheduling tasks and planning transportation paths by topology
   awareness and bridging semantic gap between inter-process message
   transportation and packet forwarding can also play a significant role
   on optimizing collective communication.

   This draft provide some necessary requirements from the network
   control and management viewpoint, combined with the optimization
   solutions of collective communication offloading, multicast
   mechanisms, topology awareness and semantic gap bridge between inter-
   process message transportation and packet forwarding, to guideline
   the standardization work of collective communication optimization.

2.  Requirements

2.1.  Memory Management

   Scarce memory resources provided by network devices for collective
   communication MUST be scheduled and controlled, e.g. assigning a
   scheduling priority to collective communication offloading tasks.
   Compared to the amount of collective communication message in the
   applications such as AI, HPC, etc., it is severely mismatched and
   extremely scarce for memory resource provided by network devices for
   collective communication, such as network programmable switches.

   Use Case[ESA].  The memory of programmable switch is scarce for the
   amount of gradient transmitted in distributed training.  There is
   some existing work to solve this problem like pool-based streaming
   and dynamic sharing, which are not enough yet.  A use case of fully
   utilizing the memory of programmable switch is that the control and
   management module of switch assigns a priority to the aggregation
   task, to dynamically and preemptively schedule the aggregation tasks
   in the data plane, thus making more full use of memory in the form of
   switch aggregators.







Liu & Xu                Expires 1 September 2024                [Page 3]

Internet-Draft  Operations and Management Area Working G   February 2024


             +----------+  +----------+           +----------+
             |          |  |          |           |          |
             | worker 1 |  | worker 2 |           | worker n |
             |          |  |          |  ... ...  |          |
             +----+-----+  +----+-----+           +-----+----+
                  |             |                       |
                  |             |                       |
                  +------+------+-----------------------+
                         |GB/TB-level gradients
             +-----------+-----------+           +-----------+
             |           |           |           |           |
             |  +--------+--------+  |           |           |
             |  |     Switch      |  |           |  Control  |
             |  |   Aggregators   |  |  Manage   |     &     |
             |  +-----------------<--+-----------+ Management|
             |  |Memory for others|  | Schedule  |           |
             |  +-----------------+  |           |           |
             | Switch Memory = 10MB  |           |           |
             +-----------------------+           +-----------+

      Figure 1: The Mismatched between Device Memory and Communication
                                   Volume

2.2.  Topology Management

   Topology awareness and mapping work are REQUIRED to be done to put
   some of the end-host computing on the network nodes for collective
   communication optimization.  In many collective operation tasks, the
   logical relationship between nodes is usually described in the form
   of graph, and then mapping to the physical network.  Therefore,
   collective communication offloading requires awareness of the network
   topology and making efficient mappings.

   Use Case.  In the parameter server architecture commonly used in
   distributed training, the parameter server can be reasonably mapped
   to spine switches in the fat tree physical network with being aware
   of network topology.  Under this mapping mechanism, the traffic path
   is more simplified and the traffic volume of the whole network is
   greatly compressed.  Compared to the traditional collective
   communication mode, the optimized end-to-network or end-to-network-
   to-end one with topology awareness and mapping makes the physical
   topology and the logical topology closer, more friendly and unified.









Liu & Xu                Expires 1 September 2024                [Page 4]

Internet-Draft  Operations and Management Area Working G   February 2024


                    Logical Topology
                   +----------------+
                   |Parameter Server|
                   +--------+-------+
                            |
             +----------+---+------------+
             |          |                |
        +----+---+ +----+---+        +---+----+
        |Worker 1| |Worker 2| ...... |Worker n|
        +--------+ +--------+        +--------+
                            |Mapping
                            |
                 +----------+---------+
                 |Management & Control|
                 |                    |
                 | Topology Awareness |
                 | Paths Planning     |
                 +----------+---------+
                            |
                            |Mapping
                            v
                    Physical Topology
                +-----+         +-----+
                |Spine|         |Spine|
                +--+--+         +--+--+
                   |               |
        +----------+-+------------++-----------+
        |            |            |            |
     +--+--+      +--+--+      +--+--+      +--+--+
     |Leaf |      |Leaf |      |Leaf |      |Leaf |
     +--+--+      +--+--+      +--+--+      +--+--+
        |            |            |            |
     +--+--+      +--+--+      +--+--+      +--+--+
     |     |      |     |      |     |      |     |
   +-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+
   |GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|
   +---+ +---+  +---+ +---+  +---+ +---+  +---+ +---+

             Figure 2: Topology Management and Topology Mapping

2.3.  Interfaces Management

   Some collective communication interfaces MUST be defined and managed
   for application developers to shield tedious network engineering
   details, such as flow control, packet organization, chip-specific
   programming language, etc.  If not, applications developers will need
   too much arcane knowledge and expertise, which is beyond their
   willingness and prevent from the evolution of the emerging



Liu & Xu                Expires 1 September 2024                [Page 5]

Internet-Draft  Operations and Management Area Working G   February 2024


   applications.

   Use case.  The industry and academy have actually proposed some
   abstractions of collective communication operations, such as
   collective communication libraries MPI, NCCL, NetRPC[NetRPC], etc.
   In the control plane, these interfaces need to be configured and
   instantiated to complete the part of collective communication
   functionality.

2.4.  Data Management

   The semantic gap between application data unit and network data unit
   is REQUIRED to be bridged.  This semantic mismatch poses a huge
   obstacle to the support of upper layer applications in the underlying
   network.

   Use Case.  In the distributed training of LLMs, AllReduce, a common
   kind of collective communication operation, involves the
   transportation of large amount of message-level gradients or token
   data, which is orders of magnitude of packets in network in size.
   Because packet in network is designed originally not considering
   large data transmission, there is natural mismatch property between
   network and collective communication.

3.  Security Considerations

   TBD.

4.  IANA Considerations

   TBD.

5.  References

5.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

5.2.  Informative References

   [ESA]      Wang, H., "Efficient Data-Plane Memory Scheduling for In-
              Network Aggregation", 2022.

   [NetRPC]   Zhao, B., "NetRPC: Enabling In-Network Computation in
              Remote Procedure Calls", 2023.



Liu & Xu                Expires 1 September 2024                [Page 6]

Internet-Draft  Operations and Management Area Working G   February 2024


Authors' Addresses

   Chang Liu
   China Mobile
   Beijing
   100053
   China
   Email: liuchangjc@chinamobile.com


   Shiping Xu
   China Mobile
   Beijing
   100053
   China
   Email: xushiping@chinamobile.com



































Liu & Xu                Expires 1 September 2024                [Page 7]