Operations and Management Area Working Group C. Liu
Internet-Draft S. Xu
Intended status: Informational China Mobile
Expires: 25 April 2024 23 October 2023
Requirements from Control and Management Viewpoint for Collective
Communication Optimization
draft-liu-ops-cco-cm-requirement-00
Abstract
Collective communication optimization is a key means of improving the
performance of distributed applications, because communication has
become the bottleneck that degrades applications and services as
distributed systems grow in scale. Industry and academia have
proposed solutions to improve collective communication operations.
However, unified guidelines are still lacking.
This draft provides requirements from the control and management
viewpoint.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 25 April 2024.
Liu & Xu Expires 25 April 2024 [Page 1]
Internet-Draft Operations and Management Area Working G October 2023
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Requirements . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1. Memory Management . . . . . . . . . . . . . . . . . . . . 3
2.2. Topology Management . . . . . . . . . . . . . . . . . . . 4
2.3. Interfaces Management . . . . . . . . . . . . . . . . . . 5
3. Security Considerations . . . . . . . . . . . . . . . . . . . 6
4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 6
5. References . . . . . . . . . . . . . . . . . . . . . . . . . 6
5.1. Normative References . . . . . . . . . . . . . . . . . . 6
5.2. Informative References . . . . . . . . . . . . . . . . . 6
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 6
1. Introduction
In recent years, with the development and evolution of various
applications and services, especially the rapid growth of AI
applications, distributed computing performance has become
increasingly important and is now a key factor limiting the growth
of these applications. Collective communication is the primary
communication mode of current distributed computing systems, so its
performance is crucial. However, many problems remain to be solved
before that performance can improve. On the one hand, many
collective communication operations implemented by message-level
communication libraries such as MPI and NCCL depend mainly on unicast
point-to-point communication, which leads to redundant data in the
network, underutilized network resources and wasted network
capabilities. On the other hand, because the underlying network
protocols and collective communication are not co-designed, there is
a semantic gap between inter-process message transport and packet
forwarding. Therefore, there is considerable room for optimizing
collective communication. At present, industry and academia are
actively promoting the development, implementation and deployment of
collective communication optimization.
The Computing in the Network research group, COIN for short, also
works on this topic. Its main goal is to investigate how network
data plane programmability can improve the Internet architecture.
Its scope is broad and includes network function offloading, machine
learning acceleration, in-network caching and in-network control.
In addition to the collective operation offloading discussed in COIN,
collective communication can also be optimized by using multicast in
place of repeated point-to-point unicast, by scheduling tasks and
planning transport paths with topology awareness, and by bridging
the semantic gap between inter-process message transport and packet
forwarding.
This draft provides requirements from the network control and
management viewpoint. They take into account the optimization
approaches of collective communication offloading, multicast
mechanisms, topology awareness and bridging the semantic gap between
inter-process message transport and packet forwarding, and are
intended to guide the standardization work on collective
communication optimization.
2. Requirements
2.1. Memory Management
The scarce memory resources that network devices provide for
collective communication MUST be scheduled and controlled, e.g. by
assigning a scheduling priority to collective communication
offloading tasks. The memory that network devices such as
programmable switches can provide for collective communication is
extremely scarce and severely mismatched with the volume of
collective communication messages in applications such as AI, big
data systems and HPC.
Use Case [ESA]. The memory of a programmable switch is small
relative to the volume of gradients transmitted in distributed
training. Existing work such as pool-based streaming and dynamic
sharing addresses this problem but is not yet sufficient. One way
to use the switch memory fully is for the control and management
module of the switch to assign a priority to each aggregation task
and to schedule the aggregation tasks in the data plane dynamically
and preemptively, so that the memory allocated to switch aggregators
is used more fully.
+----------+  +----------+           +----------+
|          |  |          |           |          |
| worker 1 |  | worker 2 |           | worker n |
|          |  |          |  ... ...  |          |
+----+-----+  +----+-----+           +-----+----+
     |             |                       |
     |             |                       |
     +------+------+-----------------------+
            |GB/TB-level gradients
+-----------+-----------+           +-----------+
|           |           |           |           |
|  +--------+--------+  |           |           |
|  |     Switch      |  |           |  Control  |
|  |   Aggregators   |  |  Manage   |     &     |
|  +-----------------<--+-----------+ Management|
|  |Memory for others|  | Schedule  |           |
|  +-----------------+  |           |           |
| Switch Memory = 10MB  |           |           |
+-----------------------+           +-----------+
Figure 1: The Mismatch between Device Memory and Communication
Volume
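The priority-based, preemptive scheduling described in this use case
can be illustrated with a minimal Python sketch. It is not the
mechanism of [ESA]. The class AggregatorScheduler, the notion of
fixed-size aggregator "slots", and the rule that only strictly
lower-priority tasks may be preempted are all assumptions made for
illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatorScheduler:
    """Toy control-plane view of a switch's aggregator slots (hypothetical)."""
    total_slots: int
    # task_id -> (priority, slots held); a larger priority is more important
    running: dict = field(default_factory=dict)

    def free_slots(self) -> int:
        return self.total_slots - sum(s for _, s in self.running.values())

    def admit(self, task_id: str, priority: int, slots: int) -> list:
        """Place a task and return the ids of the tasks preempted for it.

        Raises RuntimeError if the task does not fit even after preempting
        every strictly lower-priority task.
        """
        # Candidate victims: strictly lower priority, least important first.
        victims = sorted(
            (t for t, (p, _) in self.running.items() if p < priority),
            key=lambda t: self.running[t][0],
        )
        reclaimable = sum(self.running[t][1] for t in victims)
        if slots > self.free_slots() + reclaimable:
            raise RuntimeError(f"cannot admit {task_id}: not enough memory")

        preempted = []
        for t in victims:
            if self.free_slots() >= slots:
                break
            del self.running[t]  # evicted; what happens next is out of scope
            preempted.append(t)
        self.running[task_id] = (priority, slots)
        return preempted


sched = AggregatorScheduler(total_slots=4)
sched.admit("job-a", priority=1, slots=2)            # fits, nothing preempted
sched.admit("job-b", priority=2, slots=2)            # switch memory is now full
evicted = sched.admit("job-c", priority=3, slots=2)  # evicts "job-a" only
```

In this sketch the lowest-priority task is evicted first and eviction
stops as soon as enough slots are free, so a new task displaces no
more work than it needs.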
2.2. Topology Management
Topology awareness and mapping are REQUIRED in order to place part
of the end-host computation on network nodes for collective
communication optimization. In many collective operation tasks, the
logical relationship between nodes is described as a graph, which is
then mapped onto the physical network. Therefore, collective
communication offloading requires awareness of the network topology
and an efficient mapping.
Use Case. In the parameter server architecture commonly used in
distributed training, a topology-aware mapping can place the
parameter server on the spine switches of a fat-tree physical
network. With this mapping, traffic paths become simpler and the
traffic volume across the whole network is greatly reduced. Compared
with the traditional collective communication mode, the optimized
end-to-network or end-to-network-to-end mode with topology awareness
and mapping brings the physical topology and the logical topology
closer together.
                 Logical Topology
                +----------------+
                |Parameter Server|
                +--------+-------+
                         |
          +----------+---+------------+
          |          |                |
     +----+---+ +----+---+        +---+----+
     |Worker 1| |Worker 2| ...... |Worker n|
     +--------+ +--------+        +--------+
                        |Mapping
                        |
             +----------+---------+
             |Management & Control|
             |                    |
             | Topology Awareness |
             |   Paths Planning   |
             +----------+---------+
                        |
                        |Mapping
                        v
                Physical Topology
             +-----+         +-----+
             |Spine|         |Spine|
             +--+--+         +--+--+
                |               |
     +----------+-+------------++-----------+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |Leaf |      |Leaf |      |Leaf |      |Leaf |
  +--+--+      +--+--+      +--+--+      +--+--+
     |            |            |            |
  +--+--+      +--+--+      +--+--+      +--+--+
  |     |      |     |      |     |      |     |
+-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+  +-+-+ +-+-+
|GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|  |GPU| |GPU|
+---+ +---+  +---+ +---+  +---+ +---+  +---+ +---+
Figure 2: Topology Management and Topology Mapping
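The mapping step in Figure 2 can be sketched in Python as a simple
placement problem. This is only one possible heuristic, not a method
defined by this draft: it picks the switch with the smallest total
hop count to all workers. The function names, the node names, and
the assumption that every leaf connects to every spine are
hypothetical.

```python
from collections import deque


def hop_counts(adj, src):
    """Breadth-first search: hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def place_aggregation_root(adj, switches, workers):
    """Pick the switch with the smallest total hop count to all workers."""
    def cost(switch):
        dist = hop_counts(adj, switch)
        return sum(dist[w] for w in workers)
    # Sorting first makes ties resolve by name, so the result is deterministic.
    return min(sorted(switches), key=cost)


def build_leaf_spine(n_spines=2, n_leaves=4, gpus_per_leaf=2):
    """Two-tier topology as an undirected adjacency map (assumed full mesh
    between leaves and spines)."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for leaf in range(n_leaves):
        for spine in range(n_spines):
            link(f"leaf{leaf}", f"spine{spine}")
        for g in range(gpus_per_leaf):
            link(f"leaf{leaf}", f"gpu{leaf * gpus_per_leaf + g}")
    return adj


adj = build_leaf_spine()
switches = [n for n in adj if not n.startswith("gpu")]
workers = [n for n in adj if n.startswith("gpu")]
root = place_aggregation_root(adj, switches, workers)  # -> "spine0"
```

With workers spread over all four leaves, a spine is two hops from
every worker and is chosen as the aggregation point. If the workers
sit under a single leaf, the same function selects that leaf
instead, which shows why the mapping needs to know the topology.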
2.3. Interfaces Management
Collective communication interfaces MUST be defined and managed so
that application developers are shielded from tedious network
engineering details such as flow control, packet organization and
chip-specific programming languages. Otherwise, application
developers would need extensive specialized knowledge and expertise
that they are unlikely to acquire, which would hinder the evolution
of emerging applications.
Use Case. Industry and academia have already proposed abstractions
of collective communication operations, such as the collective
communication libraries MPI, NCCL and NetRPC [NetRPC]. In the
control plane, these interfaces need to be configured and
instantiated to carry out the corresponding part of the collective
communication functionality.
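As a rough illustration of such an interface, the following Python
sketch separates what the application calls from how the control
plane configures and instantiates a backend. None of these names
come from MPI, NCCL or NetRPC. CollectiveBackend, HostBackend,
ChunkedBackend, the BACKENDS registry and the "slots" option are
hypothetical, and ChunkedBackend only imitates a device that can
aggregate a limited number of values per pass.

```python
from abc import ABC, abstractmethod


class CollectiveBackend(ABC):
    """Hypothetical interface an application sees; not an existing library API."""

    @abstractmethod
    def allreduce_sum(self, contributions):
        """Return the element-wise sum of equal-length per-worker vectors."""


class HostBackend(CollectiveBackend):
    """Reference behaviour computed on end hosts only."""

    def allreduce_sum(self, contributions):
        return [sum(column) for column in zip(*contributions)]


class ChunkedBackend(CollectiveBackend):
    """Stand-in for an offloaded backend limited to `slots` values per pass."""

    def __init__(self, slots):
        self.slots = slots

    def allreduce_sum(self, contributions):
        length = len(contributions[0])
        result = []
        for start in range(0, length, self.slots):
            # Aggregate one memory-sized chunk from every worker at a time.
            chunk = [vec[start:start + self.slots] for vec in contributions]
            result.extend(sum(column) for column in zip(*chunk))
        return result


# Control-plane side: a registry maps a configuration to a concrete backend.
BACKENDS = {"host": HostBackend, "chunked": ChunkedBackend}


def instantiate(config):
    """Create the backend named in `config`, passing the remaining options."""
    options = {k: v for k, v in config.items() if k != "backend"}
    return BACKENDS[config["backend"]](**options)


grads = [[1, 2, 3], [10, 20, 30]]            # one gradient vector per worker
comm = instantiate({"backend": "chunked", "slots": 2})
total = comm.allreduce_sum(grads)            # -> [11, 22, 33]
```

The application code calls only allreduce_sum. Switching the
configuration to {"backend": "host"} gives the same result without
any change on the application side.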
3. Security Considerations
TBD.
4. IANA Considerations
TBD.
5. References
5.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
5.2. Informative References
[ESA] Wang, H., "Efficient Data-Plane Memory Scheduling for In-
Network Aggregation", 2022.
[NetRPC] Zhao, B., "NetRPC: Enabling In-Network Computation in
Remote Procedure Calls", 2023.
Authors' Addresses
Chang Liu
China Mobile
Beijing
100053
China
Email: liuchangjc@chinamobile.com
Shiping Xu
China Mobile
Beijing
100053
China
Email: xushiping@chinamobile.com