Transport Area Working Group K. Yao
Internet-Draft S. Xu
Intended status: Informational China Mobile
Expires: 25 April 2024 Y. Li
H. Huang
Huawei Technologies
D. KUTSCHER
HKUST (Guangzhou)
23 October 2023
Collective Communication Optimization: Problem Statement and Use cases
draft-yao-tsvwg-cco-problem-statement-and-usecases-00
Abstract
Collective communication is the basic logical communication model for
distributed applications. When distributed systems scale, the
communication overhead becomes the bottleneck of the entire system and
prevents system performance from increasing further. This draft
describes the performance challenges that arise when collective
communication is employed in a network with more participating nodes
or processes, or when more communication rounds are required to
complete a single job. The document also presents several use cases
in which different aspects of collective communication optimization
are needed.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
Yao, et al. Expires 25 April 2024 [Page 1]
Internet-Draft Collective Communication Optimization: P October 2023
This Internet-Draft will expire on 25 April 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Use Cases
  2.1. Distributed Training
  2.2. High-Performance Computing
  2.3. Distributed Storage Systems
  2.4. Big Data Analysis (MapReduce)
3. Problem Statement
  3.1. Collective Message Transport Issues
    3.1.1. Reliability
    3.1.2. Semantic Gap Between Message and Packet
    3.1.3. Blocking and Non-blocking Communications
  3.2. Control and Management Plane Issues
    3.2.1. In-Network Computing Primitives
    3.2.2. Topology Awareness
  3.3. One to Group Transmission Issues
4. Security Considerations
5. IANA Considerations
6. References
  6.1. Normative References
  6.2. Informative References
Authors' Addresses
1. Introduction
Collective communication is the basic logical communication model for
distributed applications like AI model training and inference, high-
performance computing, big data analysis and distributed storage. It
defines several inter-process communication modes upon which modern
high performance distributed systems can be built. There are several
existing open standards and programming models that have supported
collective communication, such as the Message Passing Interface (MPI)
and Partitioned Global Address Space (PGAS) languages. However, these
programming models focus only on the application level and do not
differentiate between underlying network capabilities. In most
cases, existing implementations of collective communication use
point-to-point (P2P) transmission in the network. Because collective
communication by nature involves one or more senders and one or more
receivers, plain point-to-point transmission inevitably adds
communication overhead inside these distributed systems. A more
optimized and performant communication mechanism, designed
specifically for collective communication, is needed.
One way to implement collective communication is the Remote Direct
Memory Access (RDMA) mechanism. InfiniBand (IB) networks natively
support RDMA and are the state-of-the-art networking solution for
collective communication. To improve collective communication
performance, IB provides enhancements such as the Scalable
Hierarchical Aggregation Protocol (SHArP) [SHArP] and hardware
multicast [Hardware_Multicast], which support offloading of
collective operations, save bandwidth and reduce communication
latency.
However, Ethernet-based RDMA does not support such capabilities.
Since Ethernet is the most widely used link layer protocol,
Ethernet-based RDMA should evolve to support collective communication
offloading. Optimizations in transport protocols and in the
co-design of applications and the network are also needed to make
Ethernet-based RDMA more suitable for collective communication. This
work should be considered in the IETF.
This draft first presents four use cases to illustrate where
collective communication is used and how optimized collective
communication can improve distributed application performance. It
then points out the fundamental reason why existing protocols cannot
meet the high-performance requirements of collective communication:
these distributed applications are not co-designed with the
underlying networking protocols. There is a semantic gap between
inter-process message transport and packet forwarding, which should
be bridged by efficient mapping and optimization. The draft further
analyses the problem from three perspectives. First, current
end-to-end transport protocols do not support efficient offloading of
collective operations. Second, the control and management plane does
not support the specific extensions required by collective
operations, such as topology awareness that takes network offloading
into account. Third, current implementations of collective
communication cannot make use of existing IP multicast protocols,
which could reduce the communication overhead and save bandwidth.
2. Use Cases
2.1. Distributed Training
Large Language Models (LLMs) such as ChatGPT have had a phenomenal
impact on the industry, leading the digital and intelligent
transformation of the whole society. Foundation models can have
trillions of parameters, so they inevitably need to be deployed in a
distributed manner for model training and model inference. In
Transformer-based foundation models, a commonly used training mode is
Mixture of Experts (MoE). MoE relies on two basic collective
operations, AlltoAll and Allreduce.
In the AlltoAll phase, the Gate deployed in each device passes the
intermediate gradients to the Feed-Forward Networks (FFNs) deployed in
other devices, and the FFNs then pass the computation results to the
"Add & Normalization" block for the next computation step. AlltoAll
transmission involves group-to-group communication, which causes
considerable bandwidth contention and can be optimized.
In the Allreduce phase, the gradients generated by each device need to
be submitted to a central node for aggregation. An incast problem may
occur if the number of distributed nodes is too large and the gradient
messages take up too much bandwidth. Offloading the Allreduce
operation to network devices can save a large amount of transmission
bandwidth. Because aggregation then takes place at a switch on the
path rather than at the central node, it could also halve the
transmission distance and reduce the latency.
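The bandwidth argument can be illustrated with a minimal sketch. It
assumes that every worker holds a gradient vector of the same length
and that an offloading switch forwards a single aggregated vector
toward the aggregation host. The function names are hypothetical and
are not taken from any collective communication library.

```python
from typing import List


def allreduce_sum(gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of one gradient vector per worker (the Reduce step)."""
    length = len(gradients[0])
    if any(len(g) != length for g in gradients):
        raise ValueError("all gradient vectors must have the same length")
    return [sum(column) for column in zip(*gradients)]


def vectors_on_aggregator_link(num_workers: int, in_network: bool) -> int:
    """Gradient vectors that cross the last link toward the aggregation host.

    Assumption: without offloading, every worker's vector reaches the host
    (the incast case); with offloading, the switch forwards only the sum.
    """
    return 1 if in_network else num_workers


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce_sum(grads))                              # [9.0, 12.0]
print(vectors_on_aggregator_link(3, in_network=False))   # 3
print(vectors_on_aggregator_link(3, in_network=True))    # 1
```

The result that is distributed back to the workers is the same in both
cases; only the number of vectors that converge on one link differs.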
       Device0               ...               Device1
+--------------------+                  +--------------------+
|                    |     Device0      |                    |
|  +--------------+  |       ...        |  +--------------+  |
|  |self attention|  |     DeviceN      |  |self attention|  |
|  +--------------+  |                  |  +--------------+  |
|  +--------------+  |                  |  +--------------+  |
|  |  Add & Norm  |  |                  |  |  Add & Norm  |  |
|  +--------------+  |                  |  +--------------+  |
|  +--------------+  |                  |  +--------------+  |
|  |   +------+   |  |                  |  |   +------+   |  |
|  |   | Gate | <------+                +----->+ Gate |   |  |
|  |   +------+   |  | |   +--------+   |  |   +------+   |  |
|  |   +------+   |  | +-->+AlltoAll+<--+  |   +------+   |  |
|  |   | FFN0 | <------+   +--------+   +----->+ FFN0 |   |  |
|  |   +------+   |  | |                |  |   +------+   |  |
|  +--------------+  | |   +--------+   |  +--------------+  |
|  +--------------+  | +-->+AlltoAll+<--+  +--------------+  |
|  |  Add & Norm  +<---+   +--------+   +->+  Add & Norm  |  |
|  +--------------+  |                  |  +--------------+  |
|         |          |                  |         |          |
+---------|----------+                  +---------|----------+
          |                                       |
          |                +---------+            |
          +--------------> |Allreduce| <----------+
                           +---------+
                           Distributed
Figure 1: Collective communication in MoE
2.2. High-Performance Computing
The basis of HPC is parallel computing. In modern HPC clusters,
parallel computing is usually implemented by programming multiple CPU
cores to work on a task concurrently, which leads to higher
performance. During parallel computing, messages are passed between
processes running on different CPU cores, and this is where
collective operations are needed. In main-worker mode, Allreduce
happens when the main node gathers messages from several workers and
computes the aggregated result.
                   +-------+
         +-------> |worker0|
         |         +-------+
         |
         |
         | Allreduce
         |
         |
+----+   |         +-------+
|Main| <---------> |worker1|
+----+   |         +-------+
         |
         |
         | Allreduce
         |
         |         +-------+
         +-------> |worker2|
                   +-------+
Figure 2: Main-worker Allreduce in HPC clusters
2.3. Distributed Storage Systems
Collective operations such as broadcast are also used in distributed
storage systems. Primary servers perform data operations such as
replication and modification and broadcast the messages to backup
servers. Currently, the broadcast operation is implemented with
unicast operations on commodity RDMA Network Interface Cards (RNICs),
and the performance of RDMA-based data replication is limited by two
drawbacks of unicast traffic: data redundancy and independent
replication states. On the one hand, sending the same data in
multiple replications wastes a large amount of bandwidth on the
network links to the replica servers. On the other hand, the
independent replication states multiply the CPU work, because the
primary has to post the request, poll for the completion and keep
track of the delivery status separately for each unicast replication,
which adds considerable overhead during transmission.
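The two drawbacks can be made concrete with a rough cost model. This
is only a sketch: it assumes that one send request and one completion
are needed per unicast replication, and it compares this with an
idealized one-to-group transmission in which the switch replicates the
data. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReplicationCost:
    requests_posted: int      # send requests the primary CPU must post
    completions_polled: int   # completions the primary CPU must poll
    bytes_on_uplink: int      # bytes leaving the primary toward the switch


def unicast_cost(num_backups: int, payload_bytes: int) -> ReplicationCost:
    """One independent unicast replication per backup server."""
    if num_backups < 1:
        raise ValueError("at least one backup server is required")
    return ReplicationCost(num_backups, num_backups,
                           num_backups * payload_bytes)


def one_to_group_cost(num_backups: int, payload_bytes: int) -> ReplicationCost:
    """Assumed ideal case: one request, replicated by the switch."""
    if num_backups < 1:
        raise ValueError("at least one backup server is required")
    return ReplicationCost(1, 1, payload_bytes)


print(unicast_cost(3, 4096))
print(one_to_group_cost(3, 4096))
```

With three backups and a 4096-byte update, the unicast model posts
three requests and sends 12288 bytes on the uplink, while the
idealized one-to-group model posts one request and sends 4096 bytes.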
              +-------+
              |Primary|
              +---+---+
                  |
                  | Replica
                  | broadcast
                  v
               +--+---+
   +-----------+Switch+----------+
   |           +--+---+          |
   |              |              |
   |              |              |
   |              |              |
   v              v              v
+--+---+       +--+---+      +---+--+
|Backup|       |Backup|      |Backup|
+------+       +------+      +------+
Figure 3: Broadcast in distributed storage systems
2.4. Big Data Analysis (MapReduce)
Mainstream distributed big data analysis systems such as Spark also
use collective operations. The main communication performance
bottleneck occurs in the Shuffle phase. The Shuffle phase involves
data movement across multiple workers. It was originally implemented
with TCP/IP sockets for communication among distributed processes,
which results in low performance. The Shuffle phase can be
accelerated with MPI Java bindings, replacing TCP with a new
MPI-based transport for high-performance data movement.
before
shuffle
+--+ +--+
|c1| |c1|
+-------+ |c1| |a1| after shuffle
|worker2| |c1| |b1|
+-------+ +--+ +--+
^ +--+ +--+
| |a1| |a1|
| +-------+ |a1| |c1|
+------+ +---------> |worker1| |a1| |b1|
|Master| +-------+ +--+ +--+
+------+
^
+------+ |
|Driver| |
+------+ |
+---------------^
|
|
|
v +--+ +--+
|b1| |b1|
+-------+ |b1| |a1|
|worker3| |b1| |c1|
+-------+ +--+ +--+
Figure 4: Collective operations in Shuffle phase of big data systems
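The data movement of a shuffle can be sketched as an Alltoall
exchange. The example below assumes key-based partitioning, which is
one common way to decide the destination of a record; the helper
names are hypothetical and do not correspond to Spark or MPI APIs.

```python
import zlib
from typing import List


def partition(records: List[str], num_workers: int) -> List[List[str]]:
    """Split one worker's records into one block per destination worker."""
    blocks: List[List[str]] = [[] for _ in range(num_workers)]
    for key in records:
        # crc32 is stable across processes, unlike Python's randomized
        # built-in hash() for strings.
        blocks[zlib.crc32(key.encode()) % num_workers].append(key)
    return blocks


def shuffle(per_worker_records: List[List[str]]) -> List[List[str]]:
    """Exchange the blocks so that equal keys end up on the same worker."""
    n = len(per_worker_records)
    send = [partition(records, n) for records in per_worker_records]
    # Alltoall: worker j receives block j from every worker i.
    return [[key for i in range(n) for key in send[i][j]] for j in range(n)]


before = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "c"]]
after = shuffle(before)
print(after)
```

Every worker sends a block to every other worker, which is why the
Shuffle phase stresses the network in the same way as Alltoall.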
3. Problem Statement
The demand for computing resources in AI/HPC applications is growing
rapidly, and the computing power of a single node cannot meet it.
Parallel computing has therefore become a trend. Parallel computing
is a type of computing architecture in which several processors
simultaneously execute multiple smaller calculations broken down from
a larger, complex problem. Collective communication is used for data
aggregation, data distribution and synchronization in distributed
computing systems. The collective communication primitives include
collective operations such as Reduce, All-Reduce, Bcast, Alltoall,
Scatter and Gather, and synchronization operations such as Barrier.
+---------------+--------------+---------------------------------+
|     Type      |   Function   |           Description           |
+---------------+--------------+---------------------------------+
|               | Bcast        | One to group.                   |
|               |              | One process sends (broadcasts)  |
|               |              | some data to all the processes  |
|               |              | in a group.                     |
|               +--------------+---------------------------------+
|               | Gather       | Group to one.                   |
|               |              | An array is scattered across all|
|               |              | processes in the group, and one |
|               |              | process (root) collects each    |
|               |              | piece of the array into a       |
|               |              | specified array.                |
|               +--------------+---------------------------------+
|     Data      | Allgather    | All processes, not just the     |
|   Movement    |              | root, receive the result of     |
|               |              | Gather.                         |
|               +--------------+---------------------------------+
|               | Scatter      | One to group.                   |
|               |              | One process splits the data     |
|               |              | into n segments, where the i-th |
|               |              | segment is sent to the i-th     |
|               |              | process in the group, which has |
|               |              | n processes.                    |
|               +--------------+---------------------------------+
|               | Alltoall     | This is an extension to         |
|               |              | Allgather. Each process sends   |
|               |              | distinct data to each receiver. |
|               |              | The j-th block from process i is|
|               |              | received by process j and stored|
|               |              | in the i-th block.              |
+---------------+--------------+---------------------------------+
|               | Reduce       | Group to one.                   |
|               |              | Used to collect data or partial |
|               |              | results from multiple processing|
|               |              | units and to combine them into a|
|               |              | global result by a chosen       |
|               |              | operator.                       |
|               +--------------+---------------------------------+
|               | All-Reduce   | Distribute the result of a      |
|     Data      |              | Reduce operation to all         |
|               |              | processes in the group.         |
|  Aggregation  +--------------+---------------------------------+
|               |Reduce-Scatter| Scatter the result of a         |
|               |              | reduction to all processes.     |
|               +--------------+---------------------------------+
|               | Scan         | A Scan operation performs       |
|               |              | partial reductions on           |
|               |              | distributed data.               |
+---------------+--------------+---------------------------------+
|Synchronization| Barrier      | A synchronous operation to      |
|               |              | synchronize all processes       |
|               |              | within a communicator.          |
+---------------+--------------+---------------------------------+
Figure 5: Collective Communication of Parallel Computing
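The semantics listed in Figure 5 can be written down as a small
single-process reference model. This is only a sketch for
illustration: each list index stands for one process, no data is
actually transmitted, and the function names are chosen here rather
than taken from MPI or any other library.

```python
from functools import reduce as fold
from itertools import accumulate
from operator import add
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def bcast(root_value: T, n: int) -> List[T]:
    """Every one of the n processes receives the root's value."""
    return [root_value for _ in range(n)]


def gather(pieces: Sequence[T]) -> List[T]:
    """The root collects one piece from every process."""
    return list(pieces)


def allgather(pieces: Sequence[T]) -> List[List[T]]:
    """Every process receives the result of Gather."""
    return [list(pieces) for _ in pieces]


def scatter(segments: Sequence[T]) -> List[T]:
    """Process i receives segments[i] from the root."""
    return list(segments)


def alltoall(send: Sequence[Sequence[T]]) -> List[List[T]]:
    """send[i][j] goes from process i to process j and is stored at index i."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]


def reduce_(values: Sequence[T], op: Callable[[T, T], T] = add) -> T:
    """The root combines one value per process with the chosen operator."""
    return fold(op, values)


def allreduce(values: Sequence[T], op: Callable[[T, T], T] = add) -> List[T]:
    """Every process receives the result of Reduce."""
    result = reduce_(values, op)
    return [result for _ in values]


def reduce_scatter(vectors: Sequence[Sequence[T]],
                   op: Callable[[T, T], T] = add) -> List[T]:
    """Process i receives element i of the element-wise reduction."""
    return [fold(op, column) for column in zip(*vectors)]


def scan(values: Sequence[T], op: Callable[[T, T], T] = add) -> List[T]:
    """Inclusive prefix reduction: process i gets op over values[0..i]."""
    return list(accumulate(values, op))


print(alltoall([[1, 2], [3, 4]]))   # [[1, 3], [2, 4]]
print(allreduce([1, 2, 3]))         # [6, 6, 6]
print(scan([1, 2, 3]))              # [1, 3, 6]
```

In this model Alltoall is a transpose of the send matrix, which
matches the description of block j from process i being stored as
block i at process j.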
3.1. Collective Message Transport Issues
3.1.1. Reliability
Traditional transport layers provide reliable transmission
mechanisms, such as TCP and QUIC, only for point-to-point
communication, and offer only best-effort transmission services for
multicast and broadcast communication. Therefore, to meet the
requirement of reliable transmission, implementations of collective
operations such as Bcast and AlltoAll are often built on
point-to-point operations at the hosts.
This wastes host CPU resources through repeated packaging and
sending, while multiple identical data packets on the same path
constitute redundant traffic and increase the network load. To solve
these problems, parallel computing networks need a reliable
transmission mechanism that is suitable for collective operations and
that works together with appropriate traffic engineering and
congestion control mechanisms.
Collective communication optimization generally uses a network node
such as a switch to perform a collective operation such as Reduce.
The network node receives the data from multiple senders and computes
the result with the provided operator, for example SUM. In terms of
reliability, there are two ways to ensure the correctness of the
collective operations.
The first one is to keep using end-to-end reliability. The
intermediate node that performs the Reduce is not considered an end
point that participates in the traditional reliability guarantee. In
Figure 6, Hosts 1 to 3 are workers for a Reduce operation, and Host 4
is the parameter server that collects the data from all the workers
to compute a SUM. A transport session is established between each
worker and the parameter server. The switch in the middle performs
the in-network aggregation. If any packet is lost between a worker
and the switch, or between the switch and the parameter server, an
end-to-end reliability mechanism such as retransmission is triggered
by the end points. The
intermediate switch is relieved of the burden of maintaining
transport sessions and of most of the state maintenance. Consider
scenarios in which the data packets are small, as in some
high-performance computing applications. The switch simply keeps a
single packet from every worker until it can compute the sum over all
the workers, so the state it maintains is minimal. How reliability
can be ensured efficiently, in terms of fast loss detection and
signaling, when an intermediate node takes part needs to be
considered. Beyond reliability, it also needs to be considered how
to ensure that the data is aggregated at the correct intermediate
node or nodes. As for encryption, ubiquitous connection encryption
(e.g., QUIC) makes it hard to employ performance-enhancing functions.
The second one is to treat the intermediate switch as an end point.
Reliability is guaranteed between the worker and the switch and
between the switch and the parameter server as two independent
sessions. This requires the switch to support the full transport
function, including session establishment and all the state
maintenance. A lighter form of transport session maintenance is
expected to bring more benefit. In addition, there should be
fallback signaling to tolerate faults. As for encryption, if network
devices are required to encrypt and decrypt data, they need a lot of
resources, have to maintain a large number of sessions, and face
issues such as the management of secret keys.
+-----+
|Pkt_1|
+-----+
+------+
|Host_1+---------+
+------+         |
                 |
+-----+          |                              +-------+
|Pkt_2|          |  +-------------------------+ |Pkt_Agg|
+-----+          |  |                         | +-------+
+------+         |  |         Switch          |            +------+
|Host_2+---------+--> (In-Network Aggregation)+----------->|Host_4|
+------+         |  |                         |            +------+
                 |  +-------------------------+
+-----+          |
|Pkt_3|          |
+-----+          |
+------+         |
|Host_3+---------+
+------+
Figure 6: In-Network Aggregation Packet Termination
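The switch behaviour of the first approach, in which the switch keeps
a single packet from every worker until it can compute the sum, can
be sketched as follows. The class is hypothetical; it assumes integer
payloads, one slot per aggregation round, and that a retransmitted
packet carries the same worker identifier as the original.

```python
from typing import Dict, Optional


class AggregationSlot:
    """State a switch keeps for one aggregation round."""

    def __init__(self, num_workers: int) -> None:
        self.num_workers = num_workers
        self.values: Dict[int, int] = {}   # worker id -> value held so far

    def on_packet(self, worker_id: int, value: int) -> Optional[int]:
        """Store a worker's value; return the SUM once all workers arrived.

        A retransmitted packet from the same worker overwrites its
        earlier copy, so duplicates are not added twice.
        """
        self.values[worker_id] = value
        if len(self.values) == self.num_workers:
            return sum(self.values.values())
        return None


slot = AggregationSlot(num_workers=3)
print(slot.on_packet(1, 10))   # None
print(slot.on_packet(1, 10))   # None (retransmission from Host_1)
print(slot.on_packet(2, 20))   # None
print(slot.on_packet(3, 5))    # 35
```

Because the slot is kept after completion, a retransmission that
arrives later returns the same sum again. That is one possible way to
recover from the loss of the aggregated packet between the switch and
Host_4, and it is an assumption of this sketch rather than a
mechanism defined by this document.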
3.1.2. Semantic Gap Between Message and Packet
Collective operations offloading uses network devices to carry out
computing tasks that require low computational precision but high I/O
throughput, thereby optimizing collective operations. Devices that
perform collective communication optimization not only route and
forward packets but also need to process transport layer messages.
Therefore, the transport layer needs to provide a mapping mechanism
between packets and messages and to carry out the transmission of
transport layer messages. Existing transport layers cannot provide
this function.
Message size affects the performance of collective operation
implementations. The packet size usually has an upper bound (for
example, an Ethernet jumbo frame can be up to about 9.6 kB), whereas
the message size is not limited and is determined by the application.
For network devices, processing large messages incurs considerable
overhead: deep buffers are needed, and the Serialization and
Deserialization (SerDes) components may come under heavy pressure. In
addition, it affects the message sending rate at the host and thus
lowers the end-to-end system performance. How to choose an
appropriate message size and optimize its processing is still an open
problem.
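A minimal sketch of the mapping between messages and packets is shown
below. The packet header, consisting of a message identifier, a packet
index and the total number of packets, is hypothetical, and the
payload limit of 9000 bytes is only an assumed value.

```python
from typing import List, Tuple

# (msg_id, index, total, payload): a hypothetical per-packet header.
Packet = Tuple[int, int, int, bytes]


def segment(msg_id: int, message: bytes, max_payload: int) -> List[Packet]:
    """Split one application message into packets of bounded size."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    chunks = [message[i:i + max_payload]
              for i in range(0, len(message), max_payload)] or [b""]
    return [(msg_id, idx, len(chunks), chunk)
            for idx, chunk in enumerate(chunks)]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message from its packets, which may arrive in any order."""
    total = packets[0][2]
    by_index = {idx: payload for _, idx, _, payload in packets}
    if len(by_index) != total:
        raise ValueError("message incomplete")
    return b"".join(by_index[i] for i in range(total))


message = bytes(range(256)) * 100            # 25600-byte message
packets = segment(1, message, max_payload=9000)
print(len(packets))                          # 3
print(reassemble(list(reversed(packets))) == message)   # True
```

A device that processes messages rather than packets has to hold the
state kept in `by_index` for every message in flight, which is where
the buffer pressure described above comes from.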
      Host         Traditional       In-Network           Host
                 Network Devices  Network Devices
 +-------------+                                    +--------------+
 |             |                                    |              |
 | +---------+ |                                    | +----------+ |
 | |   App   | |                  +--------------+  | |   App    | |
 | +---------+ |                  |              |  | +----------+ |
 |             |                  |              |  |              |
++-------------+------------------+--------------+--+--------------++
||             |            Messages             |  |              ||
|| +---------+ |                  | +----------+ |  | +----------+ ||
|| |Transport| |                  | | Message  | |  | | Transport| ||
|| |  Layer  | |                  | |Processing| |  | |  Layer   | ||
|| +---------+ |                  | +----------+ |  | +----------+ ||
||             |                  |              |  |              ||
++-------------+------------------+--------------+--+--------------++
 |             |                  |              |  |              |
 |             | +--------------+ |              |  |              |
++-------------+-+--------------+-+--------------+--+--------------++
||             | |              | |              |  |              ||
|| +---------+ | |              | |              |  | +----------+ ||
|| | Network | | |            Packets            |  | | Network  | ||
|| |  Layer  | | |              | |              |  | |  Layer   | ||
|| |  (IB,   | | |              | |              |  | |  (IB,    | ||
|| | TCP/IP) | | | +----------+ | | +----------+ |  | | TCP/IP)  | ||
|| +---------+ | | |          | | | |          | |  | +----------+ ||
||             | | |          | | | |          | |  |              ||
|| +---------+ | | |  Packet  | | | |  Packet  | |  | +----------+ ||
|| |  Link   | | | |Forwarding| | | |Forwarding| |  | |  Link    | ||
|| |  Layer  | | | |          | | | |          | |  | |  Layer   | ||
|| |(IB Link,| | | |          | | | |          | |  | | (IB Link,| ||
|| |  Eth)   | | | +----------+ | | +----------+ |  | |  Eth)    | ||
|| +---------+ | |              | |              |  | +----------+ ||
||             | |              | |              |  |              ||
++-------------+-+--------------+-+--------------+--+--------------++
 +---------+---+ +-----+----+---+ +----+----+----+  +-----+--------+
           |           ^    |          ^    |             ^
           +-----------+    +----------+    +-------------+
Figure 7: In-Network computing devices need to process transport
layer messages
3.1.3. Blocking and Non-blocking Communications
Parallel computing communication operators can be divided into two
categories, blocking and non-blocking, depending on whether the
caller waits for the result of the message exchange before proceeding
with the next step of computation. A blocking collective operation
blocks the calling process until the communication is completed.
Non-blocking operations allow the communication to be handled in the
background, so that communication and computation can overlap.
Parallel computing networks need to support both types of
communication at the same time. For blocking communication, it is
even more necessary to use methods such as in-network computing to
optimize collective communication performance.
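The difference between the two categories can be sketched with a
thread pool standing in for the communication backend. The function
fake_allreduce is a placeholder that only sleeps to model network
latency; it is not a real collective implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_allreduce(values: List[int]) -> int:
    """Stand-in for a collective call; the sleep models network latency."""
    time.sleep(0.05)
    return sum(values)


def blocking_step(values: List[int]) -> Tuple[int, int]:
    total = fake_allreduce(values)        # the caller waits here
    local = sum(v * v for v in values)    # computation starts afterwards
    return total, local


def nonblocking_step(values: List[int],
                     executor: ThreadPoolExecutor) -> Tuple[int, int]:
    pending = executor.submit(fake_allreduce, values)   # returns immediately
    local = sum(v * v for v in values)    # overlaps with the communication
    return pending.result(), local        # wait only when the result is needed


with ThreadPoolExecutor(max_workers=1) as pool:
    print(blocking_step([1, 2, 3]))            # (6, 14)
    print(nonblocking_step([1, 2, 3], pool))   # (6, 14)
```

Both variants return the same values. In the blocking variant the
communication latency is fully exposed to the caller, which is why
reducing that latency matters most for blocking operations.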
3.2. Control and Management Plane Issues
3.2.1. In-Network Computing Primitives
Once collective operations offloading is supported, the collective
communication library needs to be extended to support not only
end-to-end communication but also end-to-network communication, which
means supporting in-network processing and computing functions.
The implementation of offloaded collective operations relies on the
capabilities provided by physical network devices, known as
in-network computing primitives. Currently, most in-network
computing devices are programmable network devices that are
programmed with languages such as P4 (Programming Protocol-independent
Packet Processors) and NPL (Network Programming Language). These
programmable switches are not powerful enough to implement
non-trivial application-layer behavior, and their platforms are too
heterogeneous to manage. The in-network computing primitives that
network devices can provide, including data types and operations,
need to be standardized. Parallel in-network computing communication
operators then need to be implemented on top of the standard
primitives.
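What a standardized description of in-network computing primitives
might cover can be sketched as a capability record that a collective
communication library checks before offloading. All names and values
below are hypothetical; this document does not define any such
encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class DataType(Enum):
    INT32 = "int32"
    FLOAT16 = "float16"
    FLOAT32 = "float32"


class Operation(Enum):
    SUM = "sum"
    MAX = "max"
    MIN = "min"


@dataclass(frozen=True)
class DeviceCapability:
    """Primitives that one network device advertises (hypothetical)."""
    data_types: FrozenSet[DataType]
    operations: FrozenSet[Operation]
    max_elements_per_packet: int


def can_offload(device: DeviceCapability, dtype: DataType,
                op: Operation, elements: int) -> bool:
    """True if the device can execute the requested reduction itself."""
    return (dtype in device.data_types
            and op in device.operations
            and elements <= device.max_elements_per_packet)


switch = DeviceCapability(
    data_types=frozenset({DataType.INT32}),
    operations=frozenset({Operation.SUM, Operation.MAX}),
    max_elements_per_packet=64,
)
print(can_offload(switch, DataType.INT32, Operation.SUM, 32))     # True
print(can_offload(switch, DataType.FLOAT32, Operation.SUM, 32))   # False
```

When the check fails, the library would fall back to end-to-end
communication between hosts.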
3.2.2. Topology Awareness
In data centers, fat-tree or Clos topologies are widely used. Some
fully meshed network topologies exist as well.
As the scale of parallel computing clusters continues to increase, it
is difficult to achieve optimal transmission performance solely
through the automatic convergence of traditional routing protocols.
Routing needs to cooperate with topology-aware algorithms that
explore complex topologies and perform path planning, traffic
optimization and similar tasks. However, existing topology-aware
algorithms do not consider collective operations offloading and need
to be modified to enable cooperation between end hosts and the
network, so that the requirements of end-to-end path planning and
traffic optimization are met.
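One simple form of topology awareness for offloading is to select the
switch at which the traffic of a set of workers meets. The sketch
below assumes a strictly tree-shaped topology in which every node has
a single parent. This is a simplification, since fat-tree and Clos
fabrics offer multiple uplinks. The node names are hypothetical.

```python
from typing import Dict, List, Optional


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """The node itself followed by all of its ancestors."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def aggregation_switch(workers: List[str],
                       parent: Dict[str, str]) -> Optional[str]:
    """Lowest switch that is an ancestor of every worker, if any."""
    if not workers:
        return None
    candidates = path_to_root(workers[0], parent)[1:]   # skip the host itself
    others = [set(path_to_root(w, parent)) for w in workers[1:]]
    for switch in candidates:
        if all(switch in ancestors for ancestors in others):
            return switch
    return None


topology = {"h1": "leaf1", "h2": "leaf1", "h3": "leaf2",
            "leaf1": "spine1", "leaf2": "spine1"}
print(aggregation_switch(["h1", "h2"], topology))   # leaf1
print(aggregation_switch(["h1", "h3"], topology))   # spine1
```

Workers under the same leaf switch can be aggregated there, while a
job that spans several leaves has to be aggregated higher in the tree
or hierarchically.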
3.3. One to Group Transmission Issues
IP multicast has been designed to support broadcast-related
applications such as live streaming and video conferencing. There
are standardized IP multicast protocols, such as PIM [RFC7761] and
BIER [RFC8279]. However, existing IP multicast protocols have not
been fully used by distributed application systems that require
collective communication. Most collective operations still use
unicast transmission, which incurs considerable overhead in the form
of information redundancy, bandwidth occupancy and host CPU
consumption. What is needed is low-latency multi-destination
delivery, potentially with reliability or at least failure detection.
IP multicast is one potential solution; it may not be the best one,
but it is worth trying. Although existing IP multicast protocols
each have their best-suited application scenarios and differ in state
maintenance and multicast tree construction, they are still worth
considering for collective communication scenarios. Collective
operations such as Bcast and AlltoAll can be enhanced by extending IP
multicast protocols, as can composite operations such as
Reduce-Scatter and Allgather.
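Failure detection on top of best-effort multi-destination delivery
can be sketched with per-group sequence numbers. The header layout, a
32-bit group identifier followed by a 64-bit sequence number, is
hypothetical and is not defined by PIM, BIER or this document.

```python
import struct
from typing import Iterable, List, Tuple

# Hypothetical header: group id (32 bits), sequence number (64 bits),
# both in network byte order.
HEADER = struct.Struct("!IQ")


def encode(group_id: int, seq: int, payload: bytes) -> bytes:
    """Prefix the payload with the group id and sequence number."""
    return HEADER.pack(group_id, seq) + payload


def decode(datagram: bytes) -> Tuple[int, int, bytes]:
    """Split a datagram into group id, sequence number and payload."""
    group_id, seq = HEADER.unpack_from(datagram)
    return group_id, seq, datagram[HEADER.size:]


def missing_sequences(received: Iterable[int], highest_sent: int) -> List[int]:
    """Sequence numbers in 0..highest_sent that a receiver never saw."""
    seen = set(received)
    return [s for s in range(highest_sent + 1) if s not in seen]


print(decode(encode(7, 42, b"chunk")))        # (7, 42, b'chunk')
print(missing_sequences([0, 1, 3, 5], 5))     # [2, 4]
```

A receiver could report the missing sequence numbers to the sender,
which then retransmits only those packets. Whether such repair is
done by unicast or by a further multicast transmission is left open
here.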
4. Security Considerations
Collective communication optimization may introduce some security and
privacy concerns, especially when collective operations offloading is
needed.
On the one hand, the distributed nature of the computations and the
involvement of network devices raise issues of data confidentiality,
integrity and authentication. Vulnerabilities can arise when data
processed in the network is exposed to unauthorized access. It is
suggested to support both security-enabled and security-less
deployments, so that limited domains [RFC8799] do not have to pay the
penalty of expensive cryptographic or authorization operations. The
applications and the network within a limited domain can trust each
other, since both may belong to the same administrator. Extending
the technology to the Internet would require that intrinsic
protection mechanisms be designed in from the start.
On the other hand, encrypted data brings challenges and performance
issues for processing on network devices. Decrypting and encrypting
data on network devices is not only inefficient but also involves
issues such as key management and authorization. Processing
encrypted data directly may not be applicable to all encryption
algorithms and is not suitable for all scenarios.
5. IANA Considerations
TBD.
6. References
6.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC7761] Fenner, B., Handley, M., Holbrook, H., Kouvelas, I.,
Parekh, R., Zhang, Z., and L. Zheng, "Protocol Independent
Multicast - Sparse Mode (PIM-SM): Protocol Specification
(Revised)", STD 83, RFC 7761, DOI 10.17487/RFC7761, March
2016, <https://www.rfc-editor.org/info/rfc7761>.
[RFC8279] Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
Explicit Replication (BIER)", RFC 8279,
DOI 10.17487/RFC8279, November 2017,
<https://www.rfc-editor.org/info/rfc8279>.
[RFC8799] Carpenter, B. and B. Liu, "Limited Domains and Internet
Protocols", RFC 8799, DOI 10.17487/RFC8799, July 2020,
<https://www.rfc-editor.org/info/rfc8799>.
6.2. Informative References
[Hardware_Multicast]
Liu, J., "Fast and Scalable MPI-Level Broadcast using
InfiniBand's Hardware Multicast Support",
DOI 10.1109/ipdps.2004.1302912, 2004,
<https://doi.org/10.1109/ipdps.2004.1302912>.
[SHArP] Graham, R. L., "Scalable Hierarchical Aggregation Protocol
(SHArP): A Hardware Architecture for Efficient Data
Reduction", DOI 10.1109/COMHPC.2016.006, 2016,
<https://doi.org/10.1109/COMHPC.2016.006>.
Authors' Addresses
Kehan Yao
China Mobile
Beijing
100053
China
Email: yaokehan@chinamobile.com
Shiping Xu
China Mobile
Beijing
100053
China
Email: xushiping@chinamobile.com
Yizhou Li
Huawei Technologies
Nanjing, Jiangsu
China
Email: liyizhou@huawei.com
Hongyi Huang
Huawei Technologies
Beijing
China
Email: hongyi.huang@huawei.com
Dirk KUTSCHER
HKUST (Guangzhou)
Guangzhou
China
Email: dku@hkust-gz.edu.cn