Internet DRAFT - draft-yao-tsvwg-cco-requirement-and-analysis
Transport Area Working Group K. Yao
Internet-Draft S. Xu
Intended status: Informational China Mobile
Expires: 8 August 2024 Y. Li
H. Huang
Huawei Technologies
W. Wang
New H3C Technologies Co., Ltd
D. KUTSCHER
THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY (GUANGZHOU)
5 February 2024
Collective Communication Optimizations: Requirement and Analysis
draft-yao-tsvwg-cco-requirement-and-analysis-01
Abstract
Generative AI applications depend on large-scale parallel computing
clusters for model training and inference. Existing implementations
of collective communication in parallel computing are built on top of
RDMA, the most widely adopted AI transport protocol. However,
One-to-Many, Many-to-One, and Many-to-Many collective operations all
depend on the point-to-point transport semantics of RDMA, which
inevitably introduces additional bandwidth occupancy and transmission
overhead. Emerging approaches for collective communication
optimization focus on network-assisted collective acceleration and
can work compatibly with RDMA. This document analyzes different
technical schemes for network-assisted collective acceleration based
on RDMA, and presents the gap between this work and current IETF
standards, notably iWARP. Requirements for designing new standards
are proposed accordingly.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Yao, et al. Expires 8 August 2024 [Page 1]
Internet-Draft Collective Communication Optimizations: February 2024
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 8 August 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Definition of Terms
3. Existing Work and Analysis
  3.1. Classification and Analysis of Two Modes for NACA
    3.1.1. NACA Based on Server-to-server RDMA Connection
    3.1.2. NACA Based on Server-to-switch RDMA Connection
  3.2. Gap Analysis of Existing Solutions
    3.2.1. Infiniband SHARP
    3.2.2. RoCEv2 Solutions
    3.2.3. iWARP
4. Requirements
  4.1. NACA Function Design and Header Definition
  4.2. Bridge RDMA Transport Semantics
  4.3. RDMA Transport Related Issues
  4.4. Joint Design of NACA Task Assignment and Routing Policies
  4.5. Security and Traffic Isolation
  4.6. Fault Tolerance
5. Security Considerations
6. Operational Considerations
7. IANA Considerations
8. References
  8.1. Normative References
  8.2. Informative References
Authors' Addresses
1. Introduction
With the development of distributed applications, especially High
Performance Computing (HPC) and Artificial Intelligence (AI), the
scale of parallel computing clusters is constantly expanding, and the
pressure that collective communication puts on the network is also
increasing. Existing implementations of collective communication are
based on Remote Direct Memory Access (RDMA). The most obvious
problem is that the point-to-point transmission semantics of RDMA are
not well aligned with the logical communication patterns defined in
collective communication. This mismatch incurs more bandwidth
occupancy, more memory copies at endpoints, and more data movement,
thus lowering the overall parallel computing efficiency. Detailed
use cases and problems are described in
[I-D.yao-tsvwg-cco-problem-statement-and-usecases].
Emerging collective communication optimization schemes focus on
network-assisted collective acceleration, which can greatly
alleviate network pressure, improve transmission efficiency, and
shorten flow completion time (FCT). Some of these approaches can
also work compatibly with RDMA, which opens a new standardization
design space for extended RDMA-based protocols for collective
communication optimization. In the following sections, this document
analyzes different technical schemes for network-assisted collective
acceleration based on RDMA, and presents the gap between this work
and current IETF standards. Requirements for designing new standards
are proposed accordingly.
2. Definition of Terms
*Collective communication: A set of communication patterns that
application processes follow to communicate with each other in a
parallel computing cluster. These patterns include the One-to-Many,
Many-to-One, and Many-to-Many delivery modes.
*Network Assisted Collective Acceleration (NACA): Using network
devices, such as switches, to offload and perform collective
operations, so as to improve the overall collective communication
efficiency.
3. Existing Work and Analysis
NACA offloads collective operations to switches. For example,
Allreduce is offloaded to the switch for aggregation, Broadcast is
offloaded to the switch for data copy, and Scatter leverages the
switch for data tailoring. Detailed collective operations are listed
in [I-D.yao-tsvwg-cco-problem-statement-and-usecases]. NACA can be
built on RDMA so as to optimize collective communication. RDMA
allows endpoints to directly read and write memory data from other
endpoints at high speed without requiring kernel processing and CPU
resources. Memory zero copy and kernel bypass have gradually made
RDMA the mainstream communication technology for HPC and AI
applications in data centers. This draft mainly focuses on the
analysis of two different communication modes for RDMA-based network-
assisted collective acceleration.
3.1. Classification and Analysis of Two Modes for NACA
When using network devices to offload collective operations, RDMA
communication modes can be divided into server-to-server mode and
server-to-switch mode.
3.1.1. NACA Based on Server-to-server RDMA Connection
In the server-to-server RDMA connection mode, the logic of existing
applications does not change, and the switches that participate in
collective communication are invisible to applications. The
destination of an RDMA connection is set to be another server
endpoint, but switches can participate in the collective operations
during data transmission. In this communication mode, native
transport protocols can produce false positives. For example, when
the switch helps perform data aggregation during Allreduce, only one
aggregated packet is sent from the switch to the destination server
in each round. The packets from the multiple senders are dropped
after the aggregation. The destination side interprets the missing
packets as packet loss and triggers packet retransmission
[I-D.yao-tsvwg-cco-problem-statement-and-usecases]. The reliability
mechanisms of native RDMA transport therefore need to be modified.
+------------+           +----------------------+            +------------+
|            <-----------------------------------------------> receiver 1 |
|            |           |                      |            +------------+
|            |           |        switch        |
|   sender   |           | forwarding and NACA  |
|            |           |                      |            +------------+
|            <-----------------------------------------------> receiver 2 |
|            |           |   RDMA connection    |            +------------+
|            |           |                      |
|            |           |                      |
|            |           |                      |            +------------+
|            <-----------------------------------------------> receiver 3 |
+------------+           +----------------------+            +------------+
Figure 1: NACA Based on Server-to-server RDMA Connection
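The false-positive loss described above can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not RDMA code: the function names `switch_aggregate` and `receiver_missing_senders` are hypothetical, and a packet is modeled as a plain (sender, PSN, value) tuple.

```python
from collections import defaultdict

def switch_aggregate(packets):
    """Toy in-network Allreduce: sum values per PSN, emit one packet per round.

    packets: list of (sender, psn, value). The per-sender packets are consumed
    by the switch, so only the aggregate travels on to the destination.
    """
    sums = defaultdict(int)
    for _sender, psn, value in packets:
        sums[psn] += value
    # The aggregate is labelled as coming from the switch, not from a sender.
    return [("switch", psn, total) for psn, total in sorted(sums.items())]

def receiver_missing_senders(expected_senders, delivered):
    """Senders from which an unmodified receiver saw no packet at all."""
    seen = {sender for sender, _psn, _value in delivered}
    return sorted(set(expected_senders) - seen)

senders = ["s1", "s2", "s3"]
round0 = [(s, 0, i + 1) for i, s in enumerate(senders)]  # values 1, 2, 3
delivered = switch_aggregate(round0)
missing = receiver_missing_senders(senders, delivered)
print(delivered)  # a single aggregated packet
print(missing)    # every original sender looks "lost" to the receiver
```

In this model the receiver would request retransmission from all three senders even though the aggregate arrived intact, which is why the reliability mechanism has to become aware of the aggregation.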
3.1.2. NACA Based on Server-to-switch RDMA Connection
The server-to-switch mode means that switches act as RDMA endpoints,
and RDMA connection is built between the sender, i.e. server, and
the destination, i.e. switch. In this case, hop-by-hop RDMA
connections are built for end-to-end data transmission. It is
necessary to define what RDMA functions should be offloaded to
switches, since offloading all RDMA functions to the network would
bring heavy burdens to switches not only on memory and buffer space,
but also on protocol complexity.
                                               RDMA connection
                      +------------------------+             +------------+
                      |                        <-------------> receiver 1 |
                      |                        |             +------------+
            RDMA      |                        |
            Connection|                        |
+----------+          |         switch         |             +------------+
|  sender  <---------->  forwarding and NACA   <-------------> receiver 2 |
+----------+          |                        |             +------------+
                      |                        |
                      |                        |
                      |                        |             +------------+
                      |                        <-------------> receiver 3 |
                      +------------------------+             +------------+
Figure 2: NACA Based on Server-to-switch RDMA Connection
3.2. Gap Analysis of Existing Solutions
3.2.1. Infiniband SHARP
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
[SHARP] breaks the end-to-end transport rule by implementing a
Target Channel Adapter (TCA) in switches. The TCA supports both
Reliable Connection (RC) transport, to enable reliable delivery of
data through the aggregation tree, and Unreliable Datagram (UD)
transport, to enable multicast distribution of the aggregation
result. SHARP has been realized in commodity InfiniBand switches,
but it is based on the InfiniBand architecture. Currently it cannot
interoperate with other network architectures, which limits its
applicability.
Figure 3 shows the SHARP protocol, which has two primary phases:
Aggregation Request and Aggregation Response. The SHARP header is
placed after the IBA header and is followed by aggregation operations
and other data description information.
+---------+---------+-------+------+-----------+--------+--------+----+
|   IBA   |  SHARP  | Tuple | User | Operation | Target | SHARP  | CRC|
| Header  | Header  | Header| Data |  Header   | Header | Payload|    |
+---------+---------+-------+------+-----------+--------+--------+----+
                        Aggregation Request Pkt
+---------+---------+-------+------+--------+----+
|   IBA   |  SHARP  | Tuple | User | SHARP  | CRC|
| Header  | Header  | Header| Data | Payload|    |
+---------+---------+-------+------+--------+----+
             Aggregation Response Pkt
Figure 3: SHARP Protocol
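The two phases can be pictured with a small Python sketch of an aggregation tree. It is only an illustration under assumptions: the tree layout, the host and switch names, and the functions `aggregate_up` and `distribute_down` are hypothetical, and summation stands in for whatever reduction the job requests.

```python
def aggregate_up(tree, node, contributions):
    """Aggregation Request phase: reduce (sum) values from the leaves to `node`.

    tree: dict mapping a switch name to the list of its children.
    contributions: dict mapping a host (leaf) name to its value.
    """
    children = tree.get(node, [])
    if not children:                      # a host contributes its own data
        return contributions[node]
    return sum(aggregate_up(tree, child, contributions) for child in children)

def distribute_down(tree, node, result, out=None):
    """Aggregation Response phase: copy the result to every host below `node`."""
    out = {} if out is None else out
    children = tree.get(node, [])
    if not children:
        out[node] = result
    for child in children:
        distribute_down(tree, child, result, out)
    return out

# Hypothetical two-level tree: one root switch, two leaf switches, four hosts.
tree = {"root": ["sw1", "sw2"], "sw1": ["h1", "h2"], "sw2": ["h3", "h4"]}
contributions = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}
total = aggregate_up(tree, "root", contributions)
results = distribute_down(tree, "root", total)
print(total, results)
```

Each switch forwards one partial result upward instead of relaying every host packet, which is where the bandwidth saving comes from.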
3.2.2. RoCEv2 Solutions
RDMA over Converged Ethernet version 2 (RoCEv2) is an RDMA scheme
based on UDP over Ethernet. Its core design uses InfiniBand's
transport layer, where data is transmitted in sequence and
retransmitted using go-back-N. Therefore, a lossless and ordered
network is required to achieve ideal performance. Priority Flow
Control (PFC) and IP-based Explicit Congestion Notification (ECN)
have been introduced into the network to ensure lossless
transmission. Technical schemes of NACA based on RoCEv2 have been
analyzed in both academia and industry, but unlike InfiniBand SHARP,
there are currently no commercial solutions, which means there is
considerable room for standardization in this area.
[NetReduce] and [Cepheus] are two examples of the server-to-server
communication mode of RoCEv2-based NACA.
[NetReduce] is designed to offload Allreduce to switches. For
ring-Allreduce, workers establish RDMA connections with their front
and rear workers, using RDMA write to send parameters and RDMA read
to receive aggregation results. The switch acts as a
man-in-the-middle that receives data and aggregates it locally, then
returns the results to workers by means of RDMA read. This approach
has little impact on applications, and it improves performance
because it reduces the number of aggregation rounds compared to the
traditional ring-Allreduce method. However, mechanisms such as
transport reliability and flow control are designed based on a
server-to-server communication model, so they need to be redesigned
or adapted accordingly.
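To make the reduction in rounds concrete, the following Python sketch compares a classic ring Allreduce with a deliberately simplified in-network aggregation. It is not the NetReduce protocol: the function names are hypothetical, each chunk is a single number, and the in-network variant is assumed to cost two steps (send to the switch, receive the result).

```python
def ring_allreduce(vectors):
    """Classic ring Allreduce over n workers, each holding n one-number chunks.

    Returns (result_per_worker, steps): n-1 reduce-scatter steps followed by
    n-1 all-gather steps.
    """
    n = len(vectors)
    data = [list(v) for v in vectors]
    steps = 0
    # Reduce-scatter: in step s, worker i sends chunk (i - s) mod n to worker i+1.
    for s in range(n - 1):
        sent = [(i, (i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, chunk, value in sent:
            data[(i + 1) % n][chunk] += value
        steps += 1
    # All-gather: in step s, worker i forwards finished chunk (i + 1 - s) mod n.
    for s in range(n - 1):
        sent = [(i, (i + 1 - s) % n, data[i][(i + 1 - s) % n]) for i in range(n)]
        for i, chunk, value in sent:
            data[(i + 1) % n][chunk] = value
        steps += 1
    return data, steps

def in_network_allreduce(vectors):
    """Toy in-network aggregation: the switch sums element-wise and replies.

    Counted as 2 steps in this model (one send, one result delivery).
    """
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors], 2

workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
ring_result, ring_steps = ring_allreduce(workers)
net_result, net_steps = in_network_allreduce(workers)
print(ring_steps, net_steps)  # 2*(n-1) steps versus 2 in this toy model
```

Both variants produce the same sums; the ring needs a number of steps that grows with the number of workers, while the step count of the toy in-network variant does not.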
Figure 4 illustrates the NetReduce design, and Figure 5 shows the
three phases of the NetReduce protocol: the first packet, the middle
packet, and the last packet. The NetReduce header is built over
RoCEv2. NetReduce has a function similar to SHARP, but it is
designed for the aggregation used in ring-Allreduce, so it contains
ring information, message information, and rank information.
+---------------------------------------------------------+
|                         Switch                          |
|                    man-in-the-middle                    |
|                                                         |
|  +---------------------------------------------------+  |
|  |                                                   |  |
|  |      +--------------+      +--------------+       |  |
|  |      |              |      |              |       |  |
+---------------------------------------------------------+
   |      |              |      |              |       |
   |      |              |      |              |       |
   |      |              |      |              |       |
 +--v------+-+          +-v------+--+         +-v-------+-+
 | worker 1  |          | worker 2  |         | worker 3  |
 +-----------+          +-----------+         +-----------+
Figure 4: NetReduce Illustration
First Pkt
+---------+---------+---------+---------------+----------+------+
|   UDP   | IB BTH  | IB RETH | NetReduce Hdr | Payload  | ICRC |
+---------+---------+---------+---------------+----------+------+
Middle Pkt
+---------+---------+----------+------+
|   UDP   | IB BTH  | Payload  | ICRC |
+---------+---------+----------+------+
Last Pkt
+---------+---------+---------+---------+------+
|   UDP   | IB BTH  | IB IMM  | Payload | ICRC |
+---------+---------+---------+---------+------+
Figure 5: NetReduce Protocol
[Cepheus] is designed to offload Broadcast operations. Multiple
receivers first send RDMA-related information, such as Queue Pair
(QP) number and destination address, to the sender host for
registration, and a Multicast Forwarding Tree (MFT) is built from
this information. Intermediate switches make forwarding decisions
based on their downstream connections. If a leaf switch is directly
connected to a receiver host, it works as an RDMA bridge by modifying
data packets. In this way, multicast is done in the forward
direction, and acknowledgement signals are aggregated in the reverse
direction to achieve reliability. This kind of implementation
requires less modification to native RDMA and has better
compatibility.
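The reverse-direction acknowledgement aggregation can be sketched as follows. This is a minimal Python illustration, assuming cumulative ACKs identified by PSN; the class name `AckAggregator` and the port names are hypothetical and are not taken from [Cepheus].

```python
class AckAggregator:
    """Toy per-switch ACK aggregation for reliable multicast.

    The switch remembers the highest PSN acknowledged on each downstream
    port and reports upstream only the PSN that all ports have acknowledged.
    """

    def __init__(self, downstream_ports):
        self.acked = {port: -1 for port in downstream_ports}
        self.reported = -1

    def on_ack(self, port, psn):
        """Record a cumulative ACK; return a PSN to forward upstream, or None."""
        self.acked[port] = max(self.acked[port], psn)
        common = min(self.acked.values())
        if common > self.reported:
            self.reported = common
            return common
        return None

agg = AckAggregator(["p1", "p2", "p3"])
events = [
    agg.on_ack("p1", 5),  # other ports have not acknowledged anything yet
    agg.on_ack("p2", 3),
    agg.on_ack("p3", 4),  # now every port has acknowledged up to PSN 3
    agg.on_ack("p2", 7),  # the slowest port (p3) limits progress to PSN 4
]
print(events)
```

The sender therefore sees one ACK stream per multicast, paced by the slowest receiver, instead of one ACK stream per receiver.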
Figure 6 shows the Cepheus Multicast Registration Protocol (MRP).
Before starting the Broadcast operation, the source of the multicast
propagates the MRP through the entire fabric to install a multicast
forwarding table in each switch and build an MFT. The receivers'
RDMA information, i.e., QP number and destination IP address, is
defined before the MFT is set up. During multicast, there is no real
RDMA communication. Switches that are directly connected to
receivers modify the packet header to make the logical RDMA
connection complete.
Cepheus MRP
+---------+----------+--------------+
| UDP | Metadata | Node Payload |
+---------+----------+--------------+
Metadata
+---------------+---------------+
| Total | Seq | Node Numbers |
+---------------+---------------+
Node Payload
+----------+---------+---------+
| Node QPN | Node IP | Reserve |
+----------+---------+---------+
Figure 6: Cepheus Multicast Registration Protocol
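The bridging role of a leaf switch can be sketched in Python as below. The sketch assumes that registration has already delivered each receiver's QPN and IP address to the leaf switch; the `Packet` fields, the helper names `register` and `leaf_replicate`, and the example addresses are hypothetical and do not reflect the actual Cepheus data plane.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    dst_ip: str
    dst_qpn: int
    psn: int
    payload: bytes

def register(table, leaf_port, node_ip, node_qpn):
    """Install one receiver (as carried in an MRP node entry) on a leaf port."""
    table.setdefault(leaf_port, []).append((node_ip, node_qpn))

def leaf_replicate(table, pkt):
    """Copy `pkt` to each registered receiver, rewriting destination IP and QPN
    so that every receiver sees what looks like its own RDMA connection."""
    out = []
    for port, receivers in sorted(table.items()):
        for ip, qpn in receivers:
            out.append((port, replace(pkt, dst_ip=ip, dst_qpn=qpn)))
    return out

table = {}
register(table, 1, "10.0.0.2", 0x11)
register(table, 2, "10.0.0.3", 0x22)
multicast_pkt = Packet(dst_ip="239.0.0.1", dst_qpn=0, psn=7, payload=b"x")
copies = leaf_replicate(table, multicast_pkt)
for port, copy in copies:
    print(port, copy)
```

Only the destination fields change per copy; the PSN and payload stay identical, which is what lets the receivers run unmodified RDMA.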
In a RoCEv2 network, there is also a server-to-switch mode, where
switches implement the RDMA protocol stack and workers establish RDMA
connections with switches. This approach is similar to InfiniBand
SHARP, but based on Ethernet. Due to capacity limitations, network
devices are not expected to support the complete RDMA transport
protocol. Even so, the shortcoming of this mode is that it requires
network devices to support RDMA.
3.2.3. iWARP
iWARP [RFC5040] is another RDMA scheme for Ethernet, based on TCP.
Like RoCEv2, iWARP uses InfiniBand Verbs to interact with
applications. RDMAP (Remote Direct Memory Access Protocol) provides
RDMA semantic support for upper layer requests such as RDMA_Send,
RDMA_Read, and RDMA_Write. DDP (Direct Data Placement Protocol)
implements the zero-copy function. A DDP packet contains
information describing the memory
area. Hardware can directly move data in the DDP packet to the
destination in memory through DMA based on the control information in
the DDP packet. This process does not require the involvement of the
CPU. MPA (Marker Protocol Data Unit Aligned Framing) is responsible
for adding control information to the TCP flow according to a defined
algorithm at the sending end, so that the receiving end can recognize
the boundaries of DDP packets in the flow according to the same
algorithm.
+--------------+--------------+--------------+-----------------+------------+----------+
|  TCP Header  |  MPA Header  |  DDP Header  |   RDMA Header   |  Payload   | MPA CRC  |
+--------------+--------------+--------------+-----------------+------------+----------+
Figure 7: iWARP Protocol
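How framing information lets a receiver find packet boundaries in a TCP byte stream can be sketched in Python as follows. This is a simplified illustration, not a conformant MPA implementation: it models only a 2-byte length field, padding to a 4-byte boundary, and a trailing CRC32c. Markers and connection setup are omitted, the byte order chosen for the CRC field is an assumption, and the names `mpa_frame` and `mpa_deframe` are hypothetical.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def mpa_frame(ulpdu: bytes) -> bytes:
    """Build a simplified frame: 2-byte length, ULPDU, pad to 4 bytes, CRC."""
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)
    return body + struct.pack("!I", crc32c(body))

def mpa_deframe(stream: bytes):
    """Split a stream of back-to-back frames into ULPDUs, checking each CRC."""
    ulpdus, pos = [], 0
    while pos < len(stream):
        (length,) = struct.unpack_from("!H", stream, pos)
        body_len = (2 + length + 3) // 4 * 4   # length field + data + padding
        body = stream[pos:pos + body_len]
        (crc,) = struct.unpack_from("!I", stream, pos + body_len)
        if crc != crc32c(body):
            raise ValueError("CRC mismatch")
        ulpdus.append(body[2:2 + length])
        pos += body_len + 4
    return ulpdus

stream = mpa_frame(b"hello") + mpa_frame(b"iWARP!!")
print(mpa_deframe(stream))
```

The sketch also hints at why an in-network change to the payload is hard in iWARP: any modification invalidates the CRC and, if the length changes, shifts every later boundary in the TCP stream.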
Because TCP ensures ordered packet delivery and transmission
reliability, iWARP can adapt to larger network scales than RoCEv2,
but its performance is lower. Because of the high cost of offloading
the complete TCP/IP stack to hardware and the resource-intensive
maintenance of TCP connection state, iWARP is not as widely used as
RoCEv2.
In server-to-server NACA based on iWARP, any change in the payload
may be treated as an interruption of the flow, and any lost packet
must be retransmitted. The transport layer mechanism is too complex
and difficult to modify.
In server-to-switch NACA based on iWARP, due to resource limitations,
network devices are not expected to implement a complete protocol
stack. It is necessary to clarify which parts of existing protocols
must be implemented. Meanwhile, if network devices maintain TCP
connections, they need to manage the associated resources carefully.
4. Requirements
4.1. NACA Function Design and Header Definition
NACA offloads collective operations with low computational precision
and high I/O communication to network devices. Network devices not
only perform packet routing and forwarding, but also need to process
collective messages. Therefore, NACA functions should be designed to
instruct network devices to distinguish and process different
traffic. Accordingly, a NACA header should be designed above the
transport layer to provide the mapping between packets and collective
messages. The following requirements are proposed to support
collective communication optimization:
R1: MUST define a NACA header to indicate which collective operations
switches need to offload, together with relevant information, for
example, message id, sequence number, and job id.
R2: SHOULD support a fallback mechanism, in case the resources of
network devices are insufficient to process complete collective
operations.
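As an illustration of R1 and R2, the following Python sketch packs and parses one possible NACA header and marks a packet for fallback when the switch has no free resources. The layout is purely hypothetical: the field widths, the operation codes, the `FLAG_FALLBACK` bit, and the function names are assumptions made for this example and are not defined by this document.

```python
import struct
from enum import IntEnum

class Op(IntEnum):
    ALLREDUCE = 1
    BROADCAST = 2
    SCATTER = 3

# Hypothetical layout in network byte order:
# op (1 byte), flags (1), job id (2), message id (4), sequence number (4).
NACA_HDR = struct.Struct("!BBHII")
FLAG_FALLBACK = 0x01  # set when a switch could not offload the packet (R2)

def pack_header(op, job_id, message_id, seq, flags=0):
    return NACA_HDR.pack(op, flags, job_id, message_id, seq)

def unpack_header(data):
    op, flags, job_id, message_id, seq = NACA_HDR.unpack_from(data)
    return {"op": Op(op), "fallback": bool(flags & FLAG_FALLBACK),
            "job_id": job_id, "message_id": message_id, "seq": seq}

def switch_process(data, free_slots):
    """Offload if a slot is free; otherwise mark the packet so that the
    end hosts complete the collective operation themselves."""
    if free_slots > 0:
        return "offload", data
    marked = bytearray(data)
    marked[1] |= FLAG_FALLBACK
    return "forward", bytes(marked)

hdr = pack_header(Op.ALLREDUCE, job_id=7, message_id=42, seq=3)
action, out = switch_process(hdr, free_slots=0)
print(action, unpack_header(out))
```

Carrying the fallback indication in the header itself is only one option; signalling it out of band through a controller would satisfy R2 equally well.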
4.2. Bridge RDMA Transport Semantics
As explained in previous sections, the major gap between native RDMA
and NACA lies in the transport semantics. Mechanisms for transport
semantic bridging are needed in order to combine the high-performance
transport capability of RDMA with NACA functionality. Besides, NACA
may not need the full functionality of native RDMA, and it is not
ideal to implement full RDMA functionality within switches, because
of limited hardware resources. For example, most RDMA-based NACA
solutions only call RDMA read, write, send, and receive operations.
Accordingly, the following requirements need to be met:
R3: Transport layer MUST support RDMA function.
R4: SHOULD allow for different RDMA communication modes for NACA as
described in section 3.
R5: In server-to-switch mode, SHOULD clarify which part of the RDMA
functions the switch supports, in order to establish an RDMA
connection with the server and complete NACA.
4.3. RDMA Transport Related Issues
As analyzed in Section 3, iWARP solutions cannot work well with NACA,
because iWARP builds RDMA functions on top of TCP, which is too
complex to implement in switches. The most promising solution is
RDMA over UDP, for example, RoCEv2. However, native RoCEv2 has
several limitations and cannot work very well with NACA in
large-scale clusters. These limitations lie in the mechanisms of
reliability, flow control, and congestion control. For reliability,
go-back-N packet retransmission is inefficient, and it may incur high
buffer occupancy in NACA switches. Priority Flow Control (PFC) also
has a high requirement for buffer space, and for Many-to-One
collective operations, PFC will take up even more buffer space. As
for congestion control, there are many algorithms and not all of them
work well with NACA. A common congestion control mechanism needs to
be designed. Thus, there are the following requirements:
R6: NACA MUST be designed with reliability, and the reliability
mechanism of RoCEv2 SHOULD be modified to be more efficient.
R7: Flow control SHOULD be optimized in order to save more buffer and
memory space for NACA functions.
R8: The congestion control of NACA SHOULD work compatibly with other
congestion control mechanisms applied for other network traffic that
runs in the same fabric.
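The inefficiency of go-back-N that motivates R6 can be quantified with a short Python sketch. It uses a deliberately simple model: all packets of the message are assumed to be in flight when the first loss is detected, and retransmitted packets are assumed to arrive. The function names are hypothetical.

```python
def go_back_n_resent(total_packets, lost):
    """Packets resent when the sender restarts from the first lost PSN."""
    return total_packets - min(lost) if lost else 0

def selective_resent(total_packets, lost):
    """Packets resent when only the lost PSNs are retransmitted."""
    return len(lost)

total = 1000
lost = {10, 500}  # PSNs dropped on first transmission
print(go_back_n_resent(total, lost), selective_resent(total, lost))
```

In this model, two lost packets cause 990 retransmissions with go-back-N but only 2 with selective retransmission, and a NACA switch would have to buffer or re-aggregate every duplicate that passes through it.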
4.4. Joint Design of NACA Task Assignment and Routing Policies
AI model training tasks usually follow a predefined rule: the task
and the training group are fixed, and once the training starts, no
newcomers join the group. On this basis, NACA task assignment
usually follows a centralized pattern. For example, NACA supports
Allreduce by following an aggregation tree, and supports Broadcast by
building a multicast forwarding tree. In contrast, some routing
policies follow distributed patterns. For example, Adaptive Routing
(AR) selects the optimal path at each network node in a distributed
manner. These solutions may not coexist with each other. In order
to better balance traffic management and task assignment:
R9: NACA Task assignment SHOULD be co-designed with routing policies
for joint optimization.
4.5. Security and Traffic Isolation
In multi-tenant deployments, a single switch may need to perform
different NACA functions and forward normal traffic. Since the NACA
header contains collective operation metadata and payload parameters,
incorrectly applying the switch logic designed for NACA to normal
traffic would cause severe security issues. Hence, the security
requirements are as follows:
R10: Resources MUST be isolated on switches to ensure that different
tasks do not interfere with each other, and NACA functions do not
operate on normal traffic.
4.6. Fault Tolerance
Fault tolerance is required because a single network device may go
out of service, due to either a single point of failure or a link
breakdown. Therefore:
R11: A mechanism for choosing an alternative node to implement NACA
functions MUST be designed, to ensure system robustness and
reliability.
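One simple way to read R11 is sketched below in Python: a controller picks the least-loaded healthy candidate switch and falls back to host-based collectives when none is available. The selection policy, the function name `pick_aggregator`, and the switch names are assumptions for illustration only; this document does not mandate a particular mechanism.

```python
def pick_aggregator(candidates, healthy, load):
    """Return the least-loaded healthy candidate, or None if there is none.

    None means that no switch can take over and the job should fall back
    to end-host collective operations.
    """
    alive = [c for c in candidates if c in healthy]
    if not alive:
        return None
    # Ties on load are broken by name to keep the choice deterministic.
    return min(alive, key=lambda c: (load.get(c, 0), c))

candidates = ["sw1", "sw2", "sw3"]
load = {"sw1": 5, "sw2": 2, "sw3": 2}
first = pick_aggregator(candidates, {"sw1", "sw2", "sw3"}, load)
after_failure = pick_aggregator(candidates, {"sw1", "sw3"}, load)
print(first, after_failure)
```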
5. Security Considerations
Some security concerns have been described in
[I-D.yao-tsvwg-cco-problem-statement-and-usecases].
6. Operational Considerations
Use cases like AI model training, distributed storage, and big data
analysis usually need infrastructure to be deployed in clusters that
are operated by single entities, for example, a limited domain
[RFC8799]. In this case, not only the compute and network
infrastructure, but also the application could be owned by a single
service provider. These use cases are typically performance-driven,
which means they need the application and infrastructure to be
co-designed to reach optimal performance. However, applications do
not need to be co-designed with underlying network protocols case by
case. As long as vendors reach consensus on the definition and
realization of the collective operations that would be offloaded, for
example through unified primitives used for implementing collective
communication, applications can leverage the standardized northbound
API to improve performance, even if the applications do not belong to
the same service provider.
7. IANA Considerations
TBD.
8. References
8.1. Normative References
[I-D.yao-tsvwg-cco-problem-statement-and-usecases]
Yao, K., Shiping, X., Li, Y., Huang, H., and D. KUTSCHER,
"Collective Communication Optimization: Problem Statement
and Use cases", Work in Progress, Internet-Draft, draft-
yao-tsvwg-cco-problem-statement-and-usecases-00, 23
October 2023, <https://datatracker.ietf.org/doc/html/
draft-yao-tsvwg-cco-problem-statement-and-usecases-00>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
Garcia, "A Remote Direct Memory Access Protocol
Specification", RFC 5040, DOI 10.17487/RFC5040, October
2007, <https://www.rfc-editor.org/info/rfc5040>.
[RFC8799] Carpenter, B. and B. Liu, "Limited Domains and Internet
Protocols", RFC 8799, DOI 10.17487/RFC8799, July 2020,
<https://www.rfc-editor.org/info/rfc8799>.
8.2. Informative References
[Cepheus] Zhang, W. L. A. J., "Cepheus: Accelerating Datacenter
Applications with High-Performance RoCE-Capable
Multicast", 2024.
[NetReduce]
Liu, S., "In-Network Aggregation with Transport
Transparency for Distributed Training",
DOI 10.1145/3582016.3582037, 2023,
<https://doi.org/10.1145/3582016.3582037>.
[SHARP] Graham, R. L., "Scalable Hierarchical Aggregation and
Reduction Protocol (SHARP): A Hardware Architecture for
Efficient Data Reduction", DOI 10.1109/COMHPC.2016.006,
2016, <https://doi.org/10.1109/COMHPC.2016.006>.
Authors' Addresses
Kehan Yao
China Mobile
Beijing
100053
China
Email: yaokehan@chinamobile.com
Shiping Xu
China Mobile
Beijing
100053
China
Email: xushiping@chinamobile.com
Yizhou Li
Huawei Technologies
Nanjing, Jiangsu
China
Email: liyizhou@huawei.com
Hongyi Huang
Huawei Technologies
Beijing
China
Email: hongyi.huang@huawei.com
Weifeng Wang
New H3C Technologies Co., Ltd
Beijing
China
Email: wangweifeng@h3c.com
Dirk KUTSCHER
THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY (GUANGZHOU)
Guangzhou
China
Email: dku@hkust-gz.edu.cn