Internet DRAFT - draft-liu-nfsv4-rocev2
NFSV4 F. Liu
Internet Draft W. Wang
Intended status: Standards Track R. Liu
Expires: August 2024 H3C
Y. Mu
K. Yao
China Mobile
February 28, 2024
RoCEv2-based Collective Communication Offloading
draft-liu-nfsv4-rocev2-00.txt
Abstract
This draft proposes a design for RoCEv2-based collective
communication offloading. By establishing RDMA connections between
clients and switches, collective operations can be implemented on
network nodes, improving the overall efficiency of collective
communication.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November 10,
2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
Liu, et al. Expires August 28, 2024 [Page 1]
Internet-Draft RoCEv2 CCO February 2024
it for publication as an RFC or to translate it into languages other
than English.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on August 28, 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents carefully,
as they describe your rights and restrictions with respect to this
document. Code Components extracted from this document must include
Simplified BSD License text as described in Section 4.e of the Trust
Legal Provisions and are provided without warranty as described in
the Simplified BSD License.
Table of Contents
1. Introduction...................................................3
2. Terminology and Definitions....................................4
3. Architecture...................................................4
3.1. In-network Computing Aggregation Manager..................6
3.2. In-network Computing Switch...............................7
3.3. In-network Computing Client...............................8
4. Deployment.....................................................9
5. Interaction Process...........................................12
5.1. Control plane............................................12
5.2. Forwarding plane.........................................13
6. Packet encapsulation..........................................15
7. Transport layer requirements..................................16
8. Security Considerations.......................................17
9. IANA Considerations...........................................17
10. References...................................................17
10.1. Normative References....................................17
10.2. Informative References..................................18
11. Acknowledgments..............................................18
1. Introduction
Collective communication means that multiple computers or devices
within a network cooperate and share resources to exchange data and
information more efficiently and securely. Detailed use cases and
problems are described in
[I-D.yao-tsvwg-cco-problem-statement-and-usecases].
Various collective communication operations are used in both
artificial intelligence (AI) and high performance computing (HPC)
workloads, including:
1. Broadcast - spread data from one member to all other members.
2. AllGather - collect data from all members and spread it to all
members.
3. AllToAll - distribute different data from all members to all other
members.
4. Scatter - distribute different data from one member to all other
members.
5. Gather - collect data from all members and send it to one member.
6. Reduce - merge the data of all members and send the result to one
member.
7. AllReduce - merge the data of all members and spread the result to
all members.
8. ReduceScatter - merge different parts of the data of all members
and distribute the results among all members.
9. Barrier - synchronize among all members.
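As an illustration of the semantics of several of the operations
above, the following Python sketch models each member's (rank's) data
as a plain list. The function names and data representation are
explanatory assumptions only; they are not defined by this document.

```python
# Illustrative semantics of a few collective operations. Each inner
# list is one member's (rank's) data.

def broadcast(members, root=0):
    """Every member ends up with a copy of the root's data."""
    return [list(members[root]) for _ in members]

def allgather(members):
    """Every member ends up with the concatenation of all data."""
    gathered = [x for m in members for x in m]
    return [list(gathered) for _ in members]

def allreduce(members, op=sum):
    """Element-wise reduction, result spread to all members."""
    reduced = [op(col) for col in zip(*members)]
    return [list(reduced) for _ in members]

def reducescatter(members, op=sum):
    """Element-wise reduction, each member keeps one part."""
    reduced = [op(col) for col in zip(*members)]
    n = len(members)
    chunk = len(reduced) // n
    return [reduced[i * chunk:(i + 1) * chunk] for i in range(n)]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(allreduce(ranks))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reducescatter(ranks))  # [[11, 22], [33, 44]]
```

Reduce, Gather, Scatter, and AllToAll follow analogously; Barrier has
no data movement and is purely a synchronization point.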
In-network computing enables network devices to participate in
collective communication by offloading the collective communication
operations frequently used by HPC and AI to network nodes. The
requirements and analysis are described in
[I-D.draft-yao-tsvwg-cco-requirement-and-analysis]. Accelerating
collective communication through in-network computing is significant
in several respects. From an application point of view, in-network
computing can significantly reduce communication traffic, improving
overall computing efficiency and application performance. From a
resource utilization point of view, the computational tasks of
processors are shared, so computation is accelerated and overall
resource utilization improves. From a network point of view, the data
flow in the network is reduced and congestion is relieved, so network
utilization improves.
2. Terminology and Definitions
The following terms are used in this document:
Aggregation
The act of collecting and reducing input data from one or more group
members.
Collective
Collective Operation - an operation done by a group of ranks.
Collective Group
A set of ranks that participate in a collective operation.
INC
In-network computing.
INC-switch
Switch with the capability to support INC.
Rank
A member (e.g., a process or device) participating in a collective
operation, identified by its index within the collective group.
3. Architecture
Figure 1 illustrates a conceptual architecture of in-network
computing.
+------------------------+
| |
| INC Switch |
| |
+-------------+ +------------+ |
| | | Switch Chip| |
| | +------------+ |
| | |
+---+----+ +------/----------\------+
| INC AM | // \\
+---+----+ / \\
| / RDMA-RoCEv2 \
| // \\
| / \\
| +----------/----------+ +---------\-----------+
| | | | |
+-| INC Client | | INC Client |
| | | |
| +-------------+ | | +-------------+ |
| | GPU | | | | GPU | |
| +-------------+ | | +-------------+ |
| | | |
+---------------------+ +---------------------+
Figure 1 Architecture
To offload collective communication, the architecture of in-network
computing is composed of three parts: the In-network Computing
Aggregation Manager, the In-network Computing Switch, and the In-
network Computing Client.
o In-network Computing Aggregation Manager (INC AM): the controller
of the entire in-network computing system, mainly responsible for
the generation and management of the Aggregation Tree, issuing in-
network computing related flow tables to the switches, and real-
time monitoring of in-network computing task status.
o In-network Computing Switch (INC Switch): It is the core that
offloads collective communication to network devices. It performs
specific collective communication operations by receiving
corresponding data and operation methods from the in-network
computing client, and finally sends the results to the in-network
computing client. It also provides related operation and
maintenance data, such as in-network computing related task and
message statistics.
o In-network Computing Client (INC client): It is the data source
that needs to perform collective communication in in-network
computing. It is deployed in the computing nodes and is used to
integrate with MPI (Message Passing Interface) library, NCCL
(NVIDIA Collective Communication Library) to send collective
communication data to the in-network computing switch.
3.1. In-network Computing Aggregation Manager
The main function of the in-network computing Aggregation Manager is
to coordinate the establishment and dismantling of the collective
communication group. It also manages the lifecycle of the collective
communication group, and monitors the in-network computing switches
and in-network computing clients through heartbeat detection.
The in-network computing Aggregation Manager must be deployed in a
location that can reach the in-network computing switches and in-
network computing clients; it connects to the in-network computing
clients via gRPC and to the in-network computing switches via NETCONF.
o Topology Information. The in-network computing aggregation
manager must be able to obtain the network topology and the
capabilities of the in-network computing switches, and display all
in-network computing clients and in-network computing switches, as
well as their connection relationships. This document focuses
specifically on the tree topology and does not discuss other
topologies.
o Establishment of the Collective Communication Group. When
offloading mode is used in collective communication, the in-network
computing aggregation manager needs to determine which in-network
computing switches have the capability and resources, and establish
an aggregation tree. All unsupported devices are excluded from the
aggregation tree.
1. Select a root switch; the root is generally a spine switch, so
that all subsequent leaf switches can communicate directly with the
root.
2. Select the communication links between the root and leaf switches.
3. The in-network computing Aggregation Manager configures the in-
network computing switches via NETCONF, obtains information such as
capabilities and RDMA parameters from them, and sends it to the in-
network computing clients.
4. If the topology changes during the lifecycle, the in-network
computing Aggregation Manager needs to dismantle the collective
communication group or establish a new aggregation tree.
o Dismantling of the Collective Communication Group. The
conditions for dismantling the collective communication group
include:
1. An in-network computing client leaving the collective
communication group.
2. Failure of heartbeat detection for an in-network computing client.
3. Link failure.
4. Manual dismantling.
o Resource Allocation. Resource allocation and distribution are
required for in-network computing; the main functions include:
1. Allocating identities for in-network computing: assigning an
identity to each in-network computing switch in the aggregation tree,
and mapping the identities of in-network computing clients to the
identities of in-network computing switches.
2. Establishing QPs in the RDMA protocol.
3. Distributing the in-network computing forwarding table to in-
network computing clients and switches: the in-network computing
Aggregation Manager generates the forwarding table based on the
aggregation tree and distributes it to the in-network computing
clients and switches.
4. Monitoring the status of in-network computing tasks: the in-
network computing Aggregation Manager is responsible for monitoring
the running status of in-network computing clients and switches,
including task status and statistics.
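As an illustration of function 3 above, the following sketch derives
per-node forwarding entries from an aggregation tree. The tree and
entry representations are assumptions for illustration; this document
does not define a concrete table format.

```python
# Derive per-node forwarding entries from an aggregation tree given
# as {node: parent or None}. Shapes here are illustrative only.

def build_forwarding_table(tree):
    """Returns node -> {role, upstream, downstream}."""
    children = {}
    for node, parent in tree.items():
        if parent is not None:
            children.setdefault(parent, []).append(node)
    table = {}
    for node, parent in tree.items():
        table[node] = {
            "role": "root" if parent is None else "leaf",
            "upstream": parent,                            # partial results go here
            "downstream": sorted(children.get(node, [])),  # broadcast targets
        }
    return table

tree = {"spine": None, "leaf1": "spine", "leaf2": "spine",
        "w1": "leaf1", "w2": "leaf1"}
table = build_forwarding_table(tree)
```

In a real deployment, each entry would additionally carry the
identity, task identification, QP, and switch IP fields listed later
in Section 3.3.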
3.2. In-network Computing Switch
The in-network computing switch offloads collective communication of
in-network computing clients. The in-network computing switch is
directly or indirectly connected to the in-network computing clients
and serves as the core for offloading collective communication to
network devices. It performs specific collective communication
operations by receiving corresponding data and instructions from in-
network computing clients, and ultimately sends the results back to
the client or clients. The interface and functions between the in-
network computing switch and the in-network computing aggregation
manager include:
o In-network computing related configuration processing,
specifically including: configuring in-network computing
management addresses, querying in-network computing aggregation
trees, querying in-network computing statistics, and providing
corresponding NETCONF interfaces.
o In-network computing packet parsing and encapsulation: parsing
in-network computing packets sent from in-network computing
clients, performing in-network computing processing, and then re-
encapsulating the in-network computing packets to send to in-
network computing clients or in-network computing root and leaf
switches.
o Performing in-network computing processing based on the in-
network computing forwarding table: supporting collective
communication operations such as AllReduce, Broadcast, Barrier,
etc.
o Providing in-network computing statistics: including packet
statistics based on identity and packet statistics based on QP
(Queue Pair).
3.3. In-network Computing Client
The in-network computing client needs to integrate with collective
communication libraries. OpenMPI and NCCL define standard collective
communication interfaces but allow third parties to provide their own
implementations. The INC Client implements the collective
communication interfaces of OpenMPI and NCCL and realizes the
collective communication algorithms for in-network computing; it can
be integrated into the communication library through plugins or
embedded directly.
When the application calls the MPI_Allreduce interface of OpenMPI or
the corresponding NCCL interface, the call is handled directly by the
INC Client. The INC Client sends the collective communication data,
in the in-network computing encapsulation format, to the in-network
computing switch. The INC Client is also responsible for receiving
the in-network computing response packets from the in-network
computing switch and returning them to the upper-layer application.
The in-network computing client needs to have the following functions:
o Deployed within the computing node and used for integration with
the MPI and NCCL libraries: it needs to provide plugins for
integration with OpenMPI and NCCL respectively, as well as an INC
Client library; the INC Client starts and stops with the MPI
process.
o Responsible for sending and receiving in-network computing
packets: the in-network computing client sends in-network
computing packets to the in-network computing switch based on the
forwarding table issued by the in-network computing aggregation
manager (including identity, task identification, QP, switch IP,
etc.).
o Provide an interface for querying in-network computing task-
related information: mainly including in-network computing task
status, data block size, identity, task identification, QP, and
message statistics.
o Provide INC Client logs.
4. Deployment
Considering that the networking scale can vary with the size of the
AI training workload, in-network computing needs to support both
single-level and multi-level aggregation. In general, single-level
aggregation meets the requirements of in-network computing. If the
aggregation capacity of the in-network computing switch is
insufficient, or in order to save bandwidth between switches, multi-
level aggregation can be adopted.
The networking diagram for single-level aggregation is as follows:
+-------------------------------------+
| INC Switch |
| |
| +-------------+ |
+---------+ | Leaf1 | |
| | | | |
| | | AllReduce | |
| | +-/-+-------+-+ |
+---+----+ +---------//--+-------+-\\------------+
| INC AM | / | | \\
+---+----+ / | | \\
| // | | \\
| / | | \\
| / | | \\
| +---------//----------+-------+-------------\\--------+
| | +------/---+ +------+---+ +-+--------+ +----\-----+ |
+-| |INC Client| |INC Client| |INC Client| |INC Client| |
| | | | | | | | | |
| | Worker1 | | Worker2 | | Worker3 | | Worker4 | |
| +----------+ +----------+ +----------+ +----------+ |
+-----------------------------------------------------+
Figure 2 Single-level Aggregation Network
In a single-level aggregation network environment, the following
operations need to be implemented:
o The in-network computing aggregation manager generates
aggregation trees and assigns Tree IDs for different computing
tasks, and then sends the aggregation tree information to the
switch.
o The in-network computing switch performs local aggregation
based on the aggregation tree information upon receiving packets
from the in-network computing client.
o The in-network computing switch broadcasts the local aggregation
results to the in-network computing clients.
+----------------------------------------+
| INC Switch |
| +-------------+ +-------------+ |
| | Spine1 | | Spine2 | |
| | | | AllReduce | |
+---------+ +-+---------\\+ +/----------+-+ |
| | | \\// | |
| | | //\\ | |
| |+-----+-------+ // \\ +-------+-----+|
+---+----+ || Leaf1 |/ \| Leaf2 ||
| INC AM | || AllReduce | | AllReduce ||
+---+----+ |++--------+---+ +-+----------++|
| +-+--------+----------------+----------+-+
| | | | |
| | | | |
| +---------+--------+----------------+----------+------+
| | +-------+--+ +---+------+ +-------+--+ +-----+----+ |
+-| |INC Client| |INC Client| |INC Client| |INC Client| |
| | | | | | | | | |
| | Worker1 | | Worker2 | | Worker3 | | Worker4 | |
| +----------+ +----------+ +----------+ +----------+ |
+-----------------------------------------------------+
Figure 3 Multi-level Aggregation Network
In a multi-level aggregation network environment, the following
operations need to be implemented:
o The in-network computing aggregation manager generates
aggregation trees and assigns Tree IDs for different computing
tasks, then sends the aggregation tree information to the switch,
and informs the switch of its role: leaf or root.
o The in-network computing switch first performs local
aggregation based on the aggregation tree information upon
receiving data packets from lower-level nodes.
o If it is the root, it indicates that the aggregation is
completed, and broadcasts the aggregation result to all members.
o If it is not the root, it indicates the need for multi-level
aggregation, and sends the local aggregation result to the upper-
level in-network computing switch for further aggregation.
o When a leaf in-network computing switch receives the
aggregation result from the upper-level in-network computing
switch, it continues to broadcast the aggregation result to the
members at the lower level.
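The per-switch handling above can be sketched as follows. The switch
representation, role names, and message shapes are illustrative
assumptions, not protocol definitions; Sum is used as the aggregation
operation.

```python
# Sketch of the per-switch decision logic for multi-level
# aggregation. "mailbox" stands in for the network: sending to a
# node appends to its list of received messages.

def local_aggregate(packets):
    # Element-wise sum of the payloads received from lower-level nodes.
    return [sum(col) for col in zip(*packets)]

def on_upstream_data(switch, packets, mailbox):
    """Aggregate locally, then route based on the switch's role."""
    result = local_aggregate(packets)
    if switch["role"] == "root":
        # Root: aggregation is complete, broadcast the result downward.
        for member in switch["children"]:
            mailbox.setdefault(member, []).append(result)
    else:
        # Leaf: send the partial result up for further aggregation.
        mailbox.setdefault(switch["parent"], []).append(result)

def on_downstream_result(switch, result, mailbox):
    """A leaf relays the final result to its lower-level members."""
    for member in switch["children"]:
        mailbox.setdefault(member, []).append(result)
```

For single-level aggregation the same logic applies with the lone
switch acting as root.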
5. Interaction Process
The interaction process of in-network computing mainly consists of
two parts: the control plane and the forwarding plane. The control
plane is responsible for the establishment and dismantling of in-
network computing communication groups and for resource allocation
and release; the forwarding plane executes the data processing tasks
of specific in-network computing communication groups.
5.1. Control plane
The deployment architecture model starts from the in-network
computing client joining the collective communication group. The in-
network computing aggregation manager allocates the corresponding
resources for in-network computing by establishing the collective
communication group. The in-network computing aggregation manager
must be deployed in a network environment from which both the in-
network computing switches and the in-network computing clients are
reachable. It then completes the registration of switch capabilities,
discovers the topology between the in-network computing switches and
clients, and allocates or releases switch resources according to the
clients' requirements for the collective communication group.
Communication between the in-network computing clients and switches,
and between in-network computing switches, uses the RDMA protocol, so
before RDMA communication it is necessary to apply for QPNs and
create QPs. QP resources can be allocated through CM (Communication
Management), through the Socket API, or through the in-network
computing aggregation manager.
(1) Building a connection between RDMA QPs based on the Socket API
requires establishing a TCP/IP connection between the two nodes
through the Socket API, and then using this connection to exchange
information about both QPs. The application program performs the
TCP/IP connection setup, data exchange, and teardown by calling the
Socket API, and then exchanges information such as the QPNs.
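A minimal sketch of option (1), assuming the QP parameters are
exchanged as newline-delimited JSON over the established TCP
connection. The field names (qpn, gid, psn) follow common RDMA
practice, but the wire format shown here is an assumption, not
something this document defines.

```python
# Exchange QP parameters over an already-connected TCP socket.
import json
import socket

def exchange_qp_info(sock, local_info):
    """Send our QP parameters, then receive the peer's (symmetric)."""
    sock.sendall(json.dumps(local_info).encode() + b"\n")
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before sending QP info")
        buf += chunk
    return json.loads(buf)

# Usage: after the TCP connection is established, both sides call
# exchange_qp_info() with their locally created QP's parameters,
# then transition their QPs to the connected state using the
# peer's values.
```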
(2) CM is a mechanism specifically used in RDMA technology to
establish connections between QPs. It has a set of exclusive message
formats, interaction processes, and user interfaces. The CM protocol
establishes connections through multiple round-trip messages, and it
also specifies the way to disconnect. Users control the CM to send
and receive CM protocol messages through the CM programming interface,
completing the interaction of GID, QPN, and other information.
(3) Considering the complexity of implementation, the QPNs and QPs
for the switches in in-network computing can also be allocated by the
in-network computing aggregation manager, while each in-network
computing client allocates its own QPN and QP and synchronizes the
allocated information to the aggregation manager for unified pairing
and management.
5.2. Forwarding plane
The forwarding plane of in-network computing starts with a client
initiating a data packet. The in-network computing switches receive
the data from the in-network computing clients based on the generated
topology graph and process it; the result is then broadcast to all
member clients. Considering the complexity of multi-level
aggregation, the overall process is divided into upstream and
downstream processes.
We assume Worker1 and Worker2 are attached to the Leaf1 switch,
Worker3 and Worker4 are attached to the Leaf2 switch, and the Leaf1
and Leaf2 switches are attached to the root spine switch. The
specific upstream process is as follows.
o Worker1 and Worker2 send the messages to be aggregated in the
network to Leaf1 in the RoCEv2 message format, carrying
corresponding information such as QP and tree.
o Leaf1 receives the data from Worker1 and Worker2, aggregates it
locally, and then sends the result to the spine switch, carrying
information such as QP and tree.
o Worker3 and Worker4 send the messages to be aggregated in the
network to Leaf2 in the RoCEv2 message format, carrying
corresponding information such as QP and tree.
o Leaf2 receives the data from Worker3 and Worker4, aggregates it
locally, and then sends the result to the spine switch, carrying
information such as QP and tree.
o The spine switch receives the data from Leaf1 and Leaf2 and
completes the aggregation.
The specific downstream process is as follows.
o After completing the aggregation, the spine switch locally
replicates the aggregation result and sends it to the Leaf1 and
Leaf2 switches respectively, carrying QP, tree, and other
information.
o Leaf1 receives the aggregation result from the spine, completes
the local replication, and then sends it to Worker1 and Worker2,
carrying QP, tree, and other information.
o Leaf2 receives the aggregation result from the spine, completes
the local replication, and then sends it to Worker3 and Worker4,
carrying QP, tree, and other information.
The in-network computing switch is crucial for completing the
aggregation operation. The following describes how aggregation is
handled on the switch.
+-------------------------------------------------------------------+
| Tree ID=1 |
| slot0 slot255 |
| +-------+------+-----+------+ +-------+------+-----+------+|
|sum | msg_0 |sum_1 | ... |sum_k |...|msg_255|sum_1 | ... |sum_k ||
| +-------+------+-----+------+ +-------+------+-----+------+|
| |
| +-------+------+-----+------+ +-------+------+-----+------+|
|rank0 | msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
| +-------+------+-----+------+ +-------+------+-----+------+|
| |
| +-------+------+-----+------+ +-------+------+-----+------+|
|rank1 | msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
| +-------+------+-----+------+ +-------+------+-----+------+|
| |
| +-------+------+-----+------+ +-------+------+-----+------+|
|rank63| msg_0 |fp32_1| ... |fp32_k|...|msg_255|fp32_1| ... |fp32_k||
| +-------+------+-----+------+ +-------+------+-----+------+|
+-------------------------------------------------------------------+
Figure 4 The aggregation operation
As shown in the figure above, assuming:
o For a certain tree id, the in-network computing switch needs to
process 64 workers (represented by rank0-rank63).
o Each rank sends 256 messages at a time (represented by
message0-message255).
o The in-network computing switch creates 256 aggregator pools
(corresponding to slot0-slot255) for this tree id, with each slot
responsible for aggregating a column of messages.
For each slot, it is necessary to check the arrival status of each
rank's data under that slot. For example, for slot0, when the
messages sent by rank0-rank63 have all been received and the tree id
and message id are verified to match, the aggregation operation is
performed on each datum (from data1 to datak) in these messages:
o Aggregate rank0 data1, rank1 data1, and so on, up to rank63
data1.
o Aggregate rank0 data2, rank1 data2, and so on, up to rank63
data2.
o Continue until datak: aggregate rank0 datak, rank1 datak, and so
on, up to rank63 datak.
After completing all the data aggregation, different operations are
performed depending on whether the role is root or leaf:
o If it is the root, send the aggregated result to the leaves,
clear the data under the slot, and update the expected message id.
o If it is a leaf, send the aggregated result of the slot to the
root switch, and wait to receive the final aggregated result from
the root before clearing the data under the slot and updating the
expected message id.
Each slot runs independently and does not interfere with each other.
When a slot completes processing, it can initiate the processing of
the next message id separately.
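The slot bookkeeping described above can be sketched as follows. The
class layout is an illustrative assumption rather than a switch
implementation, and Sum is used as the aggregation operation. For a
leaf, clearing would be deferred until the final result returns from
the root; that bookkeeping is omitted here.

```python
# One slot collects one message column from every rank, aggregates
# when all ranks have arrived, then resets for the next message id.

class Slot:
    def __init__(self, num_ranks, expected_msg_id=0):
        self.num_ranks = num_ranks
        self.expected_msg_id = expected_msg_id
        self.pending = {}            # rank -> list of values

    def receive(self, rank, msg_id, data):
        """Returns the aggregated data once all ranks have arrived."""
        if msg_id != self.expected_msg_id:
            return None              # retransmission or wrong column: ignore
        self.pending[rank] = data
        if len(self.pending) < self.num_ranks:
            return None              # still waiting for other ranks
        # All ranks present: aggregate element-wise (Sum shown here).
        result = [sum(col) for col in zip(*self.pending.values())]
        self.pending.clear()         # clear the slot data and advance
        self.expected_msg_id += 1
        return result
```

Because each slot advances its own expected message id independently,
slots can process different message ids concurrently, as described
above.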
6. Packet encapsulation
Communication between in-network computing switches and in-network
computing clients is done through RDMA. RDMA communication requires a
lossless network environment, so in an Ethernet environment the data
messages for in-network computing are carried over RoCEv2. RDMA
generally uses RC (Reliable Connection) mode or UC (Unreliable
Connection) mode. RC mode supports message acknowledgment and timeout
retransmission. If a message times out
without confirmation, all messages after it are retransmitted. In UC
mode, a connection needs to be established in advance and messages do
not need to carry address information, but acknowledgment and
retransmission are not supported, so there is no guarantee that the
other end receives the messages correctly.
The standard Ethernet/IP message format is used, with UDP destination
port 4791 indicating a RoCEv2 message. The Base Transport Header
(BTH) contains the fields present in all IBA transport services; the
16-byte RDMA Extended Transport Header (RETH) carries additional
transport fields for RDMA operations; the 4-byte Immediate Data
Extended Transport Header (ImmDt) follows, after which the data
information related to in-network computing is placed.
The message mainly contains the key information for executing in-
network computing, which includes the following:
(1) Aggregation Tree ID: identifies the collective communication
group.
(2) Collective communication type: the specific operation to be
performed, such as AllReduce, Broadcast, Barrier, etc.
(3) Data type: the specific data type to be processed, such as IEEE
754 floating point in 16, 32, or 64 bits.
(4) Operation type: the specific operation the in-network computing
switch performs after receiving the collective communication message,
such as Sum (add the data together), Min (find the minimum value), or
Max (find the maximum value).
(5) Payload: the data that is transferred through RDMA in in-network
computing.
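A sketch of composing such a message with Python's struct module. The
12-byte BTH layout follows the InfiniBand specification; the 8-byte
in-network computing header shown is an assumed layout for
illustration only, since this document does not define exact field
offsets or type codes.

```python
import struct

ROCEV2_UDP_PORT = 4791  # UDP destination port indicating RoCEv2

def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF):
    # BTH (12 bytes): opcode(1) se/m/pad/tver(1) pkey(2)
    #                 resv(1) destQP(3) ack/resv(1) psn(3)
    return (struct.pack("!BBH", opcode, 0, pkey) +
            b"\x00" + dest_qp.to_bytes(3, "big") +
            b"\x00" + psn.to_bytes(3, "big"))

def pack_inc_header(tree_id, coll_type, data_type, op_type):
    # Hypothetical 8-byte INC header: tree id (4B), collective /
    # data / operation type codes (1B each), 1 reserved byte.
    return struct.pack("!IBBBB", tree_id, coll_type, data_type,
                       op_type, 0)

bth = pack_bth(opcode=0x64, dest_qp=0x1234, psn=1)
inc = pack_inc_header(tree_id=1, coll_type=0, data_type=2, op_type=0)
assert len(bth) == 12 and len(inc) == 8
```

The RETH and ImmDt headers would sit between the BTH and the INC
header in the full message; the in-network computing payload follows.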
7. Transport layer requirements
Data packets may be lost due to link quality, switch buffer overflow,
or other abnormal conditions. If packet loss occurs, the in-network
computing client is responsible for retransmission. In RC mode, all
retransmissions are guaranteed by the RDMA transport layer. In UC
mode, the retransmission process for in-network computing is as
follows:
(1) The in-network computing client sends a packet with MessageID = n
and starts the packet retransmission timer.
(2) If the corresponding response packet with MessageID = n is
received before the retransmission timer times out, the next
MessageID packet is sent and the packet retransmission timer is reset.
(3) If the packet retransmission timer times out, the packet with
MessageID = n is retransmitted until it is successfully sent.
(4) A threshold N can be set to indicate that if N timeouts occur
without successful transmission, the aggregation manager should be
notified for error handling.
The in-network computing switch processes the data packets passively.
To determine whether a received packet is a retransmission and to
prevent duplicate aggregation, the switch needs to record whether the
corresponding MessageID packet has already been received.
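The client-side retransmission steps (1)-(4) and the switch-side
duplicate check can be sketched as follows, with timer handling
simplified to a synchronous wait; all names are illustrative
assumptions rather than a defined API.

```python
# Client side: resend MessageID n until acknowledged, escalating
# after N consecutive timeouts (step 4).

def send_reliably(send, wait_for_ack, msg_id, payload,
                  timeout=0.2, max_retries=8):
    for _ in range(max_retries):
        send(msg_id, payload)                   # (1)/(3) transmit or retransmit
        if wait_for_ack(msg_id, timeout):       # (2) ack in time: done
            return True
    return False   # (4) caller notifies the aggregation manager

# Switch side: drop retransmitted MessageIDs already aggregated.
class DedupFilter:
    def __init__(self):
        self.seen = set()

    def accept(self, msg_id):
        if msg_id in self.seen:
            return False          # duplicate: do not aggregate again
        self.seen.add(msg_id)
        return True
```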
8. Security Considerations
The in-network computing scheme may introduce some security and
privacy concerns.
Offloading collective operations may introduce new risks to the
network. The information exchanged among the INC aggregation manager,
INC switches, and INC clients may be topologically sensitive. It may
disclose the location of computing resources hosted in the network
and of service sites, and attackers can use this information to
identify vulnerable points in the network. For example, an attacker
may tamper with network topology information to interrupt customer
service delivery, or even redirect traffic elsewhere. The solution
should support authentication and integrity protection mechanisms to
enhance security.
9. IANA Considerations
TBD
10. References
10.1. Normative References
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol
Specification Version 2", RFC 5531, May 2009,
<https://www.rfc-editor.org/info/rfc5531>.
[RFC6241] Enns, R., Bjorklund, M., Schoenwaelder, J., and A. Bierman,
"Network Configuration Protocol (NETCONF)", RFC 6241, June
2011, <https://www.rfc-editor.org/info/rfc6241>.
10.2. Informative References
[I-D.yao-tsvwg-cco-problem-statement-and-usecases] Yao, K., Xu, S.,
Li, Y., Huang, H., and D. Kutscher, "Collective Communication
Optimization: Problem Statement and Use cases", Work in
Progress, Internet-Draft, draft-yao-tsvwg-cco-problem-
statement-and-usecases-00, 23 October 2023,
<https://datatracker.ietf.org/doc/draft-yao-tsvwg-cco-
problem-statement-and-usecases/>.
[I-D.draft-yao-tsvwg-cco-requirement-and-analysis] Yao, K., Xu, S.,
Li, Y., Huang, H., Wang, W., and D. Kutscher, "Collective
Communication Optimizations: Requirement and Analysis",
Work in Progress, Internet-Draft, draft-yao-tsvwg-cco-
requirement-and-analysis-01, 5 February 2024,
<https://datatracker.ietf.org/doc/draft-yao-tsvwg-cco-
requirement-and-analysis/>.
11. Acknowledgments
TBD
Authors' Addresses
Feng Liu
New H3C Technologies Co., Ltd
Hangzhou, China
Email: 11957147@qq.com
Weifeng Wang
New H3C Technologies Co., Ltd
Beijing, China
Email: wangweifeng@h3c.com
Rubing Liu
New H3C Technologies Co., Ltd
Hangzhou, China
Email: liurubing@h3c.com
Yan Mu
China Mobile
Beijing, China
Email: muyan@chinamobile.com
Kehan Yao
China Mobile
Beijing, China
Email: yaokehan@chinamobile.com