Network Working Group H. Wang
Internet-Draft Huawei
Intended status: Standards Track K. Yao
Expires: 2 September 2024 China Mobile
W. Pan
H. Huang
Huawei
1 March 2024
Application-aware Data Center Network (APDN) Use Cases and Requirements
draft-wh-rtgwg-application-aware-dc-network-02
Abstract
The deployment of large-scale AI services within data centers
introduces significant challenges to established technologies,
including load balancing and congestion control. Additionally, the
adoption of cutting-edge network technologies, such as in-network
computing, is on the rise within AI-centric data centers. These
advanced network-assisted application acceleration technologies
necessitate the flexible exchange of cross-layer interaction
information between end-hosts and network nodes.
The Application-aware Data Center Network (APDN) leverages the
application-side extension of the Application-aware Networking (APN)
framework to furnish the data center network with detailed
application-aware information.  This approach facilitates the rapid
advancement of
network-application co-design technologies. This document delves
into the use cases of APDNs and outlines the associated requirements,
setting the stage for enhanced performance and efficiency in data
center operations tailored to the demands of AI services.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 2 September 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1.  Introduction
  1.1.  Terminology
  1.2.  Requirements Language
2.  Use Cases and Requirements for Application-aware Data Center
    Network
  2.1.  Fine-grained Packet Scheduling for Load Balancing
  2.2.  Enhancing Distributed Machine Learning Training with
        In-Network Computing
  2.3.  Enhanced Congestion Control with Precise Feedback Mechanisms
3.  Encapsulation
4.  Security Considerations
5.  IANA Considerations
6.  References
  6.1.  Normative References
  6.2.  Informative References
Acknowledgements
Contributors
Authors' Addresses
1. Introduction
The advent of large AI models such as AlphaGo and ChatGPT has made
distributed training of large models a pivotal workload within
large-scale data centers.  To improve the efficiency of training
these substantial models, a large number of computing units, such as
thousands of GPUs operating in tandem, are deployed for parallel
processing with the aim of minimizing job completion time (JCT).
This setup necessitates frequent, bandwidth-heavy communication
among the concurrent computing nodes, introducing a novel
multi-party communication mode that demands heightened throughput
performance, load balancing proficiency, and congestion management
capabilities from the data center network.
Traditionally, data center technology primarily views the network as
a mere conduit for data transmission for upper-layer applications,
offering basic connectivity services. Yet, the scenario of large AI
model training is increasingly incorporating network-assisted
technologies, such as offloading parts of the computation to the
network. This approach seeks to boost AI job efficiency through the
joint optimization of network communication and computing
applications. In many current instances of network assistance,
operators tailor and implement proprietary protocols on a limited
scale, leading to a lack of widespread interoperability.
However, as AI data centers grow and diversify in offering cloud
services for various AI tasks, emerging data center network
technologies must account for serving different transports and
applications. Building large-scale data centers now involves not
just ensuring device interoperability but also facilitating
interaction between network devices and end-host services.
This document illustrates use cases that require the exchange of
application-aware information between network nodes and
applications.  Current ways of conveying such information are
constrained by the limited extensibility of packet headers, where
only coarse-grained information can be exchanged between the network
and the host through a limited space (for example, the one-bit ECN
congestion mark [RFC3168] or the DSCP field in the IP header).
The Application-aware Networking (APN) framework
[I-D.li-apn-framework] delineates how application-aware information,
including APN identification (ID) and/or parameters (e.g., network
performance requirements), is encapsulated by network edge devices.
This information is then carried in packets across an APN domain to
support service provisioning, enable fine-grained traffic steering,
and adjust network resources. An extension of the APN framework
caters to the application side [I-D.li-rtgwg-apn-app-side-framework],
allowing APN domain resources to be allocated to applications that
encapsulate the APN attribute in packets.
This document delves into the application side of the APN framework
to foster enriched interaction between hosts and networks within the
data center, outlining several use cases and the corresponding
requirements for the Application-aware Data Center Network (APDN).
1.1. Terminology
APDN: Application-aware Data Center Network
SQN: Sequence Number
ToR: Top-of-Rack switch
PFC: Priority-based Flow Control
NIC: Network Interface Card
ECMP: Equal-Cost Multi-Path routing
AI: Artificial Intelligence
JCT: Job Completion Time
PS: Parameter Server
INC: In-Network Computing
APN: Application-aware Networking
1.2. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Use Cases and Requirements for Application-aware Data Center Network
2.1. Fine-grained Packet Scheduling for Load Balancing
Traditional data centers utilize the per-flow Equal-Cost Multi-Path
(ECMP) method to distribute traffic evenly across several paths.
These centers, primarily focused on cloud computing, handle a vast
number of data flows. Despite the large quantity, these flows are
predominantly small and short-lived, allowing the ECMP method to
facilitate a nearly uniform traffic distribution across multiple
pathways.
By contrast, the communication dynamics shift markedly during the
training of large AI models.  This process demands unprecedented
bandwidth, where a single data flow between machines can saturate
the upstream bandwidth of a server's egress Network Interface Card
(NIC), with single-flow throughput approaching or exceeding
100 Gb/s.
Applying traditional per-flow ECMP strategies, such as hash-based or
round-robin flow placement, often results in the concurrent
allocation of large ("elephant") flows to a single path.  This can
lead to severe congestion, notably when two simultaneous 100 Gb/s
flows vie for the same 100 Gb/s link, significantly impacting the
completion time of AI jobs.
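As a minimal sketch of the collision problem (illustrative only;
real switches use vendor-specific hardware hash functions), consider
a hash-based per-flow ECMP selector: whenever the five-tuples of two
elephant flows hash to the same index, both flows share one uplink
while the remaining uplinks stay idle.

   # Sketch of hash-based per-flow ECMP path selection.
   import zlib

   UPLINKS = 4  # number of equal-cost uplinks

   def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto):
       # Hash the five-tuple so that every packet of a flow takes
       # the same uplink, preserving per-flow packet order.
       key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}"
       return zlib.crc32(key.encode()) % UPLINKS

   # If these two calls return the same index, two 100 Gb/s
   # elephant flows contend for a single 100 Gb/s uplink:
   print(ecmp_uplink("10.0.0.1", "10.0.1.1", 49152, 4791, 17))
   print(ecmp_uplink("10.0.0.2", "10.0.1.2", 49153, 4791, 17))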
To mitigate these issues, there is a pivotal shift towards
implementing a fine-grained, per-packet ECMP strategy.  This
approach spreads the packets of a single flow across multiple paths,
improving balance and preventing congestion.  However, because
propagation and switching delays vary across these paths, such a
strategy may result in significant packet reordering upon arrival at
the destination, thereby degrading the performance of both the
transport and application layers.
A viable solution under per-packet ECMP is to resequence
out-of-order packets at the egress Top-of-Rack (ToR) switch.  This
assumes that multipath transmission extends from the ingress ToR to
the egress ToR, with the reordering principle ensuring that the
packet departure sequence from the egress ToR mirrors the arrival
sequence at the ingress ToR.
Achieving packet reordering at the egress ToR necessitates a clear
indication of packet arrival sequences at the ingress ToR. Current
protocols do not directly mark sequence numbers (SQNs) at the
Ethernet and IP layers.
* Presently, SQNs are encapsulated within transport layers (e.g.,
TCP, QUIC, RoCEv2) or application protocols. Relying on these
SQNs for packet reordering requires network devices to interpret a
vast array of transport/application layer information.
* SQNs at the transport/application layer are allocated per flow,
with each having distinct sequence number spaces and initial
values. These cannot directly represent the packet arrival
sequence at the initial ToR. Although assigning a specific
reordering queue to each flow at the egress ToR and reordering
based on upper-layer SQNs is conceivable, the associated hardware
resource demands are significant.
* Direct modification of upper-layer SQNs by network devices to
reflect ToR-ToR pairwise SQNs compromises end-to-end transmission
reliability.
Consequently, a mechanism is essential to convey explicit ordering
information across the multipath forwarding domain, from the first
device to the final device with reordering capability.
The Application-aware Networking (APN) framework is proposed to
transport critical ordering information. In this context, it records
the sequence number of packets as they arrive at the ingress ToR
(each ToR-ToR pair having a unique, incremental SQN), facilitating
packet reordering by the egress ToR based on this data.
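The following sketch illustrates this reordering logic
(hypothetical; the SQN field layout and its APN encapsulation are
not defined by this document).  The egress ToR keeps, per
ingress-egress ToR pair (or per port/queue pair, matching the SQN
granularity), the next expected SQN and a buffer of early arrivals.

   # Sketch of egress-ToR resequencing keyed on an APN-carried SQN.
   import heapq

   class Resequencer:
       def __init__(self):
           self.next_sqn = 0  # next SQN expected from the ingress ToR
           self.pending = []  # min-heap of (sqn, packet) early arrivals

       def receive(self, sqn, packet):
           """Buffer one arrival; return packets now in ingress order."""
           heapq.heappush(self.pending, (sqn, packet))
           released = []
           while self.pending and self.pending[0][0] == self.next_sqn:
               released.append(heapq.heappop(self.pending)[1])
               self.next_sqn += 1
           return released

   r = Resequencer()
   assert r.receive(1, "p1") == []            # early: held back
   assert r.receive(0, "p0") == ["p0", "p1"]  # gap closed: release both
   assert r.receive(2, "p2") == ["p2"]

A deployable version would also need a timeout so that a lost packet
does not stall the queue indefinitely; the path details of [REQ1-3]
help distinguish such losses from mere reordering.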
Requirements:
* [REQ1-1] The APN framework SHOULD tag each packet with an SQN
alongside the APN ID to enable reordering. The ingress ToR SHOULD
assign and log an SQN for each packet based on its arrival
sequence, with SQN granularity adaptable to ToR-ToR, port-port, or
queue-queue levels.
* [REQ1-2] The APN-encapsulated SQN MUST remain unaltered within the
multipath domain and MAY be removed at the egress device.
* [REQ1-3] The APN framework SHOULD convey the necessary queue
information (i.e., the sorting queue ID) to support fine-grained
reordering.  The queue ID SHOULD match the granularity of SQN
assignment.  Additionally, the APN framework MAY transport path
details to expedite the differentiation between out-of-order
packets and packet loss.
2.2. Enhancing Distributed Machine Learning Training with In-Network
Computing
Distributed machine learning training frequently employs the
AllReduce communication pattern [mpi-doc] for efficient
cross-accelerator data transfer.  This pattern is pivotal in
scenarios involving data and model parallelism, where parallel
execution across multiple processors necessitates the exchange of
intermediate results, such as gradient data, as a core component of
the communication process.
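As a minimal illustration of the semantics (a host-side sketch;
actual implementations use MPI or NCCL collectives), AllReduce
leaves every worker holding the element-wise reduction of all
workers' inputs:

   # Sketch of AllReduce-sum semantics for gradient exchange:
   # every worker ends up with the element-wise sum of all inputs.
   def allreduce_sum(worker_inputs):
       total = [sum(col) for col in zip(*worker_inputs)]
       return [list(total) for _ in worker_inputs]  # one copy each

   grads = [[1, 2], [3, 4], [5, 6]]    # per-worker gradient vectors
   assert allreduce_sum(grads) == [[9, 12]] * 3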
The Parameter Server (PS) architecture [atp], which centralizes
gradient-data aggregation at a server fed by multiple clients and
redistributes the aggregated results, often faces incast congestion
due to simultaneous large-volume data transmissions to the server.
In-network computing (INC) introduces a paradigm shift by delegating
the server's processing tasks to network switches. Utilizing network
devices equipped with high-capacity switching and computational
abilities (for basic arithmetic operations) as surrogate parameter
servers for gradient aggregation enables the consolidation of
multiple data streams into a singular network stream. This approach
not only alleviates server-side incast congestion but also leverages
the superior speed of on-switch computing (e.g., ASICs) over
traditional server-based processing (e.g., CPUs), offering a boon to
distributed computing applications.
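A minimal sketch of the aggregation step follows (assumptions: each
packet carries a (job, segment) key, the worker count is
preconfigured, and gradients are fixed-length integer vectors; real
INC designs such as ATP quantize floating-point gradients):

   # Sketch of switch-side gradient aggregation for INC
   # (hypothetical packet fields; no standardized INC protocol).

   class IncAggregator:
       def __init__(self, num_workers):
           self.num_workers = num_workers
           self.partial = {}  # (job_id, seg_id) -> (sum_vector, count)

       def on_packet(self, job_id, seg_id, gradients):
           """Fold one worker's segment in; emit the sum when complete."""
           key = (job_id, seg_id)
           vec, count = self.partial.get(key, ([0] * len(gradients), 0))
           vec = [a + b for a, b in zip(vec, gradients)]
           if count + 1 == self.num_workers:
               del self.partial[key]  # N worker flows leave the switch
               return vec             # as one aggregated stream
           self.partial[key] = (vec, count + 1)
           return None                # wait for the remaining workers

   agg = IncAggregator(num_workers=3)
   assert agg.on_packet(7, 0, [1, 2]) is None
   assert agg.on_packet(7, 0, [3, 4]) is None
   assert agg.on_packet(7, 0, [5, 6]) == [9, 12]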
As outlined in [I-D.draft-lou-rtgwg-sinc], the realization of INC
requires network devices to comprehend the computing tasks dictated
by applications, including the accurate parsing of relevant data
units and the coordination of synchronization signals across diverse
data sources.
Existing implementations such as ATP [atp] and NetReduce [netreduce]
require switches to interpret upper-layer protocols and
application-specific logic, each tailored to a particular
application due to the absence of standardized transport or
application protocols for INC.  To accommodate a broad spectrum of
INC applications, network devices must exhibit versatility across
various protocol formats.
Moreover, while end users may encrypt payloads for security, they
might be inclined to expose certain non-sensitive data to benefit
from accelerated INC operations. However, the current protocol
landscape does not facilitate easy access to necessary INC data
without decrypting the entire payload, posing interoperability
challenges between applications and INC functionalities.
The Application-aware Networking (APN) framework emerges as a
solution, capable of conveying essential information for INC tasks
and their associated data segments, thereby enabling the offloading
of specific computational tasks to the network.
Requirements:
* [REQ2-1] The APN framework MUST include identifiers to
differentiate among INC tasks.
* [REQ2-2] The APN framework MUST accommodate the transport of
application data in varied formats and lengths, such as gradient
data for INC, along with the specified operations.
* [REQ2-3] To augment INC efficiency, the APN framework SHOULD
transmit additional application-aware information to support
computational processes without undermining end-to-end transport
reliability.
* [REQ2-4] The APN framework MUST have the capability to convey
comprehensive INC outcomes and document the computational status
within data packets.
2.3. Enhanced Congestion Control with Precise Feedback Mechanisms
Data center environments encompass various congestion scenarios,
notably:
* The prevalent use of multi-accelerator collaborative AI model
training, employing AllReduce and All2All communication patterns
(Section 2.2), often leads to server-side incast congestion as
multiple clients simultaneously transmit substantial volumes of
gradient data.
* Diverse load balancing methodologies across different flows can
induce overload conditions on specific links.
* The inherent randomness of service access within data centers
frequently triggers traffic bursts, extending queue lengths and
precipitating congestion.
To mitigate these challenges, the industry has developed an array of
congestion control algorithms tailored for data center networks.
ECN-based congestion control mechanisms, such as DCTCP [RFC8257] and
DCQCN [dcqcn], leverage ECN marks applied according to switch buffer
occupancy levels to signal congestion.
However, these approaches are constrained by the use of a singular
1-bit mark within packet headers to denote congestion, limiting the
scope of conveyed congestion details due to header space
restrictions. Alternative strategies, such as HPCC++
[I-D.draft-miao-ccwg-hpcc], adopt in-band telemetry to cumulatively
append congestion data at each hop, increasing packet length and
bandwidth consumption.
A compromise solution, AECN [I-D.draft-shi-ippm-advanced-ecn],
encapsulates critical congestion indicators gathered along the path,
such as queue delay and the number of congested hops, while
minimizing overhead through hop-by-hop aggregation.  This model
allows end-hosts to specify the congestion metrics of interest, with
network devices incrementally compiling this data en route.  The APN
framework can facilitate this nuanced exchange, enabling tailored
congestion data accumulation.
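A sketch of this hop-by-hop aggregation follows (hypothetical metric
names; the sender's request and the aggregated values would ride in
the APN encapsulation).  In contrast to per-hop telemetry appending,
the carried state stays constant in size regardless of path length.

   # Sketch of AECN-style aggregated congestion feedback.

   def update_metrics(m, queue_delay_us, congested, hop_id):
       """Fold one hop's local state into the carried metrics."""
       if queue_delay_us > m["max_queue_delay_us"]:
           m["max_queue_delay_us"] = queue_delay_us
           m["worst_hop"] = hop_id  # tag the collector (cf. [REQ3-2])
       if congested:
           m["congested_hops"] += 1
       return m

   # The sender initializes only the metrics it wants gathered
   # (cf. [REQ3-1]); each hop updates them in place:
   m = {"max_queue_delay_us": 0, "congested_hops": 0, "worst_hop": None}
   path = [(5, False), (80, True), (12, False)]  # (delay, congested)
   for hop_id, (delay, congested) in enumerate(path):
       m = update_metrics(m, delay, congested, hop_id)
   assert m == {"max_queue_delay_us": 80, "congested_hops": 1,
                "worst_hop": 1}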
Requirements:
* [REQ3-1] The APN framework MUST empower data senders to specify
the congestion metrics they wish to gather.
* [REQ3-2] The APN framework MUST enable network nodes to log and
update the selected measurements accordingly.  These may encompass
metrics such as port queue lengths, link monitoring rates, PFC
frame counts, and probed RTTs and their variability, among others.
Additionally, the APN framework MAY tag each measurement with its
collector, assisting in the identification of potential congestion
points.
3. Encapsulation
The encapsulation of the application-aware information required by
the APDN use cases in the APN header [I-D.draft-li-apn-header] will
be defined in a future version of this draft.
4. Security Considerations
TBD.
5. IANA Considerations
This document has no IANA actions.
6. References
6.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
6.2. Informative References
[mpi-doc] "Message-Passing Interface Standard", August 2023,
<https://www.mpi-forum.org/docs/mpi-4.1>.
[dcqcn] "Congestion Control for Large-Scale RDMA Deployments",
n.d.,
<https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/
p523.pdf>.
[netreduce]
"NetReduce - RDMA-Compatible In-Network Reduction for
Distributed DNN Training Acceleration", n.d.,
<https://arxiv.org/abs/2009.09736>.
[atp] "ATP - In-network Aggregation for Multi-tenant Learning",
n.d.,
<https://www.usenix.org/conference/nsdi21/presentation/
lao>.
[I-D.li-apn-framework]
Li, Z., Peng, S., Voyer, D., Li, C., Liu, P., Cao, C., and
G. S. Mishra, "Application-aware Networking (APN)
Framework", Work in Progress, Internet-Draft, draft-li-
apn-framework-07, 3 April 2023,
<https://datatracker.ietf.org/doc/html/draft-li-apn-
framework-07>.
[I-D.li-rtgwg-apn-app-side-framework]
Li, Z. and S. Peng, "Extension of Application-aware
Networking (APN) Framework for Application Side", Work in
Progress, Internet-Draft, draft-li-rtgwg-apn-app-side-
framework-00, 22 October 2023,
<https://datatracker.ietf.org/doc/html/draft-li-rtgwg-apn-
app-side-framework-00>.
[I-D.draft-lou-rtgwg-sinc]
Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
"Signaling In-Network Computing operations (SINC)", Work
in Progress, Internet-Draft, draft-lou-rtgwg-sinc-01, 15
September 2023, <https://datatracker.ietf.org/doc/html/
draft-lou-rtgwg-sinc-01>.
[RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
October 2017, <https://www.rfc-editor.org/rfc/rfc8257>.
[I-D.draft-miao-ccwg-hpcc]
Miao, R., Anubolu, S., Pan, R., Lee, J., Gafni, B.,
Tantsura, J., Alemania, A., and Y. Shpigelman, "HPCC++:
Enhanced High Precision Congestion Control", Work in
Progress, Internet-Draft, draft-miao-ccwg-hpcc-02, 29
February 2024, <https://datatracker.ietf.org/doc/html/
draft-miao-ccwg-hpcc-02>.
[I-D.draft-shi-ippm-advanced-ecn]
Shi, H., Zhou, T., and Z. Li, "Advanced Explicit
Congestion Notification", Work in Progress, Internet-
Draft, draft-shi-ippm-advanced-ecn-00, 11 December 2023,
<https://datatracker.ietf.org/doc/html/draft-shi-ippm-
advanced-ecn-00>.
[I-D.draft-li-apn-header]
Li, Z., Peng, S., and S. Zhang, "Application-aware
Networking (APN) Header", Work in Progress, Internet-
Draft, draft-li-apn-header-04, 12 April 2023,
<https://datatracker.ietf.org/doc/html/draft-li-apn-
header-04>.
Acknowledgements
Contributors
Authors' Addresses
Haibo Wang
Huawei
Email: rainsword.wang@huawei.com
Kehan Yao
China Mobile
Email: yaokehan@chinamobile.com
Wei Pan
Huawei
Email: tarzan.pan@huawei.com
Hongyi Huang
Huawei
Email: hongyi.huang@huawei.com