Routing Area Working Group S. Xu
Internet-Draft K. Yao
Intended status: Informational China Mobile
Expires: 11 January 2024 10 July 2023
Topology-aware Collective Communication in In-Network Computing Enabled
Network: Problem Statement and Requirements
draft-xu-rtgwg-topo-aware-collective-with-inc-00
Abstract
This document analyses the mapping between the logical and physical
topologies of collective communication in In-Network Computing (INC)
enabled networks, as well as the impact of topology-aware collective
communication algorithms on INC enabled large-scale computing
clusters. It also proposes requirements for designing efficient
mapping mechanisms between logical and physical topologies and
topology-aware collective communication algorithms.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 11 January 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
Xu & Yao Expires 11 January 2024 [Page 1]
Internet-Draft Routing Area Working Group July 2023
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Conventions Used in This Document
2.1. Terminology
2.2. Requirements Language
3. Problem Statement
4. Requirements
5. Security Considerations
6. IANA Considerations
7. Informative References
Authors' Addresses
1. Introduction
Large-scale supercomputing systems have grown significantly in recent
years. At the heart of these systems are compute nodes based on
modern multi-core architectures and high-speed networks. These
systems offer vast amounts of computing power and resources to
application developers and allow scientific applications to scale out
to tens of thousands of processes.
These processes rely on the Message Passing Interface (MPI) to
exchange information and carry out parallel computation. The
hardware network is a physical network, while the communication
between processes, which is independent of hardware devices, is
abstracted as a logical network. An important aspect of
communication in parallel computing is a rational mapping between the
logical network and the physical network. When INC is introduced,
the network hardware can also take part in collective communication,
which in turn affects the overall communication model. Therefore, in
INC enabled large-scale clusters, the mapping rules need to be
adjusted accordingly.
In large-scale clusters, network contention can significantly impact
application performance when the allocated processors are scattered
across different racks. It is critical to discover the topology of
such clusters and to design collective message exchange algorithms
that are aware of the topology in order
to improve the overall performance of real-world applications. After
INC is introduced, topology discovery algorithms should not be limited
to factors such as network structure and bandwidth, but should also
consider factors such as INC capacities and computational load.
2. Conventions Used in This Document
2.1. Terminology
INC In-Network Computing
MPI Message Passing Interface
2.2. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
3. Problem Statement
In traditional mode, computing tasks are completed by the computers
and servers in the cluster. After INC is enabled, some of the
computing tasks are offloaded to network devices. As a result, for
the same MPI primitive, the communication subjects in the logical
topology can be mapped not only to computers but also to network
devices. In addition, implementing certain MPI primitives with INC
may result in a logical topology that differs from the traditional
one. Current topology mapping mechanisms do not take either of these
effects into account.
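As an illustration only, the following minimal sketch shows how the
mapping of logical subjects could differ between the two modes for a
one-level AllReduce. The names PhysicalNode, map_allreduce, and
allreduce, the "reducer" subject, and the single-switch topology are
all hypothetical and are not defined by this document or by MPI.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PhysicalNode:
    """Hypothetical physical subject: a host or a network device."""
    name: str
    kind: str                  # "host" or "switch"
    inc_capable: bool = False  # assumed flag: device can aggregate in-network


def map_allreduce(hosts: List[PhysicalNode], switch: PhysicalNode,
                  use_inc: bool) -> Dict[str, PhysicalNode]:
    """Map logical subjects (ranks plus one reducer) to physical nodes."""
    mapping = {f"rank{i}": host for i, host in enumerate(hosts)}
    if use_inc and switch.inc_capable:
        # INC mode: the reduction subject is placed on the network device.
        mapping["reducer"] = switch
    else:
        # Traditional mode: the reduction subject shares a host with rank 0.
        mapping["reducer"] = hosts[0]
    return mapping


def allreduce(values: Dict[str, int],
              mapping: Dict[str, PhysicalNode]) -> Dict[str, int]:
    """Sum one integer per rank and hand the total back to every rank.

    The sum is conceptually executed on mapping["reducer"]; the result
    does not depend on which physical node that is.
    """
    ranks = [subject for subject in mapping if subject != "reducer"]
    total = sum(values[rank] for rank in ranks)
    return {rank: total for rank in ranks}
```

In this sketch the logical topology gains a subject that may live on
a switch, while every rank still receives the same result in both
modes.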
How to use topology-aware algorithms to improve the communication
performance of MPI primitives and reduce communication costs in
large-scale clusters has been an active research topic. [TopoIB]
presents efficient topology-aware algorithms for two collective
communication primitives, Scatter and Gather, and proposes a
communication model for analyzing the communication overhead in
large-scale clusters. [Themis] proposes a new scheduling mechanism
and topology-aware algorithm aimed at improving network bandwidth
utilization, and reports that the network bandwidth utilization of a
single AllReduce operation can be increased by a factor of 1.72.
However, when INC is enabled, topology detection algorithms should
not be limited to network characteristics such as bandwidth and
communication overhead, but should also consider the computing and
processing capabilities of the network devices themselves.
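To make the last point concrete, the following sketch compares two
rough cost estimates. The ring estimate is the standard alpha-beta
model (alpha: per-message latency, beta: transfer time per byte).
The INC estimate, its gamma_switch term for per-byte processing time
on a network device, and the function names are simplifying
assumptions made for illustration; they are not taken from [TopoIB]
or [Themis].

```python
def ring_allreduce_time(p: int, n_bytes: float,
                        alpha: float, beta: float) -> float:
    """Alpha-beta estimate for ring AllReduce.

    The ring runs 2 * (p - 1) steps, each moving n_bytes / p per process.
    """
    return 2 * (p - 1) * (alpha + (n_bytes / p) * beta)


def inc_allreduce_time(depth: int, n_bytes: float, alpha: float,
                       beta: float, gamma_switch: float) -> float:
    """Assumed estimate for aggregation in a switch tree of given depth.

    Each host sends its full vector up and receives the result back
    (2 * depth hops, 2 * n_bytes on its link). Every tree level adds
    gamma_switch seconds per byte of aggregation work, so a heavily
    loaded device is modelled by a larger gamma_switch.
    """
    return (2 * depth * alpha
            + 2 * n_bytes * beta
            + depth * n_bytes * gamma_switch)


def choose_algorithm(p: int, depth: int, n_bytes: float, alpha: float,
                     beta: float, gamma_switch: float) -> str:
    """Pick the variant with the lower estimated completion time."""
    ring = ring_allreduce_time(p, n_bytes, alpha, beta)
    inc = inc_allreduce_time(depth, n_bytes, alpha, beta, gamma_switch)
    return "inc" if inc < ring else "ring"
```

Under these assumptions, the preferred algorithm flips from the INC
variant to the ring once the per-byte processing time of the network
devices becomes large enough, which a model based on bandwidth and
latency alone cannot capture.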
Hence, several questions arise:
* How should the subjects of the logical communication topology be
mapped to the subjects of the INC enabled physical network?
* How will enabling INC change the logical network topologies of MPI
primitives, and what challenges will this bring?
* How can the topology of an INC enabled large-scale cluster be
discovered efficiently?
* What are the challenges involved in designing efficient collective
algorithms that are aware of the INC enabled network topology?
4. Requirements
The topology mapping algorithm between logical and physical networks
in INC enabled large-scale clusters, as well as the topology-aware
collective communication algorithms used to enhance cluster
communication, need to meet the following requirements:
* Communication entities in INC enabled large-scale clusters MUST
support mapping not only to computing nodes in the physical network,
but also to network devices in the physical network.
* After INC is introduced, the logical communication topology may
change. An MPI primitive, for example AllReduce, may correspond to
one or more logical topologies that support INC. However, in terms
of computation results, an implementation based on a logical topology
that supports INC MUST be equivalent to the traditional
implementation.
* Topology detection algorithms in INC enabled large-scale clusters
need to consider not only network factors such as communication
overhead and path bandwidth, but also the INC capability and
computational load of network devices (see, for example, SINC
[I-D.lou-rtgwg-sinc]).
* The topology-aware collective communication algorithm SHOULD
consider the network path load as well as the impact of background
traffic on cluster communication performance in INC enabled large-
scale clusters.
* A reasonable evaluation model for INC enabled large-scale clusters
is REQUIRED, taking into account factors such as the connectivity
status and computing capabilities of network devices.
* The topology mapping algorithm and topology detection algorithm
SHOULD support a fallback mechanism that can remap the logical
network to the traditional mode and perform path detection after an
INC failure.
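The device selection and fallback requirements above could be
combined as in the following sketch. It is a minimal illustration
under stated assumptions, not a specified mechanism: the Device
fields, the 0.8 load threshold, and the scoring rule (available
bandwidth scaled by spare INC capacity) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Device:
    """Hypothetical view of a network device after topology detection."""
    name: str
    inc_capable: bool
    healthy: bool
    load: float            # fraction of INC resources in use, 0.0 to 1.0
    path_bandwidth: float  # available bandwidth towards the hosts, Gbit/s


def select_aggregator(devices: Sequence[Device],
                      max_load: float = 0.8) -> Optional[Device]:
    """Return the best usable INC device, or None if there is none."""
    candidates = [d for d in devices
                  if d.inc_capable and d.healthy and d.load < max_load]
    if not candidates:
        return None
    # Assumed score: available bandwidth scaled by spare INC capacity.
    return max(candidates, key=lambda d: d.path_bandwidth * (1.0 - d.load))


def plan_collective(devices: Sequence[Device]) -> Tuple[str, Optional[str]]:
    """Choose INC mode when possible, else fall back to traditional mode."""
    aggregator = select_aggregator(devices)
    if aggregator is None:
        # Fallback: remap the logical network onto hosts only.
        return ("traditional", None)
    return ("inc", aggregator.name)
```

Calling plan_collective again after a device has been marked
unhealthy yields either another INC device or the traditional
mapping, which is the remapping behaviour that the fallback
requirement asks for.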
5. Security Considerations
TBD.
6. IANA Considerations
TBD.
7. Informative References
[I-D.lou-rtgwg-sinc]
Lou, Z., Iannone, L., Li, Y., Zhangcuimin, and K. Yao,
"Signaling In-Network Computing operations (SINC)", Work
in Progress, Internet-Draft, draft-lou-rtgwg-sinc-00, 7
June 2023, <https://datatracker.ietf.org/doc/html/draft-
lou-rtgwg-sinc-00>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[Themis] Rashidi, S., Won, W., Srinivasan, S., et al., "Themis: A
Network Bandwidth-Aware Collective Scheduling Policy for
Distributed Training of DL Models", May 2021,
<https://doi.org/10.48550/arXiv.2110.04478>.
[TopoIB] Kandalla, K. C., Subramoni, H., Vishnu, A., et al.,
"Designing topology-aware collective communication algorithms
for large scale InfiniBand clusters: Case studies with Scatter
and Gather", December 2010,
<https://doi.org/10.1109/IPDPSW.2010.5470853>.
Authors' Addresses
Shiping Xu
China Mobile
Beijing
100053
China
Email: xushiping@chinamobile.com
Kehan Yao
China Mobile
Beijing
100053
China
Email: yaokehan@chinamobile.com