Network Working Group L. Guo
Internet-Draft CAICT
Intended status: Informational Y. Feng
Expires: 2 September 2024 China Mobile
J. Zhao
China Telecom
F. Qin
China Mobile
L. Zhao
H. Wang
Huawei
W. Quan
Beijing Jiaotong University
H. Huang
Huawei
1 March 2024
Requirement of Fast Fault Detection for IP-based Network
draft-guo-ffd-requirement-02
Abstract
IP-based distributed systems and application-layer software often use
heartbeats to maintain network topology status. However, heartbeat
intervals are typically long, which prolongs fault detection time.
This document describes the requirements for a fast fault detection
solution for IP-based networks.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Guo, et al. Expires 2 September 2024 [Page 1]
Internet-Draft Requirement of FFD for IP-based Network March 2024
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 2 September 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Terminology
3. Use Cases
3.1. IP-based NVMe
3.2. Distributed Storage
3.3. Cluster Computing
4. Requirement
5. Security Considerations
6. IANA Considerations
7. References
7.1. Normative References
7.2. Informative References
Authors' Addresses
1. Introduction
In the face of ever-expanding data, even a powerful single-server
system cannot meet the requirements of data analysis and storage. At
the same time, with the increase of Ethernet bandwidth and scale,
distributed systems that communicate over the network have emerged
and developed rapidly. Heartbeat is a common technique for
maintaining network topology status in distributed systems and
application-layer software. However, if the heartbeat interval is
set too short, transient network congestion may cause a healthy peer
to be misjudged as faulty. If the interval is set too long, fault
detection is slow. In general, the interval has to be balanced
against various conditions.
IP-based NVMe, distributed storage, and cluster computing are used in
core application scenarios, where the requirements on performance and
on limiting the impact of faults on services keep increasing. This
document describes application scenarios and capability requirements
for fast fault detection in scenarios such as IP-based NVMe,
distributed storage, and cluster computing (including artificial
intelligence workloads).
2. Terminology
AI: Artificial Intelligence
FC: Fibre Channel
HPC: High-Performance Computing
NVMe: Non-Volatile Memory Express
IP-based NVMe: NVMe transported over Ethernet using RDMA or TCP
NoF: NVMe over Fabrics
3. Use Cases
3.1. IP-based NVMe
For a long time, key storage applications with high performance
requirements have mainly been built on FC networks. With the
increase of transmission rates, the storage medium has evolved from
HDDs to solid-state storage and the protocol has evolved from SATA to
NVMe. The emergence of NVMe technologies brings new opportunities.
As the NVMe protocol has developed, its application scenario has been
extended from PCIe to other fabrics, solving the problems of NVMe
scalability and transmission distance. Block storage uses NoF in
place of SCSI, reducing the number of protocol interactions between
application hosts and storage systems. End-to-end NVMe greatly
improves performance.
The fabrics used by NoF include Ethernet, Fibre Channel, and
InfiniBand. Comparing FC-NVMe with Ethernet- or InfiniBand-based
alternatives generally comes down to the advantages and
disadvantages of the networking technologies. Fibre Channel fabrics
are noted for their lossless data transmission, predictable and
consistent performance, and reliability. Large enterprises tend to
favor FC storage for mission-critical workloads. But Fibre Channel
requires special equipment and storage networking expertise to
operate and can be more costly than IP-based alternatives. Like FC,
InfiniBand is a lossless network requiring special hardware.
IP-based NVMe storage products tend to be more plentiful than
FC-NVMe-based options. Most storage startups focus on IP-based NVMe.
But unlike an FC fabric, an Ethernet switch does not notify endpoints
of changes in device status. When a device fails, the host has to
rely on the NVMe link heartbeat mechanism and takes tens of seconds
to complete service failover.
+--------------------------------------+
| NVMe Host Software |
+--------------------------------------+
+--------------------------------------+
| Host Side Transport Abstraction |
+--------------------------------------+
/\ /\ /\ /\ /\
/ \ / \ / \ / \ / \
FC IB RoCE iWARP TCP
\ / \ / \ / \ / \ /
\/ \/ \/ \/ \/
+--------------------------------------+
|Controller Side Transport Abstraction |
+--------------------------------------+
+--------------------------------------+
| NVMe SubSystem |
+--------------------------------------+
Figure 1: NVMe SubSystem
This section describes the application scenarios and capability
requirements of IP-based NVMe storage that implements fast fault
detection similar to that of FC.
An NVMe over RDMA or IP-based storage network includes three types of
roles: an initiator (referred to as a host), a switch, and a target
(referred to as a storage device). Initiators and targets are also
referred to as endpoint devices.
+--+ +--+ +--+ +--+
Host |H1| |H2| |H3| |H4|
(Initiator) +/-+ +-,+ +.-+ +/-+
| | '. ,-`| |
| | `', | |
| | ,-` '. | |
+-\--+ +--`-+ +`'--+ +-\--+
| SW | | SW | | SW | | SW |
+--,-+ +---,, +,.--+ +-.--+
`. `'.,` .`
`. _,-'` ``'., .`
IP +--'`+ +`-`-+
Network | SW | | SW |
+--,,+ +,.,-+
.` `'., ,.-`` ',
.` _,-'` `.
+--`-+ +--'`+ `'---+ +-`'-+
| SW | | SW | | SW | | SW |
+-.,-+ +-..-+ +-.,-+ +-_.-+
| '. ,-` | | `., .' |
| `', | | '.` |
| ,-` '. | | ,-` `', |
Storage +-`+ `'\+ +-`+ +`'+
(Target) |S1| |S2| |S3| |S4|
+--+ +--+ +--+ +--+
Figure 2: NVMe over IP-based Network
Hosts and storage devices are connected to the network separately.
To achieve high reliability, each host and each storage device is
connected to both network planes simultaneously. A host can read and
write data once an NVMe connection is established between the host
and the storage device.
When a storage device link fails during operation, the host cannot
detect the fault status of the indirectly connected device at the
transport layer. With IP-based NVMe, the host uses the NVMe
heartbeat to detect the status of the storage device. The heartbeat
message interval is 5s. Therefore, it takes tens of seconds to
determine that the storage device is faulty and to perform service
switchover using the multipath software. This exceeds the failure
tolerance time of core applications. To meet customer experience
and business reliability requirements, fault detection and failover
for IP-based NVMe need to be enhanced.
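The relationship between the heartbeat interval and the time needed
to declare a fault can be illustrated with a short calculation. This
is a sketch only: the 5s interval comes from the text above, while
the number of consecutive missed heartbeats required before a fault
is declared (missed_limit) and the multipath switchover time
(failover_s) are hypothetical parameters chosen for illustration and
are not taken from the NVMe specifications.

```python
def detection_window_s(interval_s: float, missed_limit: int) -> tuple:
    """Return (best, worst) seconds from fault occurrence to detection.

    Assumed model: the peer is declared faulty once `missed_limit`
    consecutive heartbeats fail to arrive. A fault just before the next
    expected heartbeat is detected soonest; a fault just after a
    heartbeat was sent is detected latest.
    """
    if interval_s <= 0 or missed_limit < 1:
        raise ValueError("interval_s must be > 0 and missed_limit >= 1")
    best = (missed_limit - 1) * interval_s
    worst = missed_limit * interval_s
    return best, worst


def worst_case_outage_s(interval_s: float, missed_limit: int,
                        failover_s: float) -> float:
    """Worst-case detection time plus the (assumed) switchover time."""
    return detection_window_s(interval_s, missed_limit)[1] + failover_s


# 5s heartbeat interval from the text; the other values are assumptions.
best, worst = detection_window_s(interval_s=5.0, missed_limit=3)
print(best, worst)                                  # 10.0 15.0
print(worst_case_outage_s(5.0, 3, failover_s=5.0))  # 20.0
```

Under these assumed values the outage already reaches tens of
seconds, and shortening the interval to compensate increases the
risk of misjudgment under congestion.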
The storage system has an active-active solution. A proposal in
which the second active path is used to convey fault information and
drive switchover at the source node is in progress in NVMe. However,
this only addresses local link faults of the storage node and cannot
solve the problem of network faults that have not yet converged. In
storage application deployment scenarios, independent dual-plane
networking may be used. In such a deployment, a device in a single
plane may fail, in which case network convergence cannot be
performed completely.
This document proposes a fast fault detection solution in which
switches participate. The scheme uses the ability of switches to
detect faults quickly at the physical layer and link layer, allows
the switches to synchronize the detected fault information across the
IP network, and then has them notify the endpoint devices of the
fault status.
Fault detection procedure: The host learns the fault status of the
storage device and quickly switches to the standby path.
1. If a storage fault occurs, the access switch detects the fault at
the network layer or link layer.
2. The switch synchronizes the status to other switches on the
network.
3. The switches notify the hosts of the storage fault.
4. The host quickly tears down its connection to the storage device
and triggers the multipathing software to switch services to the
redundant path. The fault should be detected within 1s.
+----+ +-------+ +-------+ +-------+
|Host| |Switch | |Switch | |Storage|
+----+ +-------+ +-------+ +-------+
| | |-+ |
| | |1| |
| | |-+ |
| |<----2------| |
| | | |
|<----3-------| | |
| | | |
|<----4-------|------------|-----------> |
| | | |
Figure 3: Switches interact with hosts and storage devices
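The four steps of the procedure can be sketched as a small in-memory
simulation. This is an illustrative model under stated assumptions,
not a protocol definition: the class names (Host, Switch), the method
names, and the plane labels are all hypothetical, message passing is
replaced by direct method calls, and failover is simplified to
swapping the active and standby paths.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    active_path: str = "plane-A"     # hypothetical plane labels
    standby_path: str = "plane-B"
    failed_targets: set = field(default_factory=set)

    def on_fault_notification(self, storage: str) -> None:
        # Step 4: disconnect from the faulty target and let the
        # (simulated) multipath layer switch to the redundant path.
        if storage in self.failed_targets:
            return  # duplicate notification, nothing more to do
        self.failed_targets.add(storage)
        self.active_path, self.standby_path = (
            self.standby_path, self.active_path)


@dataclass
class Switch:
    name: str
    peers: list = field(default_factory=list)   # neighbouring switches
    hosts: list = field(default_factory=list)   # locally attached hosts
    known_faults: set = field(default_factory=set)

    def detect_local_fault(self, storage: str) -> None:
        # Step 1: the access switch sees the fault on its own port.
        self.learn_fault(storage)

    def learn_fault(self, storage: str) -> None:
        if storage in self.known_faults:
            return  # already handled; prevents synchronization loops
        self.known_faults.add(storage)
        for peer in self.peers:      # Step 2: synchronize to switches
            peer.learn_fault(storage)
        for host in self.hosts:      # Step 3: notify attached hosts
            host.on_fault_notification(storage)


h1 = Host("H1")
storage_side = Switch("SW-storage")
host_side = Switch("SW-host", hosts=[h1])
storage_side.peers.append(host_side)
host_side.peers.append(storage_side)

storage_side.detect_local_fault("S1")
print(h1.active_path)  # plane-B
```

Recording already-known faults on each switch is one simple way to
keep the synchronization in step 2 from looping between peers; a real
design may use a different mechanism.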
3.2. Distributed Storage
Distributed storage cluster devices are interconnected through a
network (the back-end IP network) to form a cluster. When a node or
one of its links fails, other nodes in the storage cluster cannot
detect the fault status of the indirectly connected device at the
transport layer. Over IP, management or master nodes in a storage
cluster use heartbeats to detect the status of storage nodes. It
takes 10 seconds or more to determine that a storage node is faulty
and to switch services to another, healthy storage node. Services
cannot be accessed during this period. To meet customer experience
and service reliability requirements, fault detection and failover
of IP-based cluster nodes need to be enhanced.
Storage +--+ +--+ +--+ +--+
cluster |S1| |S2| |S3| |S4|
+--+ +--+ +--+ +--+
| '. ,-` |
| .`',_ |
| _ ..--` `'--.._ |
+-\--+ +-\--+
| SW | | SW |
+--,-+_ _+-.--+
`. `'--..._ _ .. -- '`_.`
`. _,-'` -._ .`
BACK Storage +--'`+ +`-`-+
IP Network | SW | | SW |
+----+ +----+
Figure 4: Distributed storage
The fast fault detection solution in this document can be used in
this scenario. It takes advantage of the switch's ability to
quickly detect faults at the physical layer and link layer, and
allows the switch to synchronize the detected fault information
across the IP network. The switch then notifies the storage cluster
management node or master node of the fault status.
Fault detection procedure:
1. If a storage fault occurs, the access switch detects the fault at
the network layer or link layer.
2. The switch synchronizes the status to other switches on the
network.
3. The switch notifies the storage fault information to the storage
management or master node. The fault should be detected within
1s.
+------+ +-------+ +-------+ +-------+
|master| |Switch | |Switch | |Storage|
+------+ +-------+ +-------+ +-------+
| | |-+ |
| | |1| |
| | |-+ |
| |<----2------| |
| | | |
|<----3---------| | |
| | | |
Figure 5: Switches interact with controller
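One way a management or master node might combine the switch
notification with its existing heartbeats is sketched below. This is
a minimal illustration under assumptions: the Membership class, its
method names, and the 10s timeout value are hypothetical, time is
passed in explicitly to keep the example deterministic, and treating
a later heartbeat as proof of recovery is a design assumption rather
than something this document specifies.

```python
class Membership:
    """Node status table fed by heartbeats and switch notifications."""

    def __init__(self, nodes, heartbeat_timeout_s, now=0.0):
        self.timeout = heartbeat_timeout_s
        self.last_seen = {node: now for node in nodes}
        self.notified_down = set()

    def on_heartbeat(self, node, now):
        self.last_seen[node] = now
        # Assumption: a fresh heartbeat means the node is reachable again.
        self.notified_down.discard(node)

    def on_switch_notification(self, node):
        # Fast path: the access switch reported the node as faulty.
        self.notified_down.add(node)

    def is_up(self, node, now):
        if node in self.notified_down:
            return False
        # Slow path: fall back to the heartbeat timeout.
        return now - self.last_seen[node] < self.timeout


table = Membership(["S1", "S2"], heartbeat_timeout_s=10.0)
table.on_heartbeat("S1", now=1.0)
table.on_heartbeat("S2", now=1.0)

# S2 fails shortly after t=1.0; heartbeats alone do not show it yet.
print(table.is_up("S2", now=2.0))   # True
table.on_switch_notification("S2")
print(table.is_up("S2", now=2.0))   # False
```

Keeping the heartbeat timeout as a fallback means the master still
detects faults, only more slowly, if a notification is lost.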
3.3. Cluster Computing
In cluster computing scenarios, for example HPC and AI cluster
applications, faults and failures may occur on any node at any time.
For a high-performance computing task, once a fault occurs, the
entire task needs to be rescheduled. However, it takes several
minutes for the management node to detect the fault status of a
node. During this period, new jobs may be scheduled onto the faulty
node, causing task execution to fail.
The fast fault detection solution in this document can be used in
this scenario, so that the fault can be detected within seconds.
+-----------------+ +-------+ +-------+ +----------+
| Management/ | |Switch | |Switch | | Computer |
| Scheduling node | | | | | | node |
+-----------------+ +-------+ +-------+ +----------+
| | |-+ |
| | |1| |
| | |-+ |
| |<----2------| |
| | | |
|<----3--------------------| | |
| | | |
Figure 6: Switches interact with HPC cluster
The fault detection procedure is similar to that of distributed
storage, as shown in Figure 6.
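The effect of a fault notification on a management or scheduling
node can be sketched as follows. This is a simplified illustration,
not a description of any particular scheduler: the Scheduler class,
its methods, and the node and job names are hypothetical, and
placement is reduced to taking idle nodes in list order.

```python
class Scheduler:
    """Toy job scheduler that reacts to fault notifications."""

    def __init__(self, nodes):
        self.free = list(nodes)   # healthy, idle nodes in placement order
        self.faulty = set()
        self.jobs = {}            # job name -> nodes it runs on

    def submit(self, job, count):
        if count > len(self.free):
            raise RuntimeError("not enough healthy idle nodes")
        self.jobs[job], self.free = self.free[:count], self.free[count:]
        return self.jobs[job]

    def on_fault_notification(self, node):
        """Stop placing work on `node`; return jobs to reschedule."""
        self.faulty.add(node)
        if node in self.free:
            self.free.remove(node)
        affected = [job for job, nodes in self.jobs.items() if node in nodes]
        for job in affected:
            # The whole task is rescheduled; its surviving nodes
            # become idle again.
            nodes = self.jobs.pop(job)
            self.free.extend(n for n in nodes if n not in self.faulty)
        return affected


sched = Scheduler(["n1", "n2", "n3", "n4"])
print(sched.submit("train", 2))             # ['n1', 'n2']
print(sched.on_fault_notification("n2"))    # ['train']
print(sched.submit("train", 2))             # ['n3', 'n4']
```

Because the faulty node is removed from the idle pool as soon as the
notification arrives, new jobs are not placed on it during the
minutes a heartbeat-only scheme would need to notice the fault.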
4. Requirement
In distributed Ethernet systems and cross-network connection
scenarios, the following requirements are raised to accelerate
failover:
1. A network device can detect link or network failure.
2. A network device can synchronize the failure to other network
devices.
3. A network device can notify local/remote failure information to
local access endpoints.
4. A network device sends a notification to an endpoint when it
detects, or is notified of, any failure to which that endpoint
has subscribed.
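Requirements 3 and 4 imply that an access device keeps track of which
local endpoints are interested in which failures. A minimal sketch
of such a subscription table is shown below. The AccessDevice class,
the subscribe and on_failure methods, and the endpoint names are
hypothetical; this document does not define how subscriptions are
expressed or carried.

```python
from collections import defaultdict


class AccessDevice:
    """Access network device with per-endpoint failure subscriptions."""

    def __init__(self):
        # monitored endpoint -> local endpoints that subscribed to it
        self.subscribers = defaultdict(set)
        self.sent = []   # (local endpoint, failed endpoint) notifications

    def subscribe(self, local_endpoint, monitored):
        self.subscribers[monitored].add(local_endpoint)

    def on_failure(self, failed):
        """Handle a failure detected locally or synchronized from a peer.

        Only local endpoints that subscribed to `failed` are notified.
        """
        targets = sorted(self.subscribers.get(failed, ()))
        for endpoint in targets:
            self.sent.append((endpoint, failed))
        return targets


dev = AccessDevice()
dev.subscribe("H1", "S1")
dev.subscribe("H2", "S1")
dev.subscribe("H2", "S3")

print(dev.on_failure("S1"))   # ['H1', 'H2']
print(dev.on_failure("S2"))   # []
```

Restricting notifications to subscribers keeps the number of
messages proportional to the endpoints that actually depend on the
failed device.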
5. Security Considerations
The functions described in these requirements are mainly intended for
limited networks. They need to be deployed by the operator, who
controls their scope of use.
This requirement involves network devices sending notification
messages to endpoint devices, which requires the cooperation of the
endpoint devices. To limit the range of notification messages, it is
recommended that network devices use L2 messages to implement the
notification function, so that the notification messages are confined
to the access range of the access nodes and are not flooded. In
addition, given the scope of this function, notification messages
should only be generated by access network devices and should not be
forwarded by network devices, so network devices also need to control
how they receive and publish these messages.
Synchronization messages between network devices are exchanged over
sessions between the devices. Encryption and authentication can be
applied to these sessions, which is already a mature technology.
6. IANA Considerations
This document has no IANA actions.
7. References
7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
7.2. Informative References
Authors' Addresses
Liang Guo
CAICT
No.52, Hua Yuan Bei Road, Haidian District,
Beijing
100191
China
Email: guoliang1@caict.ac.cn
Yi Feng
China Mobile
12 Chegongzhuang Street, Xicheng District
Beijing
China
Email: fengyiit@chinamobile.com
Jizhuang Zhao
China Telecom
South District of Future Science and Technology, Changping District
Beijing
China
Email: zhaojzh@chinatelecom.cn
Fengwei Qin
China Mobile
12 Chegongzhuang Street, Xicheng District
Beijing
China
Email: qinfengwei@chinamobile.com
Lily Zhao
Huawei
No. 3 Shangdi Information Road, Haidian District
Beijing
China
Email: Lily.zhao@huawei.com
Haibo Wang
Huawei
No. 156 Beiqing Road
Beijing
P.R. China
Email: rainsword.wang@huawei.com
Wei Quan
Beijing Jiaotong University
3 Shangyuan Cun, Haidian District
Beijing
P.R. China
Email: weiquan@bjtu.edu.cn
Hongyi Huang
Huawei
No. 156 Beiqing Road
Beijing
100095
P.R. China
Email: hongyi.huang@huawei.com