Internet DRAFT - draft-lin-idr-bgp-nof-nlri
draft-lin-idr-bgp-nof-nlri
Network Working Group C. Lin
Internet Draft M. Chen
Intended status: Standards Track H. Li
Expires: May 29, 2024 New H3C Technologies
R. Wang
F. Qin
Q. Zhang
China Mobile
December 1, 2023
Distribution of Device Discovery Information in NVMe Over RoCEv2
Storage Network Using BGP
draft-lin-idr-bgp-nof-nlri-04
Abstract
This document proposes a method of distributing device discovery
information in NVMe over RoCEv2 storage network using the BGP
routing protocol. A new BGP Network Layer Reachability Information
(NLRI) encoding format, named NoF NLRI, is defined.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 29, 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
Lin, et al. Expire May 29, 2024 [Page 1]
Internet-Draft BGP NoF NLRI December 2023
carefully, as they describe your rights and restrictions with
respect to this document. Code Components extracted from this
document must include Simplified BSD License text as described in
Section 4.e of the Trust Legal Provisions and are provided without
warranty as described in the Simplified BSD License.
Table of Contents
1. Introduction...................................................2
1.1. Requirements Language.....................................3
2. Distribution of Device Discovery Information Using BGP.........3
3. BGP Extentions.................................................6
3.1. TLV Format................................................6
3.2. NoF NLRI..................................................7
3.3. Device Discovery NLRI.....................................8
3.3.1. IPv4 Address TLV.....................................9
3.3.2. IPv6 Address TLV.....................................9
3.3.3. Role Type TLV.......................................10
3.3.4. Service Protocol TLV................................10
3.3.5. Device Status TLV...................................11
3.3.6. Status Changing Reason TLV..........................12
3.3.7. More Device Info TLVs...............................13
3.4. Device Zone NLRI.........................................13
3.5. Operations...............................................14
4. Security Considerations.......................................14
5. IANA Considerations...........................................14
6. References....................................................14
6.1. Normative References.....................................14
6.2. Informative References...................................15
Authors' Addresses...............................................16
1. Introduction
As data center networks keep growing, the performance of
communication methods needs to accelerate. At present, NVMe over
RoCEv2 is becoming a popular solution of storage network based on
Ethernet. In such network, a host accesses to an NVMe storage
subsystem via Ethernet Fabric with RoCEv2 protocol.
In the traditional way, the discovery of hosts and storage
subsystems is achieved by manual configurations. However, the manual
way is difficult for management and maintenance. In addition, the
reaction speed is slow when a device goes online or offline, making
it hard to realize hot-plug and failover. To solve these problems,
automatic discovery method should be deployed.
Lin, et al. Expires May 29, 2024 [Page 2]
Internet-Draft BGP NoF NLRI December 2023
When a host or storage subsystem is directly connected to a switch,
the device reports its information to the switch using device
discovery protocol like LLDP. Then, the device discovery information
is distributed to others switches in the fabric. Finally, other
devices get the information from the switches which they directly
connect with.
This document proposes a new method of distributing device discovery
information among switches in NVMe over RoCEv2 storage network using
the BGP routing protocol.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Distribution of Device Discovery Information Using BGP
In hierarchical topology, a host or storage subsystem is usually
connected to a switch at access layer. In Clos topology, a host or
storage subsystem is usually connected to a "Leaf" switch. To keep
terminology uniform, in this document the switches which the hosts
and storage subsystems directed connect with will be referred to as
the access switches.
[ODCC-2020-05016] describes a solution for device discovery in NVMe
over RoCEv2 storage network.
+--------+ +--------+ +--------+ +--------+
| Host |--------| Switch |--------| Switch |--------|Storage |
+--------+ +--------+ +--------+ +--------+
|---------------->| |<----------------|
| LLDP Msg | | LLDP Msg |
| |<---------------->| |
| | Info Sync | |
|<----------------| |---------------->|
| Notification Msg| | Notification Msg|
When any host or storage subsystem is connected with an access
switch, it periodically sends LLDP messages to the access switch,
including its own information and subscription of interested
devices. The access switch receives the LLDP messages and maintains
the states of directly connected devices. If the state of any device
changes, such as going online or offline, the access switch will
notify the other directly connected devices which have subscribed.
Lin, et al. Expires May 29, 2024 [Page 3]
Internet-Draft BGP NoF NLRI December 2023
However, the devices on the other access switches may also be
interested with the device discovery information, especially in a
large-scale storage network. For example, when a storage subsystem
is newly connecting to an access switch, a host located in another
access switch needs to know that it gets online. Then the host will
establish connection with the storage subsystem, and transmit data
through NVMe over RoCEv2. If that storage subsystem failed, that
host, which has NVMe connection with the failed storage subsystem,
needs to be notified as soon as possible. Then the host will quickly
disconnect from the storage subsystem and switch over to the
redundant service. Therefore, the access switches are required to
distribute device discovery information among them. So that the host
can get the required information from the directly connected switch.
[ODCC-2020-05016] specifies the definitions of LLDP messages and
notification messages between access switches and hosts or storage
subsystems, but leaves the information synchronization method
undefined.
In this document the distribution of device discovery information
among access switches is achieved by using BGP. All the access
switches are BGP speakers, and the device discovery information is
exchanged as BGP routes among them.
In order to reduce the number of BGP connections, the application of
BGP Route Reflectors [RFC4456] is recommended. Figure 1 shows an
example of BGP connections with route reflectors. SW 1 and SW 2
serve as reflectors, and SW 3, SW 4, SW 5 and SW 6 are their
clients. When a client sends a BGP route, which contains device
discovery information, to a reflector, the reflector will reflect
the route to the other clients. Therefore, all the access switches
work as clients, and each of them only needs to establish BGP
connections to the reflectors, rather than establishing BGP
connections between each other. In this example, there are two
reflectors, SW 1 and SW 2, which run as a hot standby for each
other. It is also fine to deploy only one reflector in the network.
However, to improve availability, deploying multiple reflectors are
recommended.
Lin, et al. Expires May 29, 2024 [Page 4]
Internet-Draft BGP NoF NLRI December 2023
+---------+ +---------+
| SW 1 | | SW 2 | BGP Reflector
+---------+ +---------+
+-----+ | | | | | | |
| +---|-|-|------------------+ | | |
| | | | | +---------------+ | |
| | | | | | | +-----+
| | | | | | +----+ |
| | | | +----|------------|--------+ |
| | | +------|--------+ | | |
| | +----+ | | | | |
| | | | | | | |
+-------+ +-------+ +-------+ +-------+
| SW 3 | | SW 4 | | SW 5 | | SW 6 | BGP Client
+-------+ +-------+ +-------+ +-------+
* * * * * * * *
* * * * * * * *
H3 SS3 H4 SS4 H5 SS5 H6 SS6
SW: Switch
H: Host
SS: Storage Subsystem
--: BGP Connection
**: Access Link
Figure 1 BGP Connections with Route Reflectors
In Figure 1, the reflector switches are not directly connected with
hosts or storage subsystems, and they are not access switches.
Figure 2 shows another example, in which case two of the access
switches serve as BGP route reflectors. The main difference with
Figure 1 is that the reflectors, SW 1 and SW 2, also need to
establish BGP connections between each other. If any device directly
connected with the reflector goes online or offline, the reflector
not only sends the device discovery information to its clients, but
also sends information to the other reflectors.
Lin, et al. Expires May 29, 2024 [Page 5]
Internet-Draft BGP NoF NLRI December 2023
H1 SS1 H2 SS2
| | | |
| | | |
+---------+ +---------+
| SW 1 |--------------| SW 2 | BGP Reflector
+---------+ +---------+
+-----+ | | | | | | |
| +---|-|-|------------------+ | | |
| | | | | +---------------+ | |
| | | | | | | +-----+
| | | | | | +----+ |
| | | | +----|------------|--------+ |
| | | +------|--------+ | | |
| | +----+ | | | | |
| | | | | | | |
+-------+ +-------+ +-------+ +-------+
| SW 3 | | SW 4 | | SW 5 | | SW 6 | BGP Client
+-------+ +-------+ +-------+ +-------+
* * * * * * * *
* * * * * * * *
H3 SS3 H4 SS4 H5 SS5 H6 SS6
SW: Switch
H: Host
SS: Storage Subsystem
--: BGP Connection
**: Access Link
Figure 2 Access Switches Serve as Reflectors
This document mainly focus on the distribution method of device
discovery information among access switches. The interaction between
access switch and host, or the interaction between access switch and
storage subsystem, is beyond the scope of this document.
3. BGP Extentions
This document describes a mechanism by which device discovery
information can be distributed using the BGP routing protocol
[RFC4271]. This is achieved using a new BGP Network Layer
Reachability Information (NLRI) encoding format, named NoF NLRI.
3.1. TLV Format
Information in the NoF NLRI is encoded in Type/Length/Value
triplets. The TLV format is shown in Figure 3.
Lin, et al. Expires May 29, 2024 [Page 6]
Internet-Draft BGP NoF NLRI December 2023
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
// Value (variable) //
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3: TLV Format
The Length field defines the length of the value portion in octets
(thus, a TLV with no value portion would have a length of zero). The
TLV is not padded to 4-octet alignment. Unrecognized types MUST be
preserved and propagated.
3.2. NoF NLRI
New AFI and SAFI are defined for the NoF NLRI: the NoF AFI/SAFI
(values to be assigned by the IANA).
In order for two BGP speakers to exchange NoF NLRI, they MUST use
BGP Capabilities Advertisement to ensure that they are both capable
of properly processing such NLRI. This is done as specified in
[RFC4760].
The format of the NoF NLRI is shown in the following figure.
+------------------+
| Type | 2 octets
+------------------+
| Length | 2 octets
+------------------+
| NoF NLRI | variable
+------------------+
where:
o Type: the type of NoF NLRI.
o Length: the length of the rest of the NLRI in octets, not
including the Type field or itself.
o NoF NLRI: carrying the device discovery information in NVMe over
Fabric networks.
BGP NoF NLRI for both IPv4 and IPv6 networks can be carried over
either an IPv4 BGP session or an IPv6 BGP session. If an IPv4 BGP
session is used, then the next hop in the MP_REACH_NLRI SHOULD be an
Lin, et al. Expires May 29, 2024 [Page 7]
Internet-Draft BGP NoF NLRI December 2023
IPv4 address. Similarly, if an IPv6 BGP session is used, then the
next hop in the MP_REACH_NLRI SHOULD be an IPv6 address. Usually,
the next hop will be set to the local endpoint address of the BGP
session. The next-hop address MUST be encoded as described in
[RFC4760].
The Device Discovery NLRI and Device Zone NLRI are currently defined
in this document. More types of NLRI will be included in the future
version.
+------+---------------------------+
| Type | NoF NLRI Type |
+------+---------------------------+
| 1 | Device Discovery NLRI |
| 2 | Device Zone NLRI |
+------+---------------------------+
3.3. Device Discovery NLRI
The Device Discovery NLRI is used to carry the discovery information
of directly connected devices. The format of the Device Discovery
NLRI is shown in the following figure.
+------------------+
| Router ID | 4 octets
+------------------+
| Mac Address | 6 octets
+------------------+
| Port Name Length| 2 octets
+------------------+
| Port Name | variable
+------------------+
| Device Info | variable
+------------------+
where:
o Router ID: the Router ID of the access switch which originates
this NLRI, usually the same as the BGP Identifier.
o Mac Address: the Mac Address of a connected device.
o Port Name Length: the length of the following Port Name field in
octets.
o Port Name: the name of the connecting port, to distinguishing
different ports which share the same Mac Address.
Lin, et al. Expires May 29, 2024 [Page 8]
Internet-Draft BGP NoF NLRI December 2023
o Device Info: the specific information of the connected device and
its connecting port, which are identified by the above Mac
Address and Port Name fields.
The Device Discovery NLRI carries the information of a device which
is identified by the Router ID of the access switch and the Mac
Address and Port Name of the connected port.
For the purpose of BGP route key processing, only the Router ID, Mac
Address, Port Name Length, and Port Name fields are considered to be
part of the prefix in the NLRI.
The Device Info field may contain the following TLVs.
3.3.1. IPv4 Address TLV
The format of the IPv4 Address TLV is shown in the following figure.
+------------------+
| Type | 2 octets
+------------------+
| Length | 2 octets
+------------------+
| IPv4 Address | 4 octets
+------------------+
where:
o Type: 1.
o Length: 4.
o IPv4 Address: the IPv4 Address of the connecting port.
3.3.2. IPv6 Address TLV
The format of the IPv6 Address TLV is shown in the following figure.
+------------------+
| Type | 2 octets
+------------------+
| Length | 2 octets
+------------------+
| IPv6 Address | 16 octets
+------------------+
where:
Lin, et al. Expires May 29, 2024 [Page 9]
Internet-Draft BGP NoF NLRI December 2023
o Type: 2.
o Length: 16.
o IPv6 Address: the IPv6 Address of the connecting port.
3.3.3. Role Type TLV
The format of the Role Type TLV is shown in the following figure.
+------------------+
| Type | 2 octets
+------------------+
| Length | 2 octets
+------------------+
| Role Type | 1 octets
+------------------+
where:
o Type: 3.
o Length: 1.
o Role Type: the role of the device. The following values are
defined.
* 1: storage subsystem.
* 2: host.
* 3: the device can serve as both a host and a storage
subsystem.
3.3.4. Service Protocol TLV
The format of the Service Protocol TLV is shown in the following
figure.
Lin, et al. Expires May 29, 2024 [Page 10]
Internet-Draft BGP NoF NLRI December 2023
+-----------------------------+
| Type | 2 octets
+-----------------------------+
| Length | 2 octets
+-----------------------------+
| Protocol Type | 1 octets
+-----------------------------+
| Protocol Version | 2 octets
+-----------------------------+
| Protocol Port | 2 octets
+-----------------------------+
| Protocol Identifier Length | 1 octets
+-----------------------------+
| Protocol Identifier | variable octets
+-----------------------------+
where:
o Type: 4.
o Length: the length of the rest of the TLV in octets.
o Protocol Type: the type of the service protocol. The following
values are defined.
* 0: NVMe over RoCEv2.
o Protocol Version: the version of the service protocol.
o Protocol Port: the port number used by the service protocol. The
value 0 indicates the default or well-known port number.
o Protocol Identifier Length: the length of the following Protocol
Identifier field in octets.
o Protocol Identifier: the device identifier used by the service
protocol.
3.3.5. Device Status TLV
The format of the Device Status TLV is shown in the following
figure.
Lin, et al. Expires May 29, 2024 [Page 11]
Internet-Draft BGP NoF NLRI December 2023
+------------------------+
| Type | 2 octets
+------------------------+
| Length | 2 octets
+------------------------+
| Device Status | 4 octets
+------------------------+
where:
o Type: 5.
o Length: 4.
o Device Status: the current status of the device. The following
values are defined.
* 0: offline.
* 1: online.
3.3.6. Status Changing Reason TLV
The format of the Device Status TLV is shown in the following
figure.
+-------------------------+
| Type | 2 octets
+-------------------------+
| Length | 2 octets
+-------------------------+
| Status Changing Reason | 4 octets
+-------------------------+
where:
o Type: 6.
o Length: 4.
o Status Changing Reason: the reason of the device status changing.
The following values are defined.
* 0: normal.
* 1: link failure.
* 2: PFC storm.
Lin, et al. Expires May 29, 2024 [Page 12]
Internet-Draft BGP NoF NLRI December 2023
* 3: access network failure.
* 4: zone change.
* 5: configuration change.
* 6: lldp age-out.
3.3.7. More Device Info TLVs
More Device Info TLVs will be included in the future version of this
document.
3.4. Device Zone NLRI
In storage networks, hosts and storage subsystems are generally
divided into several zones. Only the devices in the same zone are
allowed to discover and communicate with each other.
The Device Zone NLRI is used to distribute the zone configuration of
a device. The format of the Device Zone NLRI is shown in the
following figure.
+------------------+
| Router ID | 4 octets
+------------------+
| IP Type | 1 octets
+------------------+
| IP Address | 4 or 16 octets
+------------------+
| Zone Name Length| 2 octets
+------------------+
| Zone Name | variable
+------------------+
where:
o Router ID: the Router ID of the access switch which originates
this NLRI, usually the same as the BGP Identifier.
o IP Type: indicating the type of IP Address. The following values
are defined.
* 0: IPv4.
* 1: IPv6.
o IP Address: the IPv4 or IPv6 Address of a connected device.
Lin, et al. Expires May 29, 2024 [Page 13]
Internet-Draft BGP NoF NLRI December 2023
o Zone Name Length: the length of the following Zone Name field in
octets.
o Zone Name: the name of the zone which the connected device
belongs to.
3.5. Operations
The source of the NoF NLRI can be a dedicated module which receive
LLDP messages and maintain the states of directly connected devices.
For the originator of an NoF NLRI route, BGP receives information
from relevant module, encapsulates the information into an NoF NLRI
route, and sends the route to other peers. For the receiver of an
NoF NLRI route, BGP extracts the NoF NLRI from the route and sends
the information to relevant module.
The NoF NLRI field may be treated as an opaque hexadecimal string,
depending on the implementation.
4. Security Considerations
TBD
5. IANA Considerations
TBD
6. References
6.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, DOI
10.17487/RFC2119, March 1997, <https://www.rfc-
editor.org/info/rfc2119>.
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
Border Gateway Protocol 4 (BGP-4)", RFC 4271, DOI
10.17487/RFC4271, January 2006, <https://www.rfc-
editor.org/info/rfc4271>.
[RFC4760] Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
"Multiprotocol Extensions for BGP-4", RFC 4760, DOI
10.17487/RFC4760, January 2007, <https://www.rfc-
editor.org/info/rfc4760>.
Lin, et al. Expires May 29, 2024 [Page 14]
Internet-Draft BGP NoF NLRI December 2023
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
6.2. Informative References
[RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route
Reflection: An Alternative to Full Mesh Internal BGP
(IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006,
<https://www.rfc-editor.org/info/rfc4456>.
[ODCC-2020-05016] Open Data Center Committe, "NVMe over RoCEv2
Network Control Optimization Technical Requirements and
Test Specifications", 2020.
Lin, et al. Expires May 29, 2024 [Page 15]
Internet-Draft BGP NoF NLRI December 2023
Authors' Addresses
Changwang Lin
New H3C Technologies
Email: linchangwang.04414@h3c.com
Mengxiao Chen
New H3C Technologies
Email: chen.mengxiao@h3c.com
Hao Li
New H3C Technologies
Email: lihao@h3c.com
Ruixue Wang
China Mobile
Email: wangruixue@chinamobile.com
Fengwei Qin
China Mobile
Email: qinfengwei@chinamobile.com
Qi Zhang
China Mobile
Email: zhangqiyjy@chinamobile.com
Lin, et al. Expires May 29, 2024 [Page 16]