Internet DRAFT - draft-faibish-nfsv4-scsi-nvme-layout

Network File System Version 4                                 S. Faibish
Internet-Draft                                                  D. Black
Intended status: Informational                                  Dell EMC
Expires: January 6, 2021                                      C. Hellwig
                                                            July 6, 2020
Using the Parallel NFS (pNFS) SCSI/NVMe Layout
draft-faibish-nfsv4-scsi-nvme-layout-00
Abstract
This document explains how to use the Parallel Network File System
(pNFS) SCSI layout with NVMe storage devices accessed over
transports that use the NVMe over Fabrics protocols. This draft
picks up the previous SCSI-over-NVMe draft by C. Hellwig and extends
it to support all the transport protocols defined by NVMe over
Fabrics, in addition to the SCSI transport protocol introduced in
the pNFS SCSI Layout. The supported transports include Fibre
Channel, TCP, and the RDMA transports.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of Internet-Draft Shadow Directories can be accessed at
https://www.ietf.org/standards/ids/internet-draft-mirror-sites/.
This Internet-Draft will expire on January 6, 2021.
Copyright Notice
Copyright (c) 2020 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1. Conventions Used in This Document . . . . . . . . . . . . . 2
1.2. General Definitions . . . . . . . . . . . . . . . . . . . . 2
2. SCSI Layout mapping to NVMe . . . . . . . . . . . . . . . . . . 3
2.1. Volume Identification . . . . . . . . . . . . . . . . . . . 7
2.2. Client Fencing . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1. Reservation Key Generation . . . . . . . . . . . . . . 8
2.2.2. MDS Registration and Reservation . . . . . . . . . . . 8
2.2.3. Client Registration . . . . . . . . . . . . . . . . . . 8
2.2.4. Fencing Action . . . . . . . . . . . . . . . . . . . . 8
2.2.5. Client Recovery after a Fence Action . . . . . . . . . 9
2.3. Volatile write caches . . . . . . . . . . . . . . . . . . . 10
3. Security Considerations . . . . . . . . . . . . . . . . . . . . 10
4. IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 10
5. Normative References . . . . . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 11
1. Introduction
The pNFS Small Computer System Interface (SCSI) layout [RFC8154] is
a layout type that allows NFS clients to directly perform I/O to
block storage devices while bypassing the metadata server (MDS). It
is specified using concepts from the SCSI protocol family for the
data path to the storage devices. This document explains how to
access PCI Express, RDMA, or Fibre Channel devices through the NVM
Express protocol [NVME] while using the SCSI layout. This document
does not amend the pNFS SCSI layout document in any way; instead, it
explains how to map the SCSI constructs used in the pNFS SCSI layout
document to NVMe concepts using the NVMe SCSI translation reference.
1.1. Conventions Used in This Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
1.2. General Definitions
The following definitions are provided for the purpose of providing
an appropriate context for the reader.
Client The "client" is the entity that accesses the NFS server's
resources. The client may be an application that contains the logic
to access the NFS server directly. The client may also be the
traditional operating system client that provides remote file
system services for a set of applications.
Server/Controller The "server" is the entity responsible for
coordinating client access to a set of file systems and is
identified by a server owner.
2. SCSI Layout mapping to NVMe
The SCSI layout definition [RFC8154] references only a few
SCSI-specific concepts directly.
NVM Express [NVME] Base Specification revision 1.4 and prior
revisions define a register-level interface for host software to
communicate with a non-volatile memory subsystem over PCI Express
(NVMe over PCIe). The NVMe over Fabrics specification [NVMEoF]
defines extensions to NVMe that enable operation over other
interconnects (NVMe over Fabrics). The NVM Express Base
Specification revision 1.4 is referred to as the NVMe Base
specification.
The goal for this draft is to enable an implementer who is
familiar with the pNFS SCSI layout (RFC 8154) and the NVMe
standards (both NVMe-oF 1.1 and NVMe 1.4) to implement the
pNFS SCSI layout over NVMe-oF. The mapping of extensions defined
in this document refers to a specific NVMe Transport defined in
an NVMe Transport binding specification. This document refers to
the NVMe Transport binding specifications for FC, RDMA, and TCP
[RFC7525].
The NVMe Transport binding specification for Fibre
Channel is defined in INCITS 540 Fibre Channel - Non-Volatile
Memory Express [FC-NVMe].
NVMe over Fabrics has the following differences from the NVMe Base
specification used with SCSI:
- There is a one-to-one mapping between I/O Submission Queues and
I/O Completion Queues. NVMe over Fabrics does not support multiple
I/O Submission Queues being mapped to a single I/O Completion
Queue;
- NVMe over Fabrics does not define an interrupt mechanism that
allows a controller to generate a host interrupt. It is the
responsibility of the host fabric interface (e.g., Host Bus
Adapter) to generate host interrupts;
- NVMe over Fabrics does not use the Create I/O Completion Queue,
Create I/O Submission Queue, Delete I/O Completion Queue, and
Delete I/O Submission Queue commands. NVMe over Fabrics does
not use the Admin Submission Queue Base Address (ASQ), Admin
Completion Queue Base Address (ACQ), and Admin Queue Attributes
(AQA) properties (i.e., registers in PCI Express). Queues are
created using the Connect command;
- NVMe over Fabrics uses the Disconnect command to delete an I/O
Submission Queue and corresponding I/O Completion Queue;
- Metadata, if supported, shall be transferred as a contiguous part
of the logical block. NVMe over Fabrics does not support
transferring metadata from a separate buffer;
- NVMe over Fabrics does not support PRPs but requires use of SGLs
for Admin, I/O, and Fabrics commands. This differs from NVMe over
PCIe where SGLs are not supported for Admin commands and are
optional for I/O commands;
- NVMe over Fabrics does not support Completion Queue flow control.
This requires that the host ensures there are available Completion
Queue slots before submitting new commands; and
- NVMe over Fabrics allows Submission Queue flow control to be
disabled if the host and controller agree to disable it. If
Submission Queue flow control is disabled, the host is required
to ensure that there are available Submission Queue slots before
submitting new commands.
NVMe over Fabrics requires the underlying NVMe Transport to provide
reliable NVMe command and data delivery. An NVMe Transport is an
abstract protocol layer independent of any physical interconnect
properties. An NVMe Transport may expose a memory model, a message
model, or a combination of the two. A memory model is one in which
commands, responses and data are transferred between fabric nodes
by performing explicit memory read and write operations while a
message model is one in which only messages containing command
capsules, response capsules, and data are sent between fabric nodes.
The only memory model NVMe Transport supported by NVMe [NVME] is
PCI Express, as defined in the NVMe Base specification. While
differences exist between NVMe over Fabrics and NVMe over PCIe
implementations, both implement the same architecture and command
sets. The SCSI-to-NVMe mapping in this document, however, uses only
NVMe over Fabrics (the message model), not the memory model. NVMe
over Fabrics utilizes the
protocol layering shown in Figure 1. The native fabric
communication services and the Fabric Protocol and Physical Fabric
layers in Figure 1 are outside the scope of this specification.
+-------------------+
|  pNFS host SCSI   |
| layout over NVMe  |
+---------+---------+
          |
          v
+---------+---------+
| NVMe over Fabrics |
+---------+---------+
          |
          v
+---------+---------+
| Transport Binding |
+---------+---------+
          |
          v
+---------+----------+
| NVMe Transport svc |
+---------+----------+
          |
          v
+---------+---------+
|  NVMe Transport   |
+---------+---------+
          |
          v
+---------+----------+
|  Fabric Protocol   |
+---------+----------+
          |
          v
+---------+---------+
|  Physical Fabric  |
+---------+---------+
          |
          v
+------------------------+
|    pNFS SCSI layout    |
| server/NVMe controller |
+------------------------+
Figure 1: pNFS SCSI over NVMe over Fabrics Layering
An NVM subsystem port may support multiple NVMe Transports if more
than one NVMe Transport binding specification exists for the
underlying fabric (e.g., an NVM subsystem port identified by a
Port ID may support both iWARP and RoCE). This draft also defines
an NVMe binding implementation that uses the RDMA Transport type.
The RDMA Transport is RDMA Provider agnostic. The diagram in
Figure 2 illustrates the layering of the RDMA Transport and common
RDMA providers (iWARP, InfiniBand, and RoCE) within the host and
NVM subsystem.
+--------------------------------------+
|              NVMe Host               |
+--------------------------------------+
|            RDMA Transport            |
+------------+------------+------------+
|   iWARP    | InfiniBand |    RoCE    |
+------------+-----++-----+------------+
                   ||
                   ||  RDMA Fabric
                   vv
+------------+-----++-----+------------+
|   iWARP    | InfiniBand |    RoCE    |
+------------+------------+------------+
|            RDMA Transport            |
+--------------------------------------+
|            NVM Subsystem             |
+--------------------------------------+
Figure 2: RDMA Transport Protocol Layers
NVMe over Fabrics allows multiple hosts to connect to different
controllers in the NVM subsystem through the same port. All other
aspects of NVMe over Fabrics multi-path I/O and namespace sharing
are equivalent to those defined in the NVMe Base specification.
An association is established between a host and a controller when
the host connects to a controller's Admin Queue using the Fabrics
Connect command. Within the Connect command, the host specifies
the Host NQN, NVM Subsystem NQN, Host Identifier, and may request a
specific Controller ID or may request a connection to any available
controller. The host is the pNFS client and the controller is the
NFSv4 server. The pNFS clients connect to the server using
different network protocols and different transports, excluding a
direct PCIe connection. While an association exists between a host
and
a controller, only that host may establish connections with I/O
Queues of that controller.
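As a non-normative illustration of how a pNFS client might establish
such an association, the sketch below writes connection parameters to
the Linux kernel's nvme-fabrics control device, which then issues the
Fabrics Connect command (carrying the Host NQN, Subsystem NQN, and
Host Identifier) on the host's behalf. The transport, addresses,
NQNs, and the function name are assumptions of this example, not
values defined by this draft; a real client would obtain them from
GETDEVICEINFO.

   /*
    * Hedged sketch: establish an NVMe over Fabrics association from a
    * pNFS client on Linux.  The kernel's nvme-fabrics module builds
    * and sends the Fabrics Connect command from the parameter string
    * written here.  All addresses and NQNs are hypothetical examples.
    */
   #include <fcntl.h>
   #include <string.h>
   #include <unistd.h>

   int connect_nvmeof_controller(void)
   {
       const char *args =
           "transport=rdma,"                   /* or "fc" / "tcp"      */
           "traddr=192.0.2.10,trsvcid=4420,"   /* fabric address/port  */
           "nqn=nqn.2020-07.org.example:pnfs-volume,"
           "hostnqn=nqn.2020-07.org.example:pnfs-client1";

       int fd = open("/dev/nvme-fabrics", O_RDWR);
       if (fd < 0)
           return -1;
       /* On success the kernel creates a /dev/nvmeX controller node. */
       ssize_t n = write(fd, args, strlen(args));
       close(fd);
       return (n < 0) ? -1 : 0;
   }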
NVMe over Fabrics supports both fabric secure channel and NVMe
in-band authentication. An NVM subsystem may require a host to
use fabric secure channel, NVMe in-band authentication, or both.
The Discovery Service indicates if fabric secure channel shall be
used for an NVM subsystem. The Connect response indicates if NVMe
in-band authentication shall be used with that controller. For
SCSI over NVMe over Fabrics, only the in-band authentication model
is used, as the fabric secure channel is only supported with the
PCIe transport memory model, which is not supported by the SCSI
layout protocol.
2.1. Volume Identification
The pNFS SCSI layout uses the Device Identification VPD page (page
code 0x83) from [SPC4] to identify the devices used by a layout.
There are several ways to build SCSI Device Identification
descriptors from NVMe Identify data included in the Controller
Identify Attributes specific to NVMe over Fabrics specified in the
Identify Controller fields in Section 4.1 of [NVMEoF]. This
document uses a subset of this information to identify LUs backing
pNFS SCSI layouts.
To be used as storage devices for the pNFS SCSI layout, NVMe
devices MUST support the EUI-64 [RFC8154] value in the Identify
Namespace data; the methods based on the Serial Number of legacy
devices might not be suitable for unique addressing needs and thus
MUST NOT be used. UUID identification can be added by using a large
enough enum value to avoid conflict with whatever T10 might do in a
future version of the SCSI [SBC3] standard (the underlying SCSI
field in SPC is 4 bits, so an enum value of 32 MUST be used in this
draft). For NVMe, these identifiers need to be obtained via the
Namespace Identification Descriptors in NVMe 1.4 (returned by the
Identify command with the CNS field set to 03h).
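For illustration only, the sketch below retrieves the Namespace
Identification Descriptor list (Identify, CNS 03h) through the Linux
NVMe passthrough ioctl and scans it for the IEEE EUI-64 descriptor
(NIDT 01h) required above. The function name, device path handling,
and buffer size are assumptions of this example.

   /*
    * Hedged sketch: fetch the Namespace Identification Descriptor
    * list (Identify, CNS=03h) and extract the EUI-64 identifier.
    * Uses the Linux NVMe passthrough interface.
    */
   #include <fcntl.h>
   #include <stdint.h>
   #include <string.h>
   #include <unistd.h>
   #include <sys/ioctl.h>
   #include <linux/nvme_ioctl.h>

   int read_eui64(const char *dev, uint32_t nsid, uint8_t eui64[8])
   {
       uint8_t list[4096];                  /* 4 KiB descriptor list  */
       struct nvme_admin_cmd cmd = {
           .opcode   = 0x06,                /* Identify               */
           .nsid     = nsid,
           .addr     = (uint64_t)(uintptr_t)list,
           .data_len = sizeof(list),
           .cdw10    = 0x03,                /* CNS 03h: NS Id Descr.  */
       };
       int fd = open(dev, O_RDONLY);
       if (fd < 0)
           return -1;
       int rc = ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
       close(fd);
       if (rc != 0)
           return -1;

       /* Each descriptor: NIDT(1) NIDL(1) reserved(2) NID(NIDL).     */
       size_t off = 0;
       while (off + 4 <= sizeof(list)) {
           uint8_t nidt = list[off];
           uint8_t nidl = list[off + 1];
           if (nidl == 0 || off + 4 + nidl > sizeof(list))
               break;                       /* end of list            */
           if (nidt == 0x01 && nidl == 8) { /* NIDT 01h: IEEE EUI-64  */
               memcpy(eui64, &list[off + 4], 8);
               return 0;
           }
           off += 4 + nidl;
       }
       return -1;                           /* no EUI-64 descriptor   */
   }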
2.2. Client Fencing
The SCSI layout uses Persistent Reservations to provide client
fencing. For this, both the MDS and the clients have to register a
key with the storage device, and the MDS has to create a reservation
on the storage device. The pNFS SCSI protocol implements fencing
using persistent reservations (PRs), similar to the fencing method
used by existing shared-disk file systems. To allow fencing of
individual systems, each system MUST use a unique persistent
reservation key. The following subsections give a full mapping of
the required PERSISTENT RESERVE IN and PERSISTENT RESERVE OUT SCSI
commands to NVMe commands, which MUST be used when NVMe devices
serve as storage devices for the pNFS SCSI layout.
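The subsections below also illustrate each mapping, non-normatively,
as it might be issued through the Linux NVMe passthrough interface.
Those sketches share the small helper shown here; the 16-byte data
payload carries the two 8-byte key fields defined by [NVME] (CRKEY
first, then NRKEY or PRKEY depending on the command). The helper
name, the example namespace ID, and the assumption of a
little-endian host are illustrative, not part of the protocol.

   /*
    * Hedged sketch: generic helper for issuing NVMe reservation
    * commands on a namespace block device via the Linux passthrough
    * ioctl.  key0/key1 fill the 16-byte data payload (CRKEY first,
    * then NRKEY or PRKEY, per [NVME]).  Assumes a little-endian host.
    */
   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <linux/nvme_ioctl.h>

   static int nvme_resv_cmd(int fd, uint8_t opcode, uint32_t cdw10,
                            uint64_t key0, uint64_t key1)
   {
       uint64_t keys[2] = { key0, key1 };
       struct nvme_passthru_cmd cmd = {
           .opcode   = opcode,              /* 0Dh, 11h, 15h, ...     */
           .nsid     = 1,                   /* example namespace ID   */
           .addr     = (uint64_t)(uintptr_t)keys,
           .data_len = sizeof(keys),
           .cdw10    = cdw10,
       };
       return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
   }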
2.2.1. Reservation Key Generation
Prior to establishing a reservation on a namespace, a host shall
become a registrant of that namespace by registering a reservation
key. This reservation key may be used by the host as a means of
identifying the registrant (host), authenticating the registrant,
and preempting a failed or uncooperative registrant. This document
assigns the burden to generate unique keys to the MDS, which MUST
generate a key for itself before exporting a volume and a key for
each client that accesses SCSI layout volumes.
One important difference between SCSI Persistent Reservations and
NVMe Reservations is that NVMe reservation keys always apply to all
controllers used by a host (as indicated by the NVMe Host
Identifier). This behavior is somewhat similar to setting the
ALL_TG_PT bit when registering a SCSI reservation key, but is
actually guaranteed to work reliably.
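This draft does not mandate any particular key format. Purely as an
illustration, an MDS could guarantee uniqueness by deriving each
64-bit reservation key from an MDS instance identifier and a
per-client index that it already manages; the layout below is an
assumption of the example.

   /*
    * Hedged sketch: one way an MDS could generate unique 64-bit
    * reservation keys.  The high 32 bits identify the MDS instance,
    * the low 32 bits carry a per-client index (index 0 being the
    * MDS's own key).  This key layout is illustrative only.
    */
   #include <stdint.h>

   static uint64_t make_prkey(uint32_t mds_instance_id,
                              uint32_t client_index)
   {
       return ((uint64_t)mds_instance_id << 32) | client_index;
   }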
2.2.2. MDS Registration and Reservation
Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the
MDS needs to prepare the volume for fencing using NVMe Reservations.
Registering a reservation key with a namespace creates an
association between a host and a namespace. A host that is a
registrant of a namespace may use any controller with which that
host is associated (i.e., that has the same Host Identifier; refer
to [NVME] section 5.21.1.26) to access that namespace as a
registrant.
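Reusing the hypothetical nvme_resv_cmd() helper sketched in Section
2.2, the MDS-side preparation might look like the following; RTYPE
4h (Exclusive Access - Registrants Only) corresponds to the SCSI
type 8h reservation used by the pNFS SCSI layout. The function name
and file descriptor handling are assumptions of the example.

   /* Hedged sketch: MDS preparation of a namespace for fencing.      */
   int mds_prepare_fencing(int fd, uint64_t mds_key)
   {
       /* Reservation Register (0Dh), RREGA=000b, NRKEY = mds_key.    */
       if (nvme_resv_cmd(fd, 0x0d, 0x0, 0, mds_key) != 0)
           return -1;
       /* Reservation Acquire (11h), RACQA=000b (Acquire), RTYPE=4h   */
       /* (Exclusive Access - Registrants Only) in CDW10 bits 15:08,  */
       /* CRKEY = mds_key.                                            */
       return nvme_resv_cmd(fd, 0x11, 0x4 << 8, mds_key, 0);
   }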
2.2.3. Client Registration
2.2.3.1 SCSI client
Before performing the first I/O to a device returned from a
GETDEVICEINFO operation, the client will register the
reservation key returned by the MDS with the storage device
by issuing a "PERSISTENT RESERVE OUT" command with a service
action of REGISTER with the "SERVICE ACTION RESERVATION KEY" set
to the reservation key.
2.2.3.2 NVMe Client
A client registers a reservation key by executing a Reservation
Register command (refer to [NVME] section 6.11) on the namespace
with the Reservation Register Action (RREGA) field cleared to
000b (i.e., Register Reservation Key) and supplying a reservation
key in the New Reservation Key (NRKEY) field. A client that is a
registrant of a namespace may register the same reservation key
value multiple times with the namespace on the same or different
controllers. There are no restrictions on the reservation key
value used by hosts with different Host Identifiers.
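A sketch of this registration, again reusing the hypothetical
nvme_resv_cmd() helper from Section 2.2, might be:

   /* Hedged sketch: client registration with the key returned by the */
   /* MDS via GETDEVICEINFO.  Reservation Register (0Dh), RREGA=000b  */
   /* (Register Reservation Key), NRKEY = client_key.                 */
   int client_register_key(int fd, uint64_t client_key)
   {
       return nvme_resv_cmd(fd, 0x0d, 0x0, 0, client_key);
   }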
2.2.4. Fencing Action
2.2.4.1 SCSI client
In case of a non-responding client, the MDS fences the client by
issuing a "PERSISTENT RESERVE OUT" command with the service action
set to "PREEMPT" or "PREEMPT AND ABORT", the "RESERVATION KEY" field
set to the server's reservation key, the service action "RESERVATION
KEY" field set to the reservation key associated with the non-
responding client, and the "TYPE" field set to 8h (Exclusive Access
- Registrants Only).
2.2.4.2 NVMe Client
A host that is a registrant may preempt a reservation and/or
registration by executing a Reservation Acquire command (refer to
[NVME] section 6.10), setting the Reservation Acquire Action (RACQA)
field to 001b (Preempt), and supplying the current reservation key
associated with the host in the Current Reservation Key (CRKEY)
field. The CRKEY value shall match that used by the registrant to
register with the namespace. If the CRKEY value does not match,
then the command is aborted with status Reservation Conflict.
If the PRKEY field value does not match that of the current
reservation holder and is equal to 0h, then the command is aborted
with status Invalid Field in Command. A reservation preempted
notification occurs on all controllers in the NVM subsystem that
are associated with hosts that have their registrations removed as
a result of the preempt, except those associated with the host that
issued the Reservation Acquire (Preempt) command. After the MDS
preempts a client, all client I/O to the LU fails.
The client SHOULD at this point return any layout that refers to
the device ID that points to the LU.
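A corresponding MDS-side sketch of the preempt, using the
hypothetical nvme_resv_cmd() helper from Section 2.2, could be:

   /* Hedged sketch: the MDS fences a non-responding client with      */
   /* Reservation Acquire (11h), RACQA=001b (Preempt), RTYPE=4h,      */
   /* CRKEY = the MDS key, PRKEY = the key of the fenced client.      */
   int mds_fence_client(int fd, uint64_t mds_key, uint64_t client_key)
   {
       uint32_t cdw10 = 0x1 | (0x4 << 8);   /* Preempt, RTYPE=4h      */
       return nvme_resv_cmd(fd, 0x11, cdw10, mds_key, client_key);
   }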
2.2.5. Client Recovery after a Fence Action
A client that detects an NVMe Reservation Conflict status (I/O
error) on the storage device MUST commit all layouts that use the
storage device through the MDS, return all outstanding layouts for
the device, forget the device ID, and unregister the reservation
key.
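The final step, unregistering the key, might be sketched as follows
with the same hypothetical helper from Section 2.2 (the other
recovery steps are ordinary pNFS operations through the MDS):

   /* Hedged sketch: after fencing is detected, the client drops its  */
   /* registration.  Reservation Register (0Dh), RREGA=001b           */
   /* (Unregister Reservation Key), CRKEY = the client's own key.     */
   int client_unregister_key(int fd, uint64_t client_key)
   {
       return nvme_resv_cmd(fd, 0x0d, 0x1, client_key, 0);
   }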
Future GETDEVICEINFO calls MAY refer to the storage device again,
in which case the client will perform a new registration based on
the key provided. If a reservation holder attempts to obtain a
reservation of a different type on a namespace for which that host
already is the reservation holder, then the command is aborted with
status Reservation Conflict. It is not an error if a reservation
holder attempts to obtain a reservation of the same type on a
namespace for which that host already is the reservation holder.
NVMe over Fabrics [NVMEoF] utilizes the same controller
architecture as that defined in the NVMe Base specification [NVME].
This includes using Submission and Completion Queues to execute
commands between a host and a controller. Section 8.20 of the
[NVME] Base specification describes the relationship between a
controller (MDS) and a namespace associated with the Clients. In
the static controller model used by the SCSI layout, controllers
that may be allocated to a particular Client may have different
state at the time the association is established.
2.3. Volatile write caches
The Volatile Write Cache Enable (WCE) bit (i.e., bit 00) of the
Volatile Write Cache feature (Feature Identifier 06h) is the Write
Cache Enable field returned by the NVMe Get Features command; see
section 5.21.1.6 of [NVME]. If a volatile write cache is enabled on
an NVMe device used as a storage device for the pNFS SCSI layout,
the MDS MUST use the NVMe Flush command to flush the volatile write
cache. If there is no volatile write cache on the server, attempts
to access this NVMe Feature cause errors: a Get Features command
specifying the Volatile Write Cache feature identifier is expected
to fail with Invalid Field in Command status.
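As a non-normative illustration, an MDS might check the feature and
flush the cache as sketched below, again using the Linux NVMe
passthrough interface; the function name and file descriptors are
assumptions of the example.

   /*
    * Hedged sketch: query the Volatile Write Cache feature (Get
    * Features, FID=06h) and, if WCE (bit 0 of the result) is set,
    * issue an NVMe Flush (opcode 00h) for the namespace.
    */
   #include <stdint.h>
   #include <sys/ioctl.h>
   #include <linux/nvme_ioctl.h>

   int flush_if_write_cache_enabled(int ctrl_fd, int ns_fd,
                                    uint32_t nsid)
   {
       struct nvme_admin_cmd getf = {
           .opcode = 0x0a,                  /* Get Features           */
           .cdw10  = 0x06,                  /* FID 06h: Volatile WC   */
       };
       if (ioctl(ctrl_fd, NVME_IOCTL_ADMIN_CMD, &getf) != 0)
           return -1;                       /* e.g., feature invalid  */
       if ((getf.result & 0x1) == 0)        /* WCE clear: no cache    */
           return 0;

       struct nvme_passthru_cmd flush = {
           .opcode = 0x00,                  /* Flush                  */
           .nsid   = nsid,
       };
       return ioctl(ns_fd, NVME_IOCTL_IO_CMD, &flush);
   }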
3. Security Considerations
Since no protocol changes are proposed here, no new security
considerations apply. The protocol does, however, assume that NVMe
Authentication commands are implemented in the NVMe Security
Protocol, as the format of the data to be transferred depends on
the Security Protocol. Authentication Receive/Response commands
return the appropriate data corresponding to an Authentication Send
command as defined by the rules of the Security Protocol. As the
current draft only supports the NVMe over Fabrics in-band protocol,
the authentication requirements for security commands are based on
the security protocol indicated by the SECP field in the command
and DO NOT require authentication when used for NVMe in-band
authentication. When used for other purposes, in-band
authentication of the commands is required.
4. IANA Considerations
The document does not require any actions by IANA.
5. Normative References
[NVME] NVM Express, Inc., "NVM Express Revision 1.4", June 10, 2019.
[NVMEoF] NVM Express, Inc., "NVM Express over Fabrics Revision 1.1",
July 26, 2019.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC8154] Hellwig, C., "Parallel NFS (pNFS) Small Computer System
Interface (SCSI) Layout", RFC 8154, May 2017.
[SBC3] INCITS Technical Committee T10, "SCSI Block Commands-3",
ANSI INCITS 514-2014, ISO/IEC 14776-323, 2014.
[SPC4] INCITS Technical Committee T10, "SCSI Primary Commands-4",
ANSI INCITS 513-2015, 2015.
[FC-NVMe] INCITS Technical Committee T11, "Fibre Channel -
Non-Volatile Memory Express", ANSI INCITS 540, 2018.
[RFC7525] Sheffer, Y., Holz, R., and P. Saint-Andre,
"Recommendations for Secure Use of Transport Layer Security (TLS)
and Datagram Transport Layer Security (DTLS)", BCP 195, RFC 7525,
May 2015.
Authors' Addresses
Sorin Faibish
Dell EMC
228 South Street
Hopkinton, MA 01774
United States of America
Phone: +1 508-249-5745
Email: faibish.sorin@dell.com
David Black
Dell EMC
176 South Street
Hopkinton, MA 01748
United States of America
Phone: +1 774-350-9323
Email: david.black@dell.com
Christoph Hellwig
Email: hch@lst.de