Internet DRAFT - draft-hellwig-nfsv4-rdma-layout
NFSv4 C. Hellwig
Internet-Draft July 02, 2017
Intended status: Standards Track
Expires: January 3, 2018
Parallel NFS (pNFS) RDMA Layout
draft-hellwig-nfsv4-rdma-layout-00.txt
Abstract
The Parallel Network File System (pNFS) allows a file's metadata
(held on a metadata server) to be separated from its data (held on a
storage device). The RDMA Layout Type is defined in this document
as an extension to pNFS to allow the use of RDMA Verbs operations to
access remote storage, with a special focus on accessing byte
addressable persistent memory.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 3, 2018.
Copyright Notice
Copyright (c) 2017 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
Hellwig Expires January 3, 2018 [Page 1]
Internet-Draft pNFS RDMA Layout July 2017
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
  1.1. Conventions Used in This Document
  1.2. General Definitions
  1.3. Code Components Licensing Notice
  1.4. XDR Description
2. RDMA Layout Description
  2.1. Background and Architecture
  2.2. layouttype4
  2.3. Device Addressing and Discovery
    2.3.1. pnfs_rdma_device_addr4
  2.4. Data Structures: Extents and Extent Lists
    2.4.1. Layout Requests and Extent Lists
    2.4.2. Layout Commits
    2.4.3. Layout Returns
    2.4.4. Layout Revocation
    2.4.5. Client Copy-on-Write Processing
    2.4.6. Extents are Permissions
    2.4.7. End-of-file Processing
    2.4.8. Layout Hints
  2.5. Crash Recovery Issues
  2.6. Transient and Permanent Errors
3. Security Considerations
4. IANA Considerations
5. References
  5.1. Normative References
  5.2. Informative References
Appendix A. RFC Editor Notes
Author's Address
1. Introduction
Figure 1 shows the overall architecture of a Parallel NFS (pNFS)
system:
+-----------+
|+-----------+                                 +-----------+
||+-----------+                                |           |
|||           |       NFSv4.1 + pNFS           |           |
+||  Clients  |<------------------------------>|   Server  |
 +|           |                                |           |
  +-----------+                                |           |
       |||                                     +-----------+
       |||                                           |
       |||                                           |
       ||| Storage        +-----------+              |
       ||| Protocol       |+-----------+             |
       ||+----------------||+-----------+  Control   |
       |+-----------------|||           |    Protocol|
       +------------------+||  Storage  |------------+
                           +|  Systems  |
                            +-----------+
Figure 1
The overall approach is that pNFS-enhanced clients obtain sufficient
information from the server to enable them to access the underlying
storage (on the storage systems) directly. See Section 12 of
[RFC5661] for more details. RDMA ([RFC5040] [RFC5041] [IBARCH]) is a
technique for moving data efficiently between end nodes. By
directing data into destination buffers as it is sent on a network,
and placing it via direct memory access by hardware, the benefits of
faster transfers and reduced host overhead are obtained. Unlike the
RPC-over-RDMA transport [RFC8166], the pNFS RDMA layout does not
transfer remote procedure calls over RDMA networks, but instead uses
raw RDMA READ and WRITE operations to access a memory region exposed
on a storage device.
1.1. Conventions Used in This Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
1.2. General Definitions
The following definitions provide context for the reader.
Byte This document defines a byte as an octet, i.e., a datum exactly
8 bits in length.
Client The "client" is the entity that accesses the NFS server's
resources. The client may be an application that contains the
logic to access the NFS server directly. The client may also be
the traditional operating system client that provides remote file
system services for a set of applications.
Server The "server" is the entity responsible for coordinating
client access to a set of file systems and is identified by a
server owner.
metadata server (MDS) The metadata server is a pNFS server which
provides metadata information for a file system object. It also
is responsible for generating layouts for file system objects.
Note that the MDS is also responsible for directory-based
operations.
1.3. Code Components Licensing Notice
The external data representation (XDR) description and scripts for
extracting the XDR description are Code Components as described in
Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL].
These Code Components are licensed according to the terms of
Section 4 of "Legal Provisions Relating to IETF Documents".
1.4. XDR Description
This document contains the XDR [RFC4506] description of the NFSv4.1
RDMA layout protocol. The XDR description is embedded in this
document in a way that makes it simple for the reader to extract into
a ready-to-compile form. The reader can feed this document into the
following shell script to produce the machine readable XDR
description of the NFSv4.1 RDMA layout:
#!/bin/sh
grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
That is, if the above script is stored in a file called "extract.sh",
and this document is in a file called "spec.txt", then the reader can
do:
sh extract.sh < spec.txt > rdma_prot.x
The effect of the script is to remove leading white space from each
line, plus a sentinel sequence of "///".
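For readers without a POSIX shell at hand, the same extraction can be
approximated in Python. This is a minimal sketch rather than part of
the specification: the function name extract_xdr is hypothetical, and
the regular expression is an attempt to mirror the grep and the two
sed expressions in the script above.

```python
import re

# Optional leading spaces, the "///" sentinel, then either one space or
# the end of the line (the two cases handled by the two sed commands).
_SENTINEL = re.compile(r"^ *///(?: |$)")


def extract_xdr(text):
    """Return only the sentinel-marked lines, with the sentinel removed."""
    out = []
    for line in text.splitlines():
        if re.match(r"^ *///", line):      # same filter as the grep
            out.append(_SENTINEL.sub("", line, count=1))
    return "\n".join(out) + ("\n" if out else "")
```

Under these assumptions, passing the text of this document to
extract_xdr() would yield the same lines the shell script writes to
rdma_prot.x.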
The embedded XDR file header follows. Subsequent XDR descriptions,
with the sentinel sequence, are embedded throughout the document.
Note that the XDR code contained in this document depends on types
from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both NFS
types that end with a 4, such as offset4, length4, etc., as well as
more generic types such as uint32_t and uint64_t.
/// /*
/// * This code was derived from RFCTBD10
/// * Please reproduce this note if possible.
/// */
/// /*
/// * Copyright (c) 2010,2015 IETF Trust and the persons
/// * identified as the document authors. All rights reserved.
/// *
/// * Redistribution and use in source and binary forms, with
/// * or without modification, are permitted provided that the
/// * following conditions are met:
/// *
/// * - Redistributions of source code must retain the above
/// * copyright notice, this list of conditions and the
/// * following disclaimer.
/// *
/// * - Redistributions in binary form must reproduce the above
/// * copyright notice, this list of conditions and the
/// * following disclaimer in the documentation and/or other
/// * materials provided with the distribution.
/// *
/// * - Neither the name of Internet Society, IETF or IETF
/// * Trust, nor the names of specific contributors, may be
/// * used to endorse or promote products derived from this
/// * software without specific prior written permission.
/// *
/// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
/// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
/// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
/// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
/// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
/// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
/// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
/// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
/// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
/// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
/// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
/// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
/// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
/// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
/// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
/// */
///
/// /*
/// * nfs4_rdma_layout_prot.x
/// */
///
/// %#include "nfsv41.h"
///
2. RDMA Layout Description
2.1. Background and Architecture
A pNFS RDMA layout is responsible for mapping from an NFS file (or
portion of a file) to memory regions that contain the file. These
regions are expressed as extents with 64-bit offsets and lengths
using the existing NFSv4 offset4 and length4 types, and map to memory
regions that the server has registered and for which it exposes a
handle (R_Key or STag) that allows RDMA READ and RDMA WRITE
operations from the client.
The pNFS operation for requesting a layout (LAYOUTGET) includes the
"layoutiomode4 loga_iomode" argument, which indicates whether the
requested layout is for read-only use or read-write use. A read-only
layout may contain holes that are read as zero, whereas a read-write
layout will contain allocated, but un-initialized storage in those
holes (read as zero, can be written by client). This document also
supports client participation in copy-on-write (e.g., for file
systems with snapshots) by providing both read-only and un-
initialized storage for the same extent in a layout. Reads are
initially performed on the read-only storage, with writes going to
the un-initialized storage. After the first write that initializes
the un-initialized storage, all reads are performed to that now-
initialized writable storage, and the corresponding read-only storage
is no longer used.
2.2. layouttype4
The layouttype4 enumeration defined in [RFC5662] is extended with a
new value as follows:
enum layouttype4 {
LAYOUT4_NFSV4_1_FILES = 1,
LAYOUT4_OSD2_OBJECTS = 2,
LAYOUT4_BLOCK_VOLUME = 3,
LAYOUT4_SCSI = 4,
LAYOUT4_RDMA = 0x80000006
[[RFC Editor: please modify the LAYOUT4_RDMA
to be the layouttype assigned by IANA]]
};
This document defines the structure associated with the layouttype4
value LAYOUT4_RDMA. [RFC5661] specifies the loc_body structure as an
XDR type "opaque". The opaque layout is uninterpreted by the generic
pNFS client layers, but must be interpreted by the Layout Type
implementation.
2.3. Device Addressing and Discovery
Data operations to a storage device require the client to know the
network address of the storage device. The NFSv4.1+ GETDEVICEINFO
operation (Section 18.40 of [RFC5661]) is used by the client to
retrieve that information.
2.3.1. pnfs_rdma_device_addr4
The "pnfs_rdma_device_addr4" data structure is returned by the server
as the storage-protocol-specific opaque field da_addr_body in the
"device_addr4" structure by a successful GETDEVICEINFO operation
[RFC5661]. It contains the network address of the storage device.
The RDMA Connection Manager (RDMA/CM) shall be used to establish the
queue pair for the RDMA READ and RDMA WRITE operations used by the
layout. Details of connection establishment will be provided in
future versions of this document.
/// struct pnfs_rdma_device_addr4 {
/// struct netaddr4 addr; /* address of the device */
/// };
///
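As an illustration of how a client might turn the netaddr4 value into
something a connection manager can use, the sketch below decodes the
universal address string carried in the na_r_addr field, assuming the
usual NFSv4.1 convention in which the last two dot-separated decimal
octets encode the port. The function name parse_uaddr is
hypothetical, and this draft does not yet define connection
establishment, so this is only a plausible reading.

```python
from typing import Tuple


def parse_uaddr(na_r_addr: str) -> Tuple[str, int]:
    """Split a universal address "host.p1.p2" into (host, port)."""
    host, p1, p2 = na_r_addr.rsplit(".", 2)
    hi, lo = int(p1), int(p2)
    if not (0 <= hi <= 255 and 0 <= lo <= 255):
        raise ValueError("port octets must be in the range 0..255")
    # The port is transmitted as two octets, most significant first.
    return host, hi * 256 + lo
```

For example, parse_uaddr("192.0.2.10.8.1") yields the host
"192.0.2.10" and port 2049 under this convention.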
2.4. Data Structures: Extents and Extent Lists
A pNFS RDMA layout is a list of extents within a flat array of data
in a device. The RDMA layout describes the individual byte ranges
(extents) on the device that make up the file. The offsets and
lengths contained in an extent are specified in units of bytes.
/// enum pnfs_rdma_extent_state4 {
///     PNFS_RDMA_READ_WRITE_DATA = 0, /* the data located by
///                                       this extent is valid
///                                       for reading and
///                                       writing. */
///     PNFS_RDMA_READ_DATA       = 1, /* the data located by this
///                                       extent is valid for
///                                       reading only; it may not
///                                       be written. */
///     PNFS_RDMA_INVALID_DATA    = 2, /* the location is valid; the
///                                       data is invalid. It is a
///                                       newly (pre-) allocated
///                                       extent. The client MUST
///                                       NOT read from this
///                                       space */
///     PNFS_RDMA_NONE_DATA       = 3  /* the location is invalid.
///                                       It is a hole in the file.
///                                       The client MUST NOT read
///                                       from or write to this
///                                       space */
/// };
///
/// struct pnfs_rdma_extent4 {
///     deviceid4    re_device_id;     /* id of the device on
///                                       which extent of file is
///                                       stored. */
///     offset4      re_file_offset;   /* starting byte offset
///                                       in the file */
///     uint32_t     re_handle;        /* registered memory
///                                       handle */
///     length4      re_length;        /* size in bytes of the
///                                       extent */
///     offset4      re_storage_offset;/* starting byte offset
///                                       in the volume */
///     pnfs_rdma_extent_state4 re_state;
///                                    /* state of this extent */
/// };
///
/// /* RDMA layout-specific type for loc_body */
/// struct pnfs_rdma_layout4 {
///     pnfs_rdma_extent4 rl_extents<>;
///                                    /* extents which make up this
///                                       layout. */
/// };
///
///
The RDMA layout consists of a list of extents that map the regions of
the file to locations on a device. The "re_storage_offset" field
within each extent identifies a location on the device specified by
the "re_device_id" field in the extent.
Each extent maps a region of the file onto a portion of the specified
device. The re_file_offset, re_length, and re_state fields for an
extent returned from the server are valid for all extents. In
contrast, the interpretation of the re_storage_offset field depends
on the value of re_state as follows (in increasing order):
PNFS_RDMA_READ_WRITE_DATA means that re_storage_offset is valid, and
points to valid/initialized data that can be read and written.
PNFS_RDMA_READ_DATA means that re_storage_offset is valid and points
to valid/initialized data that can only be read. Write operations
are prohibited.
PNFS_RDMA_INVALID_DATA means that re_storage_offset is valid, but
points to invalid un-initialized data. This data MUST NOT be read
from the device until it has been initialized. A read request for
a PNFS_RDMA_INVALID_DATA extent MUST fill the user buffer with
zeros, unless the extent is covered by a PNFS_RDMA_READ_DATA
extent of a copy-on-write file system. Write requests MUST write
whole server-sized blocks to the device; bytes not initialized by
the user MUST be set to zero. Any write to parts of a device
covered by a PNFS_RDMA_INVALID_DATA extent changes the written
portion of the extent to PNFS_RDMA_READ_WRITE_DATA; the pNFS
client is responsible for reporting this change via LAYOUTCOMMIT.
PNFS_RDMA_NONE_DATA means that re_storage_offset is not valid, and
this extent MUST NOT be used to satisfy write requests. Read
requests MAY be satisfied by zero-filling as for
PNFS_RDMA_INVALID_DATA. PNFS_RDMA_NONE_DATA extents MAY be
returned by requests for readable extents; they are never returned
if the request was for a writable extent.
An extent list contains all relevant extents in increasing order of
the re_file_offset of each extent; any ties are broken by increasing
order of the extent state (re_state).
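The extent structure and the list ordering rule can be modelled
compactly. The following is an illustrative sketch, not a reference
implementation: the class and function names are hypothetical, the
fields mirror pnfs_rdma_extent4 without the "re_" prefix, and the
linear mapping from file position to device offset is an assumption
that the text above implies but does not state normatively.

```python
from dataclasses import dataclass
from enum import IntEnum


class ExtentState(IntEnum):
    # Values mirror pnfs_rdma_extent_state4.
    READ_WRITE_DATA = 0
    READ_DATA = 1
    INVALID_DATA = 2
    NONE_DATA = 3


@dataclass(frozen=True)
class Extent:
    device_id: bytes
    file_offset: int
    handle: int
    length: int
    storage_offset: int
    state: ExtentState


def sort_extents(extents):
    """Order by file offset, breaking ties by increasing extent state."""
    return sorted(extents, key=lambda e: (e.file_offset, e.state))


def device_offset(extent, file_pos):
    """Map a file byte position to a byte offset on the device.

    Returns None for NONE_DATA, whose storage offset is not valid.
    """
    if not extent.file_offset <= file_pos < extent.file_offset + extent.length:
        raise ValueError("position is outside this extent")
    if extent.state == ExtentState.NONE_DATA:
        return None
    # Assumed linear mapping within the extent.
    return extent.storage_offset + (file_pos - extent.file_offset)
```

Because the states are integers, sorting on the pair (file offset,
state) places a PNFS_RDMA_READ_DATA extent ahead of a
PNFS_RDMA_INVALID_DATA extent that starts at the same file offset.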
2.4.1. Layout Requests and Extent Lists
Each request for a layout specifies at least three parameters: file
offset, desired size, and minimum size. If the status of a request
indicates success, the extent list returned MUST meet the following
criteria:
o A request for a readable (but not writable) layout MUST return
either PNFS_RDMA_READ_DATA or PNFS_RDMA_NONE_DATA extents. It
SHALL NOT return PNFS_RDMA_INVALID_DATA or
PNFS_RDMA_READ_WRITE_DATA extents.
o A request for a writable layout MUST return
PNFS_RDMA_READ_WRITE_DATA or PNFS_RDMA_INVALID_DATA extents, and
it MAY return additional PNFS_RDMA_READ_DATA extents for ranges
covered by PNFS_RDMA_INVALID_DATA extents to allow client side
copy-on-write operations. A request for a writable layout SHALL
NOT return PNFS_RDMA_NONE_DATA extents.
o The first extent in the list MUST contain the requested starting
offset.
o The total size of extents within the requested range MUST cover at
least the minimum size. One exception is allowed: the total size
MAY be smaller if only readable extents were requested and EOF is
encountered.
o Extents in the extent list MUST be logically contiguous for a
read-only layout. For a read-write layout, the set of writable
extents (i.e., excluding PNFS_RDMA_READ_DATA extents) MUST be
logically contiguous. Every PNFS_RDMA_READ_DATA extent in a read-
write layout MUST be covered by one or more PNFS_RDMA_INVALID_DATA
extents. This overlap of PNFS_RDMA_READ_DATA and
PNFS_RDMA_INVALID_DATA extents is the only permitted extent
overlap.
o Extents MUST be ordered in the list by starting offset, with
PNFS_RDMA_READ_DATA extents preceding PNFS_RDMA_INVALID_DATA
extents in the case of equal re_file_offsets.
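A client or a test harness could check a returned extent list against
the criteria above along the following lines. This is a sketch under
stated assumptions: the names Ext, check_layout, and hit_eof are
hypothetical, extents are reduced to the three fields the rules depend
on, and "the first extent" of a writable layout is interpreted as the
first extent that is not a PNFS_RDMA_READ_DATA overlay.

```python
from collections import namedtuple

# Simplified extent: only the fields the list rules depend on.
Ext = namedtuple("Ext", "file_offset length state")
READ_WRITE, READ, INVALID, NONE = 0, 1, 2, 3  # pnfs_rdma_extent_state4


def _covers(extents, start, end):
    """True if the union of `extents` covers the byte range [start, end)."""
    pos = start
    for e in sorted(extents):
        if e.file_offset > pos:
            break
        pos = max(pos, e.file_offset + e.length)
    return pos >= end


def check_layout(extents, offset, minlength, writable, hit_eof=False):
    """Return a list of rule violations; an empty list means none found."""
    problems = []
    allowed = {READ_WRITE, INVALID, READ} if writable else {READ, NONE}
    if any(e.state not in allowed for e in extents):
        problems.append("state not allowed for the requested iomode")
    keys = [(e.file_offset, e.state) for e in extents]
    if keys != sorted(keys):
        problems.append("not ordered by offset, then state")
    # In a writable layout, READ_DATA extents overlap INVALID_DATA ones,
    # so contiguity and size are judged on the remaining extents.
    main = [e for e in extents if not (writable and e.state == READ)]
    if not main or not (main[0].file_offset <= offset
                        < main[0].file_offset + main[0].length):
        problems.append("first extent does not contain the offset")
    if any(a.file_offset + a.length != b.file_offset
           for a, b in zip(main, main[1:])):
        problems.append("extents are not logically contiguous")
    short_ok = hit_eof and not writable   # the one permitted exception
    if main and not short_ok:
        end = main[-1].file_offset + main[-1].length
        if end - offset < minlength:
            problems.append("minimum size not covered")
    if writable:
        invalid = [e for e in extents if e.state == INVALID]
        for e in extents:
            if e.state == READ and not _covers(
                    invalid, e.file_offset, e.file_offset + e.length):
                problems.append("READ_DATA extent not covered")
    return problems
```

The function reports problems rather than raising, so that a caller
can log every violated rule for a single LAYOUTGET reply.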
The server shall ensure that it has registered handles for the memory
regions that the extents in the layout refer to so that RDMA READ
and/or RDMA WRITE requests can be performed by the client. Multiple
extents may refer to the same handle. The handle shall be
invalidated on a LAYOUTRETURN operation, including implicit layout
returns as part of CB_LAYOUTRECALL operations, or when a layout is
revoked.
According to [RFC5661], if the minimum requested size,
loga_minlength, is zero, this is an indication to the metadata server
that the client desires any layout at offset loga_offset or less that
the metadata server has "readily available". Given the lack of a
clear definition of this phrase, in the context of the RDMA layout
type, when loga_minlength is zero, the metadata server SHOULD:
o when processing requests for readable layouts, return all such,
even if some extents are in the PNFS_RDMA_NONE_DATA state.
o when processing requests for writable layouts, return extents
which can be returned in the PNFS_RDMA_READ_WRITE_DATA state.
2.4.2. Layout Commits
///
/// /* RDMA layout-specific type for lou_body */
///
/// struct pnfs_rdma_range4 {
/// offset4 rr_file_offset; /* starting byte offset
/// in the file */
/// length4 rr_length; /* size in bytes */
/// };
///
/// struct pnfs_rdma_layoutupdate4 {
/// pnfs_rdma_range4 rlu_commit_list<>;
/// /* list of extents which
/// * now contain valid data.
/// */
/// };
The "pnfs_rdma_layoutupdate4" structure is used by the client as the
RDMA layout-specific argument in a LAYOUTCOMMIT operation. The
"rlu_commit_list" field is a list covering regions of the file layout
that were previously in the PNFS_RDMA_INVALID_DATA state, but have
been written by the client and SHOULD now be considered in the
PNFS_RDMA_READ_WRITE_DATA state. The extents in the commit list MUST
be disjoint and MUST be sorted by rr_file_offset. Implementors
should be aware that a server MAY be unable to commit regions at a
granularity smaller than a file-system block (typically 4 KB or 8
KB). The block size that the server uses is available as an NFSv4
attribute, and any extents included in the
"rlu_commit_list" MUST be aligned to this granularity and have a size
that is a multiple of this granularity. Since the block in question
is in state PNFS_RDMA_INVALID_DATA, byte ranges not written SHOULD be
filled with zeros. This applies even if it appears that the area
being written is beyond what the client believes to be the end of
file.
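The alignment, ordering, and disjointness requirements on the commit
list can be illustrated with a small helper. This is a sketch only:
build_commit_list is a hypothetical name, the block size is assumed to
have been obtained from the server beforehand, and rounding a range
outward is valid only if the client has already zero-filled (or
merged) the unwritten bytes of each affected block as described above.

```python
def build_commit_list(written, block_size):
    """Turn written (offset, length) byte ranges into commit ranges.

    Returns sorted, disjoint, block-aligned (rr_file_offset, rr_length)
    tuples suitable for rlu_commit_list.
    """
    aligned = []
    for off, length in written:
        if length <= 0:
            continue
        start = (off // block_size) * block_size               # round down
        end = -(-(off + length) // block_size) * block_size    # round up
        aligned.append((start, end))
    aligned.sort()
    merged = []
    for start, end in aligned:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous range.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]
```

Merging adjacent ranges is a design choice made here for brevity; a
list of separate but adjacent ranges would also be sorted and
disjoint.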
2.4.3. Layout Returns
A LAYOUTRETURN operation represents an explicit release of resources
by the client. This MAY be done in response to a CB_LAYOUTRECALL or
before any recall, in order to avoid a future CB_LAYOUTRECALL. When
the LAYOUTRETURN operation specifies a LAYOUTRETURN4_FILE return
type, then the layoutreturn_file4 data structure specifies the region
of the file layout that is no longer needed by the client.
The LAYOUTRETURN operation is done without any RDMA layout-specific
data. The opaque "lrf_body" field of the "layoutreturn_file4" data
structure MUST have length zero.
2.4.4. Layout Revocation
Layouts MAY be unilaterally revoked by the server, due to the
client's lease time expiring, or the client failing to return, in a
timely manner, a layout which has been recalled. For the RDMA layout
type, this is accomplished by invalidating the handle for the remote
memory region exposed to the client. Once the invalidation has
completed, the host channel adapter (HCA) will reject all access from
the client to the memory region.
2.4.5. Client Copy-on-Write Processing
Copy-on-write is a mechanism used to support file and/or file system
snapshots. When writing to unaligned regions, or to regions smaller
than a file system block, the writer MUST copy the portions of the
original file data to a new location on disk. This behavior can
either be implemented on the client or the server. The paragraphs
below describe how a pNFS RDMA layout client implements access to a
file that requires copy-on-write semantics.
Distinguishing the PNFS_RDMA_READ_WRITE_DATA and PNFS_RDMA_READ_DATA
extent types in combination with the allowed overlap of
PNFS_RDMA_READ_DATA extents with PNFS_RDMA_INVALID_DATA extents
allows copy-on-write processing to be done by pNFS clients. In
classic NFS, this operation would be done by the server. Since pNFS
enables clients to do direct block access, it is useful for clients
to participate in copy-on-write operations. All pNFS RDMA layout
clients MUST support this copy-on-write processing.
When a client wishes to write data covered by a PNFS_RDMA_READ_DATA
extent, it MUST have requested a writable layout from the server;
that layout will contain PNFS_RDMA_INVALID_DATA extents to cover all
the data ranges of that layout's PNFS_RDMA_READ_DATA extents. More
precisely, for any re_file_offset range covered by one or more
PNFS_RDMA_READ_DATA extents in a writable layout, the server MUST
include one or more PNFS_RDMA_INVALID_DATA extents in the layout that
cover the same re_file_offset range. When performing a write to such
an area of a layout, the client MUST effectively copy the data from
the PNFS_RDMA_READ_DATA extent for any partial blocks of
re_file_offset and range, merge in the changes to be written, and
write the result to the PNFS_RDMA_INVALID_DATA extent for the blocks
for that re_file_offset and range. That is, if entire blocks of data
are to be overwritten by an operation, the corresponding
PNFS_RDMA_READ_DATA blocks need not be fetched, but any partial-
block writes MUST be merged with data fetched via PNFS_RDMA_READ_DATA
extents before storing the result via PNFS_RDMA_INVALID_DATA extents.
For the purposes of this discussion, "entire blocks" and "partial
blocks" refer to the server's file-system block size. Storing of
data in a PNFS_RDMA_INVALID_DATA extent converts the written portion
of the PNFS_RDMA_INVALID_DATA extent to a PNFS_RDMA_READ_WRITE_DATA
extent; all subsequent reads MUST be performed from this extent; the
corresponding portion of the PNFS_RDMA_READ_DATA extent MUST NOT be
used after storing data in a PNFS_RDMA_INVALID_DATA extent. If a
client writes only a portion of an extent, the extent MAY be split at
block aligned boundaries.
When a client wishes to write data to a PNFS_RDMA_INVALID_DATA extent
that is not covered by a PNFS_RDMA_READ_DATA extent, it MUST treat
this write identically to a write to a file not involved with copy-
on-write semantics. Thus, data MUST be written in at least block-
sized increments, aligned to multiples of block-sized offsets, and
unwritten portions of blocks MUST be zero filled.
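The merge rules described in this section can be simulated with
in-memory buffers standing in for the RDMA-accessible regions. The
sketch below is illustrative only: cow_write and its parameters are
hypothetical names, both extents are assumed to start at file offset
zero and to be a whole number of blocks long, and plain slice
assignments stand in for RDMA READ and RDMA WRITE operations.

```python
def cow_write(ro_data, rw_data, initialized, offset, payload, block_size):
    """Write `payload` at `offset`, block by block.

    ro_data:     bytes behind the READ_DATA extent, or None if the
                 range is not subject to copy-on-write
    rw_data:     bytearray behind the INVALID_DATA extent
    initialized: set of block indexes already written (that is, now
                 in the READ_WRITE_DATA state)
    """
    if not payload:
        return
    end = offset + len(payload)
    for blk in range(offset // block_size, (end - 1) // block_size + 1):
        b_start, b_end = blk * block_size, (blk + 1) * block_size
        lo, hi = max(offset, b_start), min(end, b_end)
        if lo == b_start and hi == b_end:
            block = bytearray(block_size)              # whole block: no fetch
        elif blk in initialized:
            block = bytearray(rw_data[b_start:b_end])  # already initialized
        elif ro_data is not None:
            block = bytearray(ro_data[b_start:b_end])  # copy-on-write merge
        else:
            block = bytearray(block_size)              # zero fill
        block[lo - b_start:hi - b_start] = payload[lo - offset:hi - offset]
        rw_data[b_start:b_end] = block                 # whole-block write
        initialized.add(blk)
```

After a call, the block indexes added to the set are the ones a
client would subsequently report via LAYOUTCOMMIT.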
2.4.6. Extents are Permissions
Layout extents returned to pNFS clients grant permission to read or
write; PNFS_RDMA_READ_DATA and PNFS_RDMA_NONE_DATA are read-only
(PNFS_RDMA_NONE_DATA reads as zeros), PNFS_RDMA_READ_WRITE_DATA and
PNFS_RDMA_INVALID_DATA are read/write (PNFS_RDMA_INVALID_DATA reads
as zeros, any write converts it to PNFS_RDMA_READ_WRITE_DATA). This
is the only means a client has of obtaining permission to perform
direct I/O to storage devices; a pNFS client MUST NOT perform direct
I/O operations that are not permitted by an extent held by the
client. Client adherence to this rule places the pNFS server in
control of potentially conflicting storage device operations,
enabling the server to determine what does conflict and how to avoid
conflicts by granting and recalling extents to/from clients.
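A client-side guard that enforces this rule before issuing any direct
I/O might look as follows. This is a hedged sketch: io_permitted is a
hypothetical name, and extents are represented as simple
(file_offset, length, state) tuples rather than the full XDR
structure.

```python
READ_WRITE, READ, INVALID, NONE = 0, 1, 2, 3   # pnfs_rdma_extent_state4
READABLE = {READ_WRITE, READ, INVALID, NONE}   # INVALID and NONE read as zeros
WRITABLE = {READ_WRITE, INVALID}


def io_permitted(extents, offset, length, write):
    """True if held extents grant permission for the whole byte range.

    extents: iterable of (file_offset, length, state) tuples.
    """
    allowed = WRITABLE if write else READABLE
    pos, end = offset, offset + length
    for e_off, e_len, _state in sorted(e for e in extents if e[2] in allowed):
        if e_off > pos:
            break                      # gap: part of the range is not covered
        pos = max(pos, e_off + e_len)
        if pos >= end:
            return True
    return pos >= end
```

A request that is not fully covered would have to be preceded by a
LAYOUTGET or be sent through the metadata server instead.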
If a client makes a layout request that conflicts with an existing
layout delegation, the request will be rejected with the error
NFS4ERR_LAYOUTTRYLATER. The client is then expected to retry the
request after a short interval. During this interval, the server
SHOULD recall the conflicting portion of the layout delegation from
the client that currently holds it. This reject-and-retry approach
does not prevent client starvation when there is contention for the
layout of a particular file. For this reason, a pNFS server SHOULD
implement a mechanism to prevent starvation. One possibility is that
the server can maintain a queue of rejected layout requests. Each
new layout request can be checked to see if it conflicts with a
previous rejected request, and if so, the newer request can be
rejected. Once the original requesting client retries its request,
its entry in the rejected request queue can be cleared, or the entry
in the rejected request queue can be removed when it reaches a
certain age.
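One possible shape for such a rejected-request queue is sketched
below. Everything here is an assumption made for illustration: the
class and method names are hypothetical, one entry is kept per client,
entries are ordered by a sequence number so that a retrying client is
blocked only by requests rejected before its own, and the caller
supplies the result of the check against currently granted layouts.

```python
import itertools
import time


class RejectedRequestQueue:
    def __init__(self, max_age=5.0, clock=time.monotonic):
        self._max_age = max_age
        self._clock = clock
        self._seq = itertools.count()
        self._entries = {}   # client -> (seq, first_rejected_at, offset, length)

    def admit(self, client, offset, length, conflicts_with_granted):
        """Return True to grant, False to reply with a try-later error."""
        now = self._clock()
        # Drop entries that have reached the maximum age.
        self._entries = {c: e for c, e in self._entries.items()
                         if now - e[1] <= self._max_age}
        # A retry clears the client's own entry but keeps its place in line.
        prior = self._entries.pop(client, None)
        seq, since = (prior[0], prior[1]) if prior else (next(self._seq), now)
        end = offset + length
        queued_ahead = any(s < seq and o < end and offset < o + l
                           for s, _, o, l in self._entries.values())
        if conflicts_with_granted or queued_ahead:
            self._entries[client] = (seq, since, offset, length)
            return False
        return True
```

Keeping the original sequence number and timestamp on a repeated
rejection means a client neither loses its place nor stays queued
forever.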
NFSv4 supports mandatory locks and share reservations. These are
mechanisms that clients can use to restrict the set of I/O operations
that are permissible to other clients. Since all I/O operations
ultimately arrive at the NFSv4 server for processing, the server is
in a position to enforce these restrictions. However, with pNFS
layouts, I/Os will be issued from the clients that hold the layouts
directly to the storage devices that host the data. These devices
have no knowledge of files, mandatory locks, or share reservations,
and are not in a position to enforce such restrictions. For this
reason the NFSv4 server MUST NOT grant layouts that conflict with
mandatory locks or share reservations. Further, if a conflicting
mandatory lock request or a conflicting open request arrives at the
server, the server MUST recall the part of the layout in conflict
with the request before granting the request.
2.4.7. End-of-file Processing
The end-of-file location can be changed in two ways: implicitly as
the result of a WRITE or LAYOUTCOMMIT beyond the current end-of-file,
or explicitly as the result of a SETATTR request. Typically, when a
file is truncated by an NFSv4 client via the SETATTR call, the server
frees any disk blocks belonging to the file that are beyond the new
end-of-file byte, and MUST write zeros to the portion of the new end-
of-file block beyond the new end-of-file byte. These actions render
any pNFS layouts that refer to the blocks that are freed or written
semantically invalid. Therefore, the server MUST recall from clients
the portions of any pNFS layouts that refer to blocks that will be
freed or written by the server before effecting the file truncation.
These recalls may take time to complete; as explained in [RFC5661],
if the server cannot respond to the client SETATTR request in a
reasonable amount of time, it SHOULD reply to the client with the
error NFS4ERR_DELAY.
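The set of layout ranges that a truncation affects can be computed as
sketched below. The function name and the tuple representation are
hypothetical, and the sketch deliberately ignores the special handling
of PNFS_RDMA_INVALID_DATA blocks beyond end-of-file; it assumes that
every block from the one containing the new end-of-file byte onward is
either freed or partially zeroed.

```python
def ranges_to_recall(layouts, new_size, block_size):
    """Return the (client, offset, length) portions touched by a truncate.

    layouts: iterable of (client, offset, length) for outstanding layouts.
    """
    # First block that is freed or (if new_size is unaligned) zeroed.
    start = (new_size // block_size) * block_size
    recalls = []
    for client, off, length in layouts:
        end = off + length
        if end > start:
            lo = max(off, start)
            recalls.append((client, lo, end - lo))
    return recalls
```

The server would issue a CB_LAYOUTRECALL for each returned range and
complete the SETATTR only after those recalls have been satisfied.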
Blocks in the PNFS_RDMA_INVALID_DATA state that lie beyond the new
end-of-file block present a special case. The server has reserved
these blocks for use by a pNFS client with a writable layout for the
file, but the client has yet to commit the blocks, and they are not
yet a part of the file mapping on disk. The server MAY free these
blocks while processing the SETATTR request. If so, the server MUST
recall any layouts from pNFS clients that refer to the blocks before
processing the truncate. If the server does not free the
PNFS_RDMA_INVALID_DATA blocks while processing the SETATTR request,
it need not recall layouts that refer only to the
PNFS_RDMA_INVALID_DATA blocks.
When a file is extended implicitly by a WRITE or LAYOUTCOMMIT beyond
the current end-of-file, or extended explicitly by a SETATTR request,
the server need not recall any portions of any pNFS layouts.
2.4.8. Layout Hints
The layout hint attribute specified in [RFC5661] is not supported by
the RDMA layout, and the pNFS server MUST reject setting a layout
hint attribute with a loh_type value of LAYOUT4_RDMA during OPEN or
SETATTR operations. On a file system only supporting the RDMA
layout, a server MUST NOT report the layout_hint attribute in the
supported_attrs attribute.
2.5. Crash Recovery Issues
A critical requirement in crash recovery is that both the client and
the server know when the other has failed. Additionally, it is
required that a client sees a consistent view of data across server
restarts. These requirements and a full discussion of crash recovery
issues are covered in the "Crash Recovery" section of the NFSv4.1
specification [RFC5661]. This document contains additional crash
recovery material specific only to the RDMA layout.
When the server crashes while the client holds a writable layout, and
the client has written data to blocks covered by the layout, and the
blocks are still in the PNFS_RDMA_INVALID_DATA state, the client has
two options for recovery. If the data that has been written to these
blocks is still cached by the client, the client can simply re-write
the data via NFSv4, once the server has come back online. However,
if the data is no longer in the client's cache, the client MUST NOT
attempt to source the data from the data servers. Instead, it SHOULD
attempt to commit the blocks in question to the server during the
server's recovery grace period, by sending a LAYOUTCOMMIT with the
"loca_reclaim" flag set to true. This process is described in detail
in Section 18.42.4 of [RFC5661].
2.6. Transient and Permanent Errors
The server may respond to LAYOUTGET with a variety of error statuses.
These errors can convey transient conditions or more permanent
conditions that are unlikely to be resolved soon.
The error NFS4ERR_RECALLCONFLICT indicates that the server has
recently issued a CB_LAYOUTRECALL to the requesting client, making it
necessary for the client to respond to the recall before processing
the layout request. A client can wait for that recall to be received
and processed, or it can retry as for NFS4ERR_TRYLATER, as described
below.
The error NFS4ERR_TRYLATER is used to indicate that the server cannot
immediately grant the layout to the client. This may be due to
constraints on writable sharing of blocks by multiple clients or to a
conflict with a recallable lock (e.g. a delegation). In either case,
a reasonable approach for the client is to wait several milliseconds
and retry the request. The client SHOULD track the number of
retries, and if forward progress is not made, the client SHOULD
abandon the attempt to get a layout and perform READ and WRITE
operations by sending them to the server.
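A minimal sketch of the bounded retry behavior described above
follows. The function name, the retry limit, the 5 ms delay, and the
string return values are all assumptions chosen for illustration;
this document only says to wait several milliseconds and to give up
when no forward progress is made. NFS4ERR_RECALLCONFLICT is handled
here by the same retry path, which is one of the two options the
text allows.

```python
import time

# Statuses that this sketch treats as "wait and retry".
RETRYABLE = ("NFS4ERR_TRYLATER", "NFS4ERR_RECALLCONFLICT")


def layoutget_with_retry(send_layoutget, max_retries=5, delay_s=0.005,
                         sleep=time.sleep):
    """Call send_layoutget() until it succeeds or retries run out.

    send_layoutget is a caller-supplied callable returning a status
    string. Returns "USE_LAYOUT" on success, "IO_VIA_SERVER" when the
    retry budget is exhausted, and "UNHANDLED:<status>" for any status
    this sketch does not cover.
    """
    for attempt in range(max_retries + 1):
        status = send_layoutget()
        if status == "NFS4_OK":
            return "USE_LAYOUT"
        if status not in RETRYABLE:
            return "UNHANDLED:" + status
        if attempt < max_retries:
            sleep(delay_s)  # wait a few milliseconds before retrying
    # No forward progress: abandon the layout and send READ and WRITE
    # operations to the server instead.
    return "IO_VIA_SERVER"
```

Passing the sleep function as a parameter keeps the sketch easy to
exercise without real delays.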
The error NFS4ERR_LAYOUTUNAVAILABLE MAY be returned by the server if
layouts are not supported for the requested file or its containing
file system. The server MAY also return this error code if the
server is in the process of migrating the file from secondary storage,
there is a conflicting lock that would prevent the layout from being
granted, or for any other reason that causes the server to be unable
to supply the layout. As a result of receiving
NFS4ERR_LAYOUTUNAVAILABLE, the client SHOULD abandon the attempt to
get a layout and perform READ and WRITE operations by sending them to
the MDS. It is expected that a client will not cache the file's
layoutunavailable state forever. In particular, when the file is
closed or opened by the client, issuing a new LAYOUTGET is
appropriate.
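One way a client might track the layoutunavailable state without
caching it forever is sketched below. The class and method names are
hypothetical, and keying the state by an opaque file handle value is
an assumption of the example.

```python
class LayoutAvailability:
    """Remembers which files recently returned
    NFS4ERR_LAYOUTUNAVAILABLE, and forgets that on open or close."""

    def __init__(self):
        self._unavailable = set()

    def on_layoutget_status(self, fh, status):
        """Record a LAYOUTGET result and say where I/O should go."""
        if status == "NFS4ERR_LAYOUTUNAVAILABLE":
            self._unavailable.add(fh)
            return "IO_VIA_MDS"
        if status == "NFS4_OK":
            self._unavailable.discard(fh)
            return "USE_LAYOUT"
        return "UNHANDLED:" + status

    def should_try_layoutget(self, fh):
        """False while the file is remembered as layout-unavailable."""
        return fh not in self._unavailable

    def on_open_or_close(self, fh):
        """Opening or closing the file makes a new LAYOUTGET
        appropriate, so drop the remembered state."""
        self._unavailable.discard(fh)
```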
3. Security Considerations
The pNFS extension partitions the NFSv4.1+ file system protocol into
two parts, the control path and the data path (storage protocol).
The control path contains all the new operations described by this
extension; all existing NFSv4 security mechanisms and features apply
to the control path. The combination of components in a pNFS system
is required to preserve the security properties of NFSv4.1+ with
respect to an entity accessing data via a client, including security
countermeasures to defend against threats that NFSv4.1+ provides
defenses for in environments where these threats are considered
significant.
The metadata server enforces the file access-control policy at
LAYOUTGET time. The client should use suitable authorization
credentials for getting the layout for the requested iomode (READ or
RW) and the server verifies the permissions and ACL for these
credentials, possibly returning NFS4ERR_ACCESS if the client is not
allowed the requested iomode. If the LAYOUTGET operation succeeds,
the client receives, as part of the layout, a set of credentials
allowing it I/O access to the specified data files corresponding to
the requested iomode. When the client acts on I/O operations on
behalf of its local users, it MUST authenticate and authorize the
user by issuing respective OPEN and ACCESS calls to the metadata
server, similar to having NFSv4 data delegations. If access is
allowed, the client uses the corresponding (READ or RW) credentials
to perform the I/O operations at the data file's storage devices.
When the metadata server receives a request to change a file's
permissions or ACL, it SHOULD recall all layouts for that file and it
MUST fence off the clients holding outstanding layouts for the
respective file by implicitly invalidating the outstanding
credentials on all data files comprising the file before committing
to the new permissions and ACL. Doing this will ensure that clients
re-authorize their layouts according to the modified permissions and
ACL by requesting new layouts. Recalling the layouts in this case is
a courtesy of the server, intended to prevent clients from getting an
error on I/Os done after the client was fenced off.
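The ordering constraint in the paragraph above (recall as a courtesy,
fence before commit) can be sketched as follows. The function name
and the three callables are hypothetical placeholders for whatever
mechanisms a metadata server actually uses; only the order of the
steps is taken from the text.

```python
def change_file_permissions(file_id, new_acl, recall, fence, commit):
    """Apply a permission or ACL change on the metadata server.

    recall, fence, and commit are caller-supplied callables standing
    in for the server's real mechanisms.
    """
    # SHOULD: recall layouts first, so well-behaved clients return
    # them instead of hitting I/O errors after being fenced.
    recall(file_id)
    # MUST: invalidate outstanding credentials on the data files
    # before the new permissions take effect.
    fence(file_id)
    # Only now commit the new permissions and ACL.
    commit(file_id, new_acl)
```

Clients that still need access then request new layouts, which the
server evaluates against the updated permissions and ACL.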
4. IANA Considerations
IANA is requested to assign a new pNFS layout type in the pNFS Layout
Types Registry as follows (the value 6 is suggested):
   Layout Type Name: LAYOUT4_RDMA
   Value: 0x00000006
   RFC: RFCTBD10
   How: L (new layout type)
   Minor Versions: 1
5. References
5.1. Normative References
[LEGAL] IETF Trust, "Legal Provisions Relating to IETF Documents",
November 2008, <http://trustee.ietf.org/docs/
IETF-Trust-License-Policy.pdf>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC4506] Eisler, M., "XDR: External Data Representation Standard",
STD 67, RFC 4506, May 2006.
[RFC5661] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
"Network File System (NFS) Version 4 Minor Version 1
Protocol", RFC 5661, January 2010.
[RFC5662] Shepler, S., Ed., Eisler, M., Ed., and D. Noveck, Ed.,
"Network File System (NFS) Version 4 Minor Version 1
External Data Representation Standard (XDR) Description",
RFC 5662, January 2010.
[RFC8166] Lever, C., Simpson, W., and T. Talpey, "Remote Direct
Memory Access Transport for Remote Procedure Call Version
1", RFC 8166, June 2017.
5.2. Informative References
[IBARCH] InfiniBand Trade Association, "InfiniBand Architecture
Specification Volume 1 Release 1.3", March 2015.
[RFC5040] Recio, B., Ed., Metzler, B., Ed., Culley, P., Ed.,
Hilland, J., Ed., and D. Garcia, Ed., "A Remote Direct
Memory Access Protocol Specification", RFC 5040, October
2007.
[RFC5041] Shah, H., Ed., Pinkerton, J., Ed., Recio, B., Ed., and P.
Culley, Ed., "Direct Data Placement over Reliable
Transports", RFC 5041, October 2007.
Appendix A. RFC Editor Notes
[RFC Editor: please remove this section prior to publishing this
document as an RFC]
[RFC Editor: prior to publishing this document as an RFC, please
replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the
RFC number of this document]
Author's Address
Christoph Hellwig
Email: hch@lst.de