NFSv4 | C. Hellwig |
Internet-Draft | |
Intended status: Informational | October 23, 2014 |
Expires: April 26, 2015 |
Parallel NFS (pNFS) SCSI Layout
draft-hellwig-nfsv4-scsi-layout-00.txt
Parallel NFS (pNFS) extends Network File Sharing version 4 (RFC5661) to allow clients to directly access file data on the storage used by the NFSv4 server. This ability to bypass the server for data access can increase both performance and parallelism, but requires additional client functionality for data access, some of which is dependent on the class of storage used. The main pNFS operations document specifies storage-class-independent extensions to NFS, and the pNFS Block/Volume Layout (RFC5663) specifies the additional extensions for use of pNFS with block- and volume-based storage. This document provides extensions to the pNFS Block/Volume Layout document to provide reliable fencing and better device discoverability for SCSI-based shared storage devices.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 26, 2015.
Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
In the parallel Network File System (pNFS), the metadata server returns Layout Type structures that describe where file data is located. There are different Layout Types for different storage systems and methods of arranging data on storage devices. This document extends the pNFS Block/Volume Layout [RFC5663] with a closer integration into the SCSI Architecture Model ([SAM-4]) to provide a generic fencing method and more scalable device discovery.
This document only specifies an updated version of the layout-specific GETDEVICEINFO XDR response, and a new mandatory fencing method for SCSI devices, but refers to [RFC5663] for the basic principle of operation, as well as the layout-specific XDR data structures for the LAYOUTGET and LAYOUTCOMMIT operations. This document does not directly interact with [RFC6688], although the mechanisms described in this document also achieve the goals of [RFC6688], and do so in a more robust fashion that does not depend on the cooperation of the systems involved. Thus, the mechanisms specified in [RFC6688] are not necessary for a pNFS SCSI layout type implementation.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
The following definitions are provided for the purpose of providing an appropriate context for the reader.
The external data representation (XDR) description and scripts for extracting the XDR description are Code Components as described in Section 4 of "Legal Provisions Relating to IETF Documents" [LEGAL]. These Code Components are licensed according to the terms of Section 4 of "Legal Provisions Relating to IETF Documents".
This document contains the XDR [RFC4506] description of the NFSv4.1 SCSI layout protocol. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the NFSv4.1 SCSI layout:
#!/bin/sh
grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
That is, if the above script is stored in a file called "extract.sh", and this document is in a file called "spec.txt", then the reader can do:
sh extract.sh < spec.txt > nfs4_scsi_layout_prot.x
The effect of the script is to remove leading white space from each line, plus a sentinel sequence of "///".
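For readers without a POSIX shell, the same extraction can be approximated in Python. The following is a non-normative sketch that mirrors the grep and sed expressions of the script above; the function name "extract_xdr" is illustrative and not part of this document.

```python
import re


def extract_xdr(text: str) -> str:
    """Return the embedded XDR lines of a draft with the '///' sentinel removed."""
    out = []
    for line in text.splitlines():
        # Same filter as: grep '^ *///'
        if not re.match(r" *///", line):
            continue
        # Same as: sed 's?^ */// ??' followed by sed 's?^ *///$??'
        line = re.sub(r"^ */// ", "", line)
        line = re.sub(r"^ *///$", "", line)
        out.append(line)
    return "".join(item + "\n" for item in out)


if __name__ == "__main__":
    import sys
    # Usage: python extract.py < spec.txt > nfs4_scsi_layout_prot.x
    sys.stdout.write(extract_xdr(sys.stdin.read()))
```

Like the shell script, this only works when each sentinel-prefixed XDR line sits on its own line of the input.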
The embedded XDR file header follows. Subsequent XDR descriptions, with the sentinel sequence, are embedded throughout the document.
Note that the XDR code contained in this document depends on types from the NFSv4.1 nfs4_prot.x file [RFC5662]. This includes both nfs types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.
/// /*
///  * This code was derived from draft-hellwig-nfsv4-scsi-layout
///  * Please reproduce this note if possible.
///  */
/// /*
///  * Copyright (c) 2010 IETF Trust and the persons identified
///  * as the document authors. All rights reserved.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
///  * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */
///
/// /*
///  * nfs4_scsi_layout_prot.x
///  */
///
/// %#include "nfs4_block_layout_prot.x"
/// %#include "nfsv41.h"
///
The layout4 type is defined in [RFC5662] as follows:
enum layouttype4 {
    LAYOUT4_NFSV4_1_FILES   = 1,
    LAYOUT4_OSD2_OBJECTS    = 2,
    LAYOUT4_BLOCK_VOLUME    = 3,
    LAYOUT4_SCSI            = 0x80000005
[[RFC Editor: please modify the LAYOUT4_SCSI
  to be the layouttype assigned by IANA]]
};

struct layout_content4 {
    layouttype4             loc_type;
    opaque                  loc_body<>;
};

struct layout4 {
    offset4                 lo_offset;
    length4                 lo_length;
    layoutiomode4           lo_iomode;
    layout_content4         lo_content;
};
This document defines the structures associated with the layouttype4 value LAYOUT4_SCSI. [RFC5661] specifies the loc_body structure as an XDR type "opaque". The opaque layout is uninterpreted by the generic pNFS client layers, but must be interpreted by the Layout Type implementation. All structures behind this opaque value are identical to those defined in [RFC5663].
/// /*
///  * Code sets from SPC-3.
///  */
/// enum pnfs_scsi_code_set {
///     PS_CODE_SET_BINARY     = 1,
///     PS_CODE_SET_ASCII      = 2,
///     PS_CODE_SET_UTF8       = 3
/// };
///
/// /*
///  * Designator types taken from SPC-3.
///  *
///  * Other values are allocated in SPC-3, but not mandatory to
///  * implement or aren't logical unit names.
///  */
/// enum pnfs_scsi_designator_type {
///     PS_DESIGNATOR_EUI64    = 2,
///     PS_DESIGNATOR_NAA      = 3,
///     PS_DESIGNATOR_NAME     = 8
/// };
///
/// /*
///  * Logical unit name + reservation key.
///  */
/// struct pnfs_scsi_base_volume_info4 {
///     pnfs_scsi_code_set             sbv_code_set;
///     pnfs_scsi_designator_type      sbv_designator_type;
///     opaque                         sbv_lu_name<>;
///     uint32_t                       sbv_pr_key;
/// };
///
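As a non-normative illustration of how the "pnfs_scsi_base_volume_info4" structure appears on the wire, the following Python sketch encodes and decodes it using the generic XDR rules of [RFC4506] (4-byte big-endian integers and enums, variable-length opaque data padded to a multiple of four bytes). The function names are illustrative only; a real implementation would normally use code generated from the XDR description.

```python
import struct

PS_CODE_SET_BINARY = 1   # enum pnfs_scsi_code_set
PS_DESIGNATOR_NAA = 3    # enum pnfs_scsi_designator_type


def xdr_opaque(data: bytes) -> bytes:
    """Variable-length opaque: 4-byte length, data, zero padding to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def encode_base_volume_info(code_set, designator_type, lu_name, pr_key):
    """Encode sbv_code_set, sbv_designator_type, sbv_lu_name, sbv_pr_key."""
    return (struct.pack(">ii", code_set, designator_type)
            + xdr_opaque(lu_name)
            + struct.pack(">I", pr_key))


def decode_base_volume_info(buf: bytes):
    """Inverse of encode_base_volume_info(); returns a 4-tuple."""
    code_set, designator_type = struct.unpack_from(">ii", buf, 0)
    (name_len,) = struct.unpack_from(">I", buf, 8)
    lu_name = buf[12:12 + name_len]
    # Skip the opaque data plus its padding to reach sbv_pr_key.
    offset = 12 + name_len + (4 - name_len % 4) % 4
    (pr_key,) = struct.unpack_from(">I", buf, offset)
    return code_set, designator_type, lu_name, pr_key
```

The 8-byte NAA value used when exercising this sketch is an arbitrary example, not a value taken from this document.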
GETDEVICEINFO calls are handled exactly the same way as specified in [RFC5663]. The "pnfs_scsi_volume_type4" data structure returned by the server as the storage-protocol-specific opaque field da_addr_body in the "device_addr4" structure by a successful GETDEVICEINFO operation [RFC5661] is a strict superset of the "pnfs_block_volume_type" structure defined by [RFC5663].
SCSI targets implementing [SPC3] export unique logical unit names for each logical unit through the Device Identification VPD page, which can be obtained using the INQUIRY command. This document uses a subset of this information to identify logical units backing pNFS SCSI layouts. It is similar to the "Identification Descriptor Target Descriptor" specified in [SPC3], but limits the allowed values to those that uniquely identify a logical unit. Device Identification VPD page descriptors used to identify logical units for use with pNFS SCSI layouts must adhere to the following restrictions:
The "CODE SET" VPD page field is stored in the "sbv_code_set" field of the "pnfs_scsi_base_volume_info4" structure, the "DESIGNATOR TYPE" is stored in "sbv_designator_type", and the DESIGNATOR is stored in "sbv_lu_name". Due to the use of an XDR array, the "DESIGNATOR LENGTH" field does not need to be set separately. Only certain combinations of "sbv_code_set" and "sbv_designator_type" are valid; please refer to [SPC3] for details, and note that ASCII may be used as the code set for UTF-8 text that does not contain non-ASCII characters. Note that a Device Identification VPD page MAY contain multiple descriptors with the same association, code set and designator type. NFS clients thus MUST iterate the descriptors until a match for "sbv_code_set", "sbv_designator_type" and "sbv_lu_name" is found, or until the end of the VPD page.
Additionally the server returns a Persistent Reservation key in the "sbv_pr_key" field. See Section 2.2 for more details on the use of Persistent Reservations.
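The descriptor iteration required of clients can be sketched as follows. This non-normative Python example assumes the Device Identification VPD page layout as understood from [SPC3] (page code 83h, a two-byte page length, then descriptors with a four-byte header carrying code set, association, designator type and designator length); implementers should verify the offsets against [SPC3]. The check for an association value of 0 (addressed logical unit) is likewise an assumption of this sketch, and the function names are illustrative.

```python
def find_designator(vpd: bytes, code_set: int, designator_type: int,
                    designator: bytes) -> bool:
    """Walk all descriptors of a Device Identification VPD page for a match."""
    if len(vpd) < 4 or vpd[1] != 0x83:
        raise ValueError("not a Device Identification VPD page")
    end = min(len(vpd), 4 + int.from_bytes(vpd[2:4], "big"))
    off = 4
    while off + 4 <= end:
        d_code_set = vpd[off] & 0x0F            # CODE SET
        association = (vpd[off + 1] >> 4) & 0x03  # ASSOCIATION
        d_type = vpd[off + 1] & 0x0F            # DESIGNATOR TYPE
        length = vpd[off + 3]                   # DESIGNATOR LENGTH
        value = vpd[off + 4:off + 4 + length]
        if (association == 0 and d_code_set == code_set
                and d_type == designator_type and value == designator):
            return True
        # Several descriptors may share code set and type: keep going.
        off += 4 + length
    return False


def make_page(descriptors) -> bytes:
    """Test helper: build a page from (code_set, type, designator) tuples."""
    body = b"".join(bytes([cs, dt, 0, len(d)]) + d for cs, dt, d in descriptors)
    return bytes([0x00, 0x83]) + len(body).to_bytes(2, "big") + body
```

A client would pass the code set, designator type and logical unit name received in GETDEVICEINFO and try each candidate device until the function returns true.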
The pNFS SCSI server volume topology is expressed as an arbitrary combination of base volume types enumerated in the following data structures. The individual components of the topology are contained in an array and components may refer to other components by using array indices.
/// enum pnfs_scsi_volume_type4 {
///     PNFS_SCSI_VOLUME_SIMPLE =
///         PNFS_BLOCK_VOLUME_SIMPLE ,  /* invalid */
///     PNFS_SCSI_VOLUME_SLICE =        /* see RFC5663 */
///         PNFS_BLOCK_VOLUME_SLICE,
///     PNFS_SCSI_VOLUME_CONCAT =       /* see RFC5663 */
///         PNFS_BLOCK_VOLUME_CONCAT,
///     PNFS_SCSI_VOLUME_STRIPE =       /* see RFC5663 */
///         PNFS_BLOCK_VOLUME_STRIPE,
///     PNFS_SCSI_VOLUME_BASE = 4       /* SCSI LU */
/// };
///
///
/// union pnfs_scsi_volume4 switch (pnfs_scsi_volume_type4 type) {
///     case PNFS_SCSI_VOLUME_SIMPLE:
///         pnfs_block_simple_volume_info4 sv_simple_info;
///     case PNFS_SCSI_VOLUME_SLICE:
///         pnfs_block_slice_volume_info4 sv_slice_info;
///     case PNFS_SCSI_VOLUME_CONCAT:
///         pnfs_block_concat_volume_info4 sv_concat_info;
///     case PNFS_SCSI_VOLUME_STRIPE:
///         pnfs_block_stripe_volume_info4 sv_stripe_info;
///     case PNFS_SCSI_VOLUME_BASE:
///         pnfs_scsi_base_volume_info4 sv_base_info;
/// };
///
/// /* scsi layout specific type for da_addr_body */
/// struct pnfs_scsi_deviceaddr4 {
///     pnfs_scsi_volume4 sda_volumes<>; /* array of volumes */
/// };
///
All rules for ordering and formation of a "pnfs_scsi_deviceaddr4" structure are identical to those of a "pnfs_block_deviceaddr4" structure in [RFC5663], except that the new pnfs_scsi_base_volume_info4 PNFS_SCSI_VOLUME_BASE case is used in place of the pnfs_block_simple_volume_info4 PNFS_BLOCK_VOLUME_SIMPLE case as the base structure. A PNFS_BLOCK_VOLUME_SIMPLE element MUST NOT be referenced by a pnfs_scsi_deviceaddr4, but is preserved for XDR level compatibility.
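A receiver may want to sanity-check a decoded volume array before using it. The following non-normative Python sketch assumes the [RFC5663] ordering rule that slice, concat and stripe volumes refer only to lower-indexed array elements (with the last element being the root), and additionally rejects the SIMPLE type as required above. The "Volume" class is a simplified stand-in for the decoded "pnfs_scsi_volume4" union and is not defined by this document.

```python
from dataclasses import dataclass, field
from typing import List

# Values of enum pnfs_scsi_volume_type4 (SIMPLE..STRIPE follow RFC5663).
SIMPLE, SLICE, CONCAT, STRIPE, BASE = range(5)


@dataclass
class Volume:
    vtype: int
    refs: List[int] = field(default_factory=list)  # indices of components


def check_topology(volumes: List[Volume]) -> None:
    """Raise ValueError if the array is not a well-formed SCSI topology."""
    if not volumes:
        raise ValueError("empty volume array")
    for index, vol in enumerate(volumes):
        if vol.vtype == SIMPLE:
            raise ValueError("SIMPLE volumes are invalid in a SCSI layout")
        if vol.vtype == BASE:
            if vol.refs:
                raise ValueError("BASE volumes are leaves")
            continue
        if vol.vtype not in (SLICE, CONCAT, STRIPE):
            raise ValueError("unknown volume type %d" % vol.vtype)
        if not vol.refs or (vol.vtype == SLICE and len(vol.refs) != 1):
            raise ValueError("bad component count at index %d" % index)
        for ref in vol.refs:
            # Components must already have been defined earlier in the array.
            if not 0 <= ref < index:
                raise ValueError("forward reference at index %d" % index)
```

For example, two BASE volumes followed by a STRIPE over indices 0 and 1 pass the check, while a STRIPE that refers to its own index or a later one does not.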
The pNFS block protocol must handle situations in which a system failure, typically a network connectivity issue, requires the server to unilaterally revoke extents from one client in order to transfer the extents to another client. The pNFS server implementation MUST ensure that when resources are transferred to another client, they are not used by the client originally owning them, and this must be ensured against any possible combination of partitions and delays among all of the participants to the protocol (server, storage and client). [RFC5663] suggests either using LUN masking or relying on cooperative clients. The first approach requires the server and the storage device to have a common notion of a client, which is impossible when the NFS and storage connections do not share a network, and it requires a non-standardized control protocol between the MDS and the storage device. The second approach relies on a cooperative client, which is not robust.
Instead, this document specifies a new SCSI-specific fencing protocol using Persistent Reservations (PRs), similar to the fencing method used by existing shared disk file systems. By placing a PR of type "Exclusive Access – All Registrants" on each SCSI logical unit exported to pNFS clients, the MDS prevents access from any client that does not have an outstanding device ID that gives the client a reservation key to access the logical unit, and the MDS can revoke access to the logical unit at any time.
To allow fencing individual systems, each system MUST use a unique Persistent Reservation key. [SPC3] does not specify a way to generate keys. This document assigns the burden to generate unique keys to the MDS, which MUST generate a key for itself before exporting a volume, and one for each client that accesses a volume. The MDS MAY either generate a key for each client that accesses logical units exported by the MDS, or generate a key for each [logical unit, client] combination. In case of a single key per client, the MDS needs to be aware of the per-client fencing granularity.
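One possible way for an MDS to hand out unique keys is sketched below. This is a non-normative illustration, not a method mandated by this document: the "KeyAllocator" class, its parameters and the use of random keys are all assumptions. Keys are limited to 32 bits because "sbv_pr_key" is a uint32_t in the XDR of this document, and zero is avoided because a zero service action reservation key is used to remove a registration in [SPC3].

```python
import secrets


class KeyAllocator:
    """Hands out unique, nonzero 32-bit reservation keys (illustrative)."""

    def __init__(self, per_logical_unit: bool = False):
        # False: one key per client; True: one key per [LU, client] pair.
        self.per_logical_unit = per_logical_unit
        self._keys = {}
        self._used = set()
        # The MDS generates a key for itself before exporting a volume.
        self.mds_key = self._new_key()

    def _new_key(self) -> int:
        while True:
            key = secrets.randbits(32)
            if key != 0 and key not in self._used:
                self._used.add(key)
                return key

    def key_for(self, client_id: str, lu_name: bytes) -> int:
        """Return the key to place in sbv_pr_key for this client and LU."""
        scope = (client_id, lu_name if self.per_logical_unit else None)
        if scope not in self._keys:
            self._keys[scope] = self._new_key()
        return self._keys[scope]
```

With one key per client, preempting that key on one logical unit does not affect the client's registrations on other logical units, so an MDS using this mode has to repeat the fencing on every logical unit it wants to revoke; this is the granularity consideration mentioned above.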
Before returning a PNFS_SCSI_VOLUME_BASE volume to the client, the MDS needs to prepare the volume for fencing using PRs. This is done by registering the reservation key generated for the MDS with the device using the "PERSISTENT RESERVE OUT" command with a service action of "REGISTER", followed by a "PERSISTENT RESERVE OUT" command, with a service action of "RESERVE" and the type field set to 8h (Exclusive Access – All Registrants). To make sure all I_T nexus are registered, the MDS SHOULD set the "All Target Ports" (ALL_TG_PT) bit when registering the key, or otherwise ensure the registration is performed for each initiator port.
After a successful GETDEVICEINFO operation the client MUST register the reservation key returned in sbv_pr_key with the storage device by issuing a "PERSISTENT RESERVE OUT" command with a service action of REGISTER with the "SERVICE ACTION RESERVATION KEY" set to the reservation key returned in sbv_pr_key. To make sure all I_T nexus are registered, the client SHOULD set the "All Target Ports" (ALL_TG_PT) bit when registering the key, or otherwise ensure the registration is performed for each initiator port.
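The two commands issued by the MDS can be illustrated by building their CDBs and parameter lists. The following non-normative Python sketch assumes the PERSISTENT RESERVE OUT layout as understood from [SPC3]: a 10-byte CDB with operation code 5Fh, the service action in byte 1, scope and type in byte 2 and the parameter list length in bytes 5 to 8, plus a 24-byte parameter list holding the two 8-byte keys and the ALL_TG_PT bit in byte 20. These offsets, and the zero-extension of the 32-bit "sbv_pr_key" to an 8-byte key, are assumptions that should be checked against [SPC3]. The sketch only builds the byte strings; sending them to a device is platform specific and not shown.

```python
import struct

PR_OUT_OPCODE = 0x5F
SA_REGISTER = 0x00
SA_RESERVE = 0x01
TYPE_EXCLUSIVE_ACCESS_ALL_REGISTRANTS = 0x8
ALL_TG_PT = 0x04  # bit 2 of parameter list byte 20


def pr_out(service_action, key, sa_key, pr_type=0, all_tg_pt=False):
    """Return (cdb, parameter_list) for one PERSISTENT RESERVE OUT command."""
    flags = ALL_TG_PT if all_tg_pt else 0
    # RESERVATION KEY, SERVICE ACTION RESERVATION KEY, 4 obsolete bytes,
    # flags byte, 3 reserved/obsolete bytes: 24 bytes in total.
    params = struct.pack(">QQ4xB3x", key, sa_key, flags)
    # Scope (upper nibble of byte 2) is left at 0, i.e. logical unit scope.
    cdb = struct.pack(">BBB2xIB", PR_OUT_OPCODE, service_action & 0x1F,
                      pr_type & 0x0F, len(params), 0)
    return cdb, params


def mds_prepare(mds_key):
    """The REGISTER and RESERVE commands the MDS issues for a new volume."""
    register = pr_out(SA_REGISTER, 0, mds_key, all_tg_pt=True)
    reserve = pr_out(SA_RESERVE, mds_key, 0,
                     pr_type=TYPE_EXCLUSIVE_ACCESS_ALL_REGISTRANTS)
    return [register, reserve]
```

The REGISTER command carries the new key in the service action reservation key field with a zero reservation key, because the MDS has no existing registration at that point.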
When a client stops using a device earlier returned by GETDEVICEINFO it MUST unregister the earlier registered key by issuing a "PERSISTENT RESERVE OUT" command with a service action of "REGISTER" with the "RESERVATION KEY" set to the earlier registered reservation key.
In case of a non-responding client the MDS MUST fence the client by issuing a "PERSISTENT RESERVE OUT" command with the service action set to "PREEMPT" or "PREEMPT AND ABORT", the reservation key field set to the server's reservation key, and the service action reservation key field set to the reservation key associated with the non-responding client, and the type field set to 8h (Exclusive Access – All Registrants).
After the MDS preempts a client, all client I/O to the logical unit will fail. The client should at this point return any layout that refers to the device ID that points to the logical unit. Note that the client can distinguish I/O errors due to fencing from other errors based on the "RESERVATION CONFLICT" status. Refer to [SPC3] for details.
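The overall fencing behavior can be illustrated with a small in-memory model. This non-normative Python sketch is a deliberate simplification: a real device tracks registrations per I_T nexus rather than per key, and the "LogicalUnit" class and its methods are invented for illustration. It only shows that, under an "Exclusive Access, All Registrants" reservation, a preempted key loses access while other registrants are unaffected.

```python
class ReservationConflict(Exception):
    """Stands in for the SCSI RESERVATION CONFLICT status."""


class LogicalUnit:
    """Toy model of the Persistent Reservation state of one logical unit."""

    def __init__(self):
        self.registered = set()
        self.reserved = False

    def register(self, key):
        self.registered.add(key)

    def unregister(self, key):
        self.registered.discard(key)

    def reserve_all_registrants(self, key):
        if key not in self.registered:
            raise ReservationConflict()
        self.reserved = True

    def preempt(self, key, victim_key):
        # Only a registrant may preempt; the victim's registration is removed.
        if key not in self.registered:
            raise ReservationConflict()
        self.registered.discard(victim_key)

    def io(self, key):
        # With the reservation in place, only registrants may do I/O.
        if self.reserved and key not in self.registered:
            raise ReservationConflict()
        return "GOOD"


def client_io(lun, key):
    """Client-side view: map a fencing error to 'return the layouts'."""
    try:
        return lun.io(key)
    except ReservationConflict:
        return "FENCED: return layouts for this device ID"
```

In this model the MDS registers and reserves first, clients register the key they received, and a later preempt by the MDS makes only the victim's I/O fail.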
A client that detects I/O errors on the storage devices MUST commit through the MDS, return all outstanding layouts and forget any opened devices. Any device returned by a future LAYOUTGET will contain a new reservation key that can be used to gain access to the storage device.
The security considerations in [RFC5663] apply to this document as well.
IANA is requested to assign a new pNFS layout type in the pNFS Layout Types Registry as follows (the value 5 is suggested):
Layout Type Name: LAYOUT4_SCSI
Value: 0x00000005
RFC: RFCTBD10
How: L (new layout type)
Minor Versions: 1
David Black, Robert Elliott and Tom Haynes provided a thorough review of early drafts of this document, and their input led to the current form of the document.
[RFC Editor: please remove this section prior to publishing this document as an RFC]
[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the RFC number of this document]