Network File System Version 4 | C. Lever, Ed. |
Internet-Draft | Oracle |
Intended status: Standards Track | D. Noveck |
Expires: May 12, 2019 | NetApp |
November 8, 2018 |
RPC-over-RDMA Version 2 Protocol
draft-cel-nfsv4-rpcrdma-version-two-08
This document specifies a new version of the transport protocol that conveys Remote Procedure Call (RPC) messages on physical transports capable of Remote Direct Memory Access (RDMA). The new version of this protocol is extensible.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 12, 2019.
Copyright (c) 2018 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.
Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBARCH] is a technique for moving data efficiently between end nodes. By directing data into destination buffers as it is sent on a network and placing it using direct memory access implemented by hardware, the complementary benefits of faster transfers and reduced host overhead are obtained.
RPC-over-RDMA version 1 enables ONC RPC [RFC5531] messages to be conveyed on RDMA transports. That protocol is specified in [RFC8166]. RPC-over-RDMA version 1 is deployed and in use, although there are known shortcomings to this protocol:
To address these issues in a way that is compatible with existing RPC-over-RDMA version 1 deployments, a new version of the RPC-over-RDMA transport protocol is presented in this document.
This new version of RPC-over-RDMA is extensible, enabling OPTIONAL extensions to be added without impacting existing implementations. To enable protocol extension, the XDR definition for RPC-over-RDMA version 2 is organized differently than the definition of version 1. These changes, which are discussed in Section 10.1, do not affect the on-the-wire format.
In addition, RPC-over-RDMA version 2 contains a set of incremental changes that relieve certain performance constraints and enable recovery from certain abnormal corner cases. These changes include:
Because of the way in which RPC-over-RDMA version 2 builds upon the facilities present in RPC-over-RDMA version 1, a knowledge of the basic structure of RPC-over-RDMA version 1, as described in [RFC8166], is assumed in this document.
As in that document, the terms "RPC Payload Stream" and "Transport Header Stream" (defined in Section 3.2 of that document) are used to distinguish between an RPC message as defined by [RFC5531] and the header whose job it is to describe the RPC message and its associated memory resources. In that regard, the reader is assumed to understand how RDMA is used to transfer chunks between client and server, the use of Position-Zero Read chunks and Reply chunks to convey Long RPC messages, and the role of DDP-eligibility in constraining how data payloads are to be conveyed.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
Most RPC-over-RDMA version 2 data structures are derived from corresponding structures in RPC-over-RDMA version 1. As is typical for new versions of an existing protocol, the XDR data structures have new names and there are a few small changes in content. In some cases, there have been structural re-organizations to enable protocol extensibility.
<CODE BEGINS>

struct rpcrdma_common {
        uint32         rdma_xid;
        uint32         rdma_vers;
        uint32         rdma_credit;
        uint32         rdma_htype;
};

<CODE ENDS>
The rpcrdma_common prefix describes the first part of each RPC-over-RDMA transport header for version 2 and subsequent versions.
RPC-over-RDMA version 2's use of these first four words matches that of version 1 as required by [RFC8166]. However, there are important structural differences in the way that these words are described by the respective XDR descriptions.
These changes are part of a larger structural change in the XDR description of RPC-over-RDMA version 2 that enables a cleaner treatment of protocol extension. The XDR appearing in Section 6 reflects these changes, which are discussed in further detail in Section 10.1.
The following prefix structure appears at the start of any RPC-over-RDMA version 2 transport header.

<CODE BEGINS>

const RPCRDMA2_F_RESPONSE = 0x00000001;

struct rpcrdma2_hdr_prefix {
        struct rpcrdma_common       rdma_start;
        uint32                      rdma_flags;
};

<CODE ENDS>
The rdma_flags is new to RPC-over-RDMA version 2. Currently, the only flag defined within this word is the RPCRDMA2_F_RESPONSE flag. The other bits are reserved for future use as described in Section 9.4. The sender MUST set these to zero.
The RPCRDMA2_F_RESPONSE flag qualifies the values contained in the transport header's rdma_start.rdma_xid and rdma_start.rdma_credit fields. The RPCRDMA2_F_RESPONSE flag enables a receiver to reliably avoid performing an XID lookup on incoming reverse direction Call messages, and to apply the value of the rdma_start.rdma_credit field correctly, based on the direction of the message being conveyed.
In general, when a message carries an XID that was generated by the message's receiver (that is, the receiver is acting as a requester), the message's sender sets the RPCRDMA2_F_RESPONSE flag. Otherwise that flag is clear. For example:
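The flag rule above can be sketched in code. The following is a minimal illustration, not text from the draft: it assumes the five words of rpcrdma2_hdr_prefix are encoded as consecutive big-endian XDR uint32 values, and the function names (pack_hdr_prefix, unpack_hdr_prefix, needs_xid_lookup) are hypothetical.

```python
import struct

RPCRDMA2_F_RESPONSE = 0x00000001  # the only flag currently defined


def pack_hdr_prefix(xid, vers, credit, htype, is_response):
    """Encode rpcrdma2_hdr_prefix as five big-endian XDR uint32 words.

    is_response is True when the XID was generated by the receiver of
    this message, i.e. the receiver is acting as a requester.
    """
    # All reserved flag bits are left at zero, as the sender MUST do.
    flags = RPCRDMA2_F_RESPONSE if is_response else 0
    return struct.pack(">5I", xid, vers, credit, htype, flags)


def unpack_hdr_prefix(buf):
    """Decode the prefix into a dict (hypothetical representation)."""
    xid, vers, credit, htype, flags = struct.unpack_from(">5I", buf, 0)
    return {
        "xid": xid,
        "vers": vers,
        "credit": credit,
        "htype": htype,
        "is_response": bool(flags & RPCRDMA2_F_RESPONSE),
    }


def needs_xid_lookup(hdr):
    # Only a message flagged as a response carries an XID that this
    # receiver generated, so only then is an XID lookup meaningful.
    return hdr["is_response"]
```

With this layout, a receiver can decide how to interpret the XID and credit words before examining anything beyond the fifth word of the header.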
<CODE BEGINS>

struct rpcrdma2_chunk_lists {
        uint32                      rdma_inv_handle;
        struct rpcrdma2_read_list   *rdma_reads;
        struct rpcrdma2_write_list  *rdma_writes;
        struct rpcrdma2_write_chunk *rdma_reply;
};

<CODE ENDS>
The rpcrdma2_chunk_lists structure specifies how an RPC message is conveyed using explicit RDMA operations.
For the most part, this structure parallels its RPC-over-RDMA version 1 equivalent. That is, rdma_reads, rdma_writes, and rdma_reply provide, respectively, descriptions of the chunks used to read a Long request or directly placed data from the requester, to write directly placed response data into the requester's memory, and to write a Long reply into the requester's memory.
An important addition relative to the corresponding RPC-over-RDMA version 1 rdma_header structures is the rdma_inv_handle field. This field supports remote invalidation of requester memory registrations via the RDMA Send With Invalidate operation.
To request Remote Invalidation, a requester sets the value of the rdma_inv_handle field in an RPC Call's transport header to a non-zero value that matches one of the rdma_handle fields in that header. If none of the rdma_handle values in the header conveying the Call may be invalidated by the responder, the requester sets the RPC Call's rdma_inv_handle field to the value zero.
If the responder chooses not to use remote invalidation for this particular RPC Reply, or the RPC Call's rdma_inv_handle field contains the value zero, the responder uses RDMA Send to transmit the matching RPC reply.
If a requester has provided a non-zero value in the RPC Call's rdma_inv_handle field and the responder chooses to use Remote Invalidation for the matching RPC Reply, the responder uses RDMA Send With Invalidate to transmit that RPC reply, and uses the value in the corresponding Call's rdma_inv_handle field to construct the Send With Invalidate Work Request.
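The Remote Invalidation rules above can be summarized as two small decisions, one on each side. The sketch below is illustrative only: the function names, the set of handles that may be invalidated, and the string labels for the two Send operations are assumptions, not part of the protocol.

```python
def choose_inv_handle(registered_handles, invalidatable):
    """Requester side: pick a value for rdma_inv_handle.

    registered_handles: rdma_handle values present in this Call's header.
    invalidatable: the subset the responder is permitted to invalidate
    (how this subset is determined is implementation-specific).
    Returns a matching non-zero handle, or zero if there is none.
    """
    for handle in registered_handles:
        if handle != 0 and handle in invalidatable:
            return handle
    return 0


def choose_send_op(inv_handle, responder_wants_invalidate):
    """Responder side: select the operation used to transmit the Reply.

    Returns (operation label, handle to invalidate or None).
    """
    if inv_handle != 0 and responder_wants_invalidate:
        return ("SEND_WITH_INVALIDATE", inv_handle)
    # Either the requester did not offer a handle or the responder
    # chose not to use Remote Invalidation for this Reply.
    return ("SEND", None)
```

Note that the responder's choice is always optional: a non-zero rdma_inv_handle permits, but does not require, the use of RDMA Send With Invalidate.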
RPC-over-RDMA version 2 provides a mechanism for connection endpoints to communicate information about implementation properties, enabling compatible endpoints to optimize data transfer. Initially, only a small set of transport properties is defined, and a single operation is provided to exchange transport properties (see Section 5.3.4).
Both the set of transport properties and the operations used to communicate may be extended. Within RPC-over-RDMA version 2, all such extensions are OPTIONAL. For information about existing transport properties, see Sections 4.1 through 4.2. For discussion of extensions to the set of transport properties, see Section 9.2.
A basic set of receiver and sender properties is specified in this document. An extensible approach is used, allowing new properties to be defined in future Standards Track documents.
Such properties are specified using:
The following XDR types are used by operations that deal with transport properties:

<CODE BEGINS>

typedef uint32 rpcrdma2_propid;

struct rpcrdma2_propval {
        rpcrdma2_propid rdma_which;
        opaque          rdma_data<>;
};

typedef rpcrdma2_propval rpcrdma2_propset<>;
typedef uint32 rpcrdma2_propsubset<>;

<CODE ENDS>
An rpcrdma2_propid specifies a particular transport property. In order to facilitate XDR extension of the set of properties by concatenating XDR definition files, specific properties are defined as const values rather than as elements in an enum.
An rpcrdma2_propval specifies the value of a particular transport property. The property is identified by rdma_which, and the associated value of that property is contained within rdma_data.
An rdma_data field of zero length is interpreted as indicating the default value of the property indicated by rdma_which.
While rdma_data is defined as opaque within the XDR, the contents are interpreted (except when of length zero) using the XDR typedef associated with the property specified by rdma_which. As a result, when rpcrdma2_propval does not conform to that typedef, the receiver is REQUIRED to return the error RDMA2_ERR_BAD_XDR using the header type RDMA2_ERROR as described in Section 5.3.3. For example, the receiver of a message containing a valid rpcrdma2_propval returns this error if the length of rdma_data is such that it extends beyond the bounds of the message being transferred.
In cases in which the rpcrdma2_propid specified by rdma_which is understood by the receiver, the receiver also MUST report the error RDMA2_ERR_BAD_XDR if either of the following occur:
Note that no error is to be reported if rdma_which is unknown to the receiver. In that case, that rpcrdma2_propval is not processed and processing continues using the next rpcrdma2_propval, if any.
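The processing rules for rpcrdma2_propval can be illustrated with a small decoder. This is a sketch under stated assumptions rather than a reference implementation: it assumes standard XDR encoding (big-endian uint32 values, variable-length opaque data padded to a multiple of four octets), it knows only the two properties defined in this document, and the BadXdr exception is a hypothetical stand-in for reporting RDMA2_ERR_BAD_XDR.

```python
import struct

RDMA2_PROPID_RBSIZ = 1
RDMA2_PROPID_BRS = 2
# Default values taken from the table of basic transport properties.
DEFAULTS = {RDMA2_PROPID_RBSIZ: 4096, RDMA2_PROPID_BRS: 1}


class BadXdr(Exception):
    """Hypothetical stand-in for reporting RDMA2_ERR_BAD_XDR."""


def _u32(buf, off):
    if off + 4 > len(buf):
        raise BadXdr("truncated uint32")
    return struct.unpack_from(">I", buf, off)[0], off + 4


def decode_propset(buf, off=0):
    """Decode an rpcrdma2_propset into {propid: value} for known ids."""
    count, off = _u32(buf, off)
    props = {}
    for _ in range(count):
        which, off = _u32(buf, off)
        length, off = _u32(buf, off)
        padded = (length + 3) & ~3  # XDR pads opaque data to 4 octets
        if off + padded > len(buf):
            raise BadXdr("rdma_data extends beyond the message")
        data = buf[off:off + length]
        off += padded
        if which not in DEFAULTS:
            continue  # unknown property: skipped, no error reported
        if length == 0:
            props[which] = DEFAULTS[which]  # zero length means default
            continue
        if length != 4:
            raise BadXdr("value does not conform to the property typedef")
        value = struct.unpack(">I", data)[0]
        if which == RDMA2_PROPID_BRS and value not in (0, 1, 2):
            raise BadXdr("not a valid rpcrdma2_rvreqsup value")
        props[which] = value
    return props
```

The decoder checks message bounds before it checks whether a property is known, so a malformed length is reported even when rdma_which is not recognized.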
An rpcrdma2_propset specifies a set of transport properties. No particular ordering of the rpcrdma2_propval items within it is imposed.
An rpcrdma2_propsubset identifies a subset of the properties in a previously specified rpcrdma2_propset. Each bit in the mask denotes a particular element in a previously specified rpcrdma2_propset. If a particular rpcrdma2_propval is at position N in the array, then bit number N mod 32 in word N div 32 specifies whether that particular rpcrdma2_propval is included in the defined subset. Words beyond the last one specified are treated as containing zero.
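The bit arithmetic for rpcrdma2_propsubset can be sketched as follows. The helper names are hypothetical, and the sketch assumes that bit number 0 is the least significant bit of each 32-bit word, which the text above does not state explicitly.

```python
def propsubset_from_positions(positions):
    """Build the list of uint32 mask words for the given positions."""
    words = []
    for n in positions:
        word, bit = divmod(n, 32)  # word N div 32, bit N mod 32
        while len(words) <= word:
            words.append(0)
        words[word] |= 1 << bit
    return words


def in_subset(words, n):
    """Report whether the propval at position n is in the subset."""
    word, bit = divmod(n, 32)
    if word >= len(words):
        return False  # words beyond the last one are treated as zero
    return bool(words[word] & (1 << bit))
```

For example, a subset containing positions 0 and 33 is represented by two words, with the low-order bit set in the first word and the second-lowest bit set in the second.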
Although the set of transport properties may be extended, a basic set of transport properties is defined in Table 1.
In that table, the columns contain the following information:
Property | Code | XDR type | Default | Sec |
---|---|---|---|---|
Receive Buffer Size | 1 | uint32 | 4096 | Section 4.2.1 |
Reverse Request Support | 2 | enum rpcrdma2_rvreqsup | RDMA2_RVREQSUP_INLINE | Section 4.2.2 |
<CODE BEGINS>

const uint32 RDMA2_PROPID_RBSIZ = 1;
typedef uint32 rpcrdma2_prop_rbsiz;

<CODE ENDS>
The Receive Buffer Size specifies the minimum size, in octets, of pre-posted receive buffers. It is the responsibility of the endpoint sending this value to ensure that its pre-posted receive buffers are at least the size specified, allowing the endpoint receiving this value to send messages that are of this size.
The sender may use its knowledge of the receiver's buffer size to determine when the message to be sent will fit in the pre-posted receive buffers that the receiver has set up. In particular,
<CODE BEGINS>

enum rpcrdma2_rvreqsup {
        RDMA2_RVREQSUP_NONE = 0,
        RDMA2_RVREQSUP_INLINE = 1,
        RDMA2_RVREQSUP_GENL = 2
};

const uint32 RDMA2_PROPID_BRS = 2;
typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs;

<CODE ENDS>
The value of this property is used to indicate a client implementation's readiness to accept and process messages that are part of reverse direction RPC requests.
Multiple levels of support are distinguished:
When information about this property is not provided, the support level of servers can be inferred from the reverse direction requests that they issue, assuming that issuing a request implicitly indicates support for receiving the corresponding reply. On this basis, support for receiving inline replies can be assumed when requests without Read chunks, Write chunks, or Reply chunks are issued, while requests with any of these elements allow the client to assume that general support for reverse direction replies is present on the server.
Each transport message consists of multiple sections:
This organization differs from that presented in the definition of RPC-over-RDMA version 1 [RFC8166], which presented the first and second of the items above as a single XDR item. The new organization is more in keeping with RPC-over-RDMA version 2's extensibility model in that new header types can be defined without modifying the existing set of header types.
The new header types within RPC-over-RDMA version 2 are set forth in Table 2. In that table, the columns contain the following information:
Operation | Code | XDR type | Msg | Sec |
---|---|---|---|---|
Convey Appended RPC Message | 0 | rpcrdma2_msg | Yes | Section 5.3.1 |
Convey External RPC Message | 1 | rpcrdma2_nomsg | No | Section 5.3.2 |
Report Transport Error | 4 | rpcrdma2_err | No | Section 5.3.3 |
Specify Properties at Connection | 5 | rpcrdma2_connprop | No | Section 5.3.4 |
Support for the operations in Table 2 is REQUIRED. Support for additional operations will be OPTIONAL. RPC-over-RDMA version 2 implementations that receive an OPTIONAL operation that is not supported MUST respond with an RDMA2_ERROR message with an error code of RDMA2_ERR_INVAL_HTYPE.
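The handling of unsupported header types can be sketched as a simple dispatch step. The dispatch function and its handler table are hypothetical; only the numeric codes are taken from this document.

```python
RDMA2_MSG, RDMA2_NOMSG, RDMA2_ERROR, RDMA2_CONNPROP = 0, 1, 4, 5
RDMA2_ERR_INVAL_HTYPE = 3


def dispatch(htype, handlers):
    """Route a received header type to its handler.

    handlers maps header type codes to zero-argument callables
    (a hypothetical structure chosen for this illustration).
    """
    handler = handlers.get(htype)
    if handler is None:
        # Unsupported header type: the reply is an RDMA2_ERROR
        # message carrying RDMA2_ERR_INVAL_HTYPE.
        return ("send_error", RDMA2_ERR_INVAL_HTYPE)
    return ("handled", handler())
```

An implementation that adds an OPTIONAL operation would register one more entry in the table; peers lacking that entry fall through to the error path without any change to the existing handlers.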
The header types defined and used in RPC-over-RDMA version 1 are all carried over into RPC-over-RDMA version 2, although there may be limited changes in the definition of existing header types.
In comparison with the header types of RPC-over-RDMA version 1, the changes can be summarized as follows:
<CODE BEGINS>

const rpcrdma2_proc RDMA2_MSG = 0;

struct rpcrdma2_msg {
        struct rpcrdma2_chunk_lists rdma_chunks;

        /* The rpc message starts here and continues
         * through the end of the transmission. */
        uint32                      rdma_rpc_first_word;
};

<CODE ENDS>
RDMA2_MSG is used to convey an RPC message that immediately follows the Transport Header in the Send buffer. This is either an RPC request that has no Position-Zero Read chunk or an RPC reply that is not sent using a Reply chunk.
<CODE BEGINS>

const rpcrdma2_proc RDMA2_NOMSG = 1;

struct rpcrdma2_nomsg {
        struct rpcrdma2_chunk_lists rdma_chunks;
};

<CODE ENDS>
RDMA2_NOMSG is used to convey an entire RPC message using explicit RDMA operations. Usually this is because the RPC message does not fit within the size limits that result from the receiver's inline threshold. The message may be a Long request, which is read from a memory area specified by a Position-Zero Read chunk; or a Long reply, which is written into a memory area specified by a Reply chunk.
<CODE BEGINS>

const rpcrdma2_proc RDMA2_ERROR = 4;

struct rpcrdma2_err_vers {
        uint32 rdma_vers_low;
        uint32 rdma_vers_high;
};

struct rpcrdma2_err_write {
        uint32 rdma_chunk_index;
        uint32 rdma_length_needed;
};

union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
        case RDMA2_ERR_VERS:
          rpcrdma2_err_vers rdma_vrange;
        case RDMA2_ERR_READ_CHUNKS:
          uint32 rdma_max_chunks;
        case RDMA2_ERR_WRITE_CHUNKS:
          uint32 rdma_max_chunks;
        case RDMA2_ERR_SEGMENTS:
          uint32 rdma_max_segments;
        case RDMA2_ERR_WRITE_RESOURCE:
          rpcrdma2_err_write rdma_writeres;
        case RDMA2_ERR_REPLY_RESOURCE:
          uint32 rdma_length_needed;
        default:
          void;
};

<CODE ENDS>
RDMA2_ERROR provides a way of reporting the occurrence of transport errors on a previous transmission. This header type MUST NOT be transmitted by a requester. [ cel: how is the XID field set when sending an error report from a requester, or when the error occurred on a non-RPC message? ]
Error reporting is addressed in RPC-over-RDMA version 2 in a fashion similar to RPC-over-RDMA version 1. Several new error codes are defined, and error messages never flow from requester to responder. RPC-over-RDMA version 1 error reporting is described in Section 5 of [RFC8166].
In all cases below, the responder copies the values of the rdma_start.rdma_xid and rdma_start.rdma_vers fields from the incoming transport header that generated the error to the transport header of the error response. The responder sets the rdma_start.rdma_htype field of the transport header prefix to RDMA2_ERROR, and the rdma_start.rdma_credit field is set to the credit grant value for this connection. The receiver of this header type MUST ignore the value of the rdma_start.rdma_credit field.
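The construction of an error response can be sketched as follows. This is an illustration under several assumptions: the incoming message is taken to be an RPC-over-RDMA version 2 message, fields are encoded as big-endian XDR uint32 values, the RPCRDMA2_F_RESPONSE flag is set because the XID was generated by the receiver of the error (an inference from the flag rule, not something stated in this section), and build_error is a hypothetical name.

```python
import struct

RDMA2_ERROR = 4
RDMA2_ERR_VERS = 1
RDMA2_ERR_BAD_XDR = 2
RPCRDMA2_F_RESPONSE = 0x00000001


def build_error(incoming, credit_grant, err, args=()):
    """Build an RDMA2_ERROR message answering the header in `incoming`.

    args holds the uint32 values of the selected union arm, for
    example (low, high) for RDMA2_ERR_VERS, or nothing for void arms.
    """
    # Copy rdma_xid and rdma_vers from the header that caused the error.
    xid, vers = struct.unpack_from(">2I", incoming, 0)
    prefix = struct.pack(">5I", xid, vers, credit_grant,
                         RDMA2_ERROR, RPCRDMA2_F_RESPONSE)
    # The union discriminant is followed by the arm's fields, if any.
    body = struct.pack(">I", err)
    body += b"".join(struct.pack(">I", a) for a in args)
    return prefix + body
```

Because every arm of rpcrdma2_error consists only of uint32 fields, a flat tuple of integers is enough to represent any of them in this sketch.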
The RDMA2_CONNPROP message type allows an RPC-over-RDMA endpoint, whether client or server, to indicate to its partner relevant transport properties that the partner might need to be aware of.
The message definition for this operation is as follows:

<CODE BEGINS>

struct rpcrdma2_connprop {
        rpcrdma2_propset rdma_props;
};

<CODE ENDS>
All relevant transport properties that the sender is aware of should be included in rdma_props. Since support of each of the properties is OPTIONAL, the sender cannot assume that the receiver will necessarily take note of these properties. The sender should be prepared for cases in which the receiver continues to assume that the default value for a particular property is still in effect.
Generally, a participant will send an RDMA2_CONNPROP message as the first message after a connection is established. Given that fact, the sender should make sure that the message can be received by peers who use the default Receive Buffer Size. The connection's initial receive buffer size is typically 1KB, but it depends on the initial connection state of the RPC-over-RDMA version in use.
Properties not included in rdma_props are to be treated by the peer endpoint as having the default value and are not allowed to change subsequently. The peer should not request changes in such properties.
Those receiving an RDMA2_CONNPROP may encounter properties that they do not support or are unaware of. In such cases, these properties are simply ignored without any error response being generated.
This section contains a description of the core features of the RPC-over-RDMA version 2 protocol expressed in the XDR language [RFC4506].
Because of the need to provide for protocol extensibility without modifying an existing XDR definition, this description has some important structural differences from the corresponding XDR description for RPC-over-RDMA version 1, which appears in [RFC8166].
This description is divided into three parts:
This description is provided in a way that makes it simple to extract into ready-to-compile form. To enable the combination of this description with the descriptions of subsequent extensions to RPC-over-RDMA version 2, the extracted description can be combined with similar descriptions published later, or those descriptions can be compiled separately. Refer to Section 6.2 for details.
Code components extracted from this document must include the following license text. When the extracted XDR code is combined with other complementary XDR code which itself has an identical license, only a single copy of the license text need be preserved.

<CODE BEGINS>

/// /*
///  * Copyright (c) 2010-2018 IETF Trust and the persons
///  * identified as authors of the code.  All rights reserved.
///  *
///  * The authors of the code are:
///  * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */
///

<CODE ENDS>
The reader can apply the following sed script to this document to produce a machine-readable XDR description of the RPC-over-RDMA version 2 protocol without any OPTIONAL extensions.

<CODE BEGINS>

sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'

<CODE ENDS>
That is, if this document is in a file called "spec.txt" then the reader can do the following to extract an XDR description file and store it in the file rpcrdma-v2.x.

<CODE BEGINS>

sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
     < spec.txt > rpcrdma-v2.x

<CODE ENDS>
Although this file is a usable description of the base protocol, when extensions are to be supported, it may be desirable to divide it into multiple files. The following script can be used for that purpose:

<CODE BEGINS>

#!/usr/local/bin/perl
# Split rpcrdma-v2.x at each "FILE ENDS: <name>;" marker line.
open(IN, "rpcrdma-v2.x") or die "rpcrdma-v2.x: $!";
open(OUT, ">temp.x") or die "temp.x: $!";
while (<IN>) {
    # Capture only the file name, not the trailing "; */".
    if (m/FILE ENDS: ([^;\s]+);/) {
        close(OUT);
        rename("temp.x", $1);
        open(OUT, ">temp.x") or die "temp.x: $!";
    } else {
        print OUT $_;
    }
}
close(IN);
close(OUT);

<CODE ENDS>
Running the above script will result in two files, common.x and baseops.x.
Optional extensions to RPC-over-RDMA version 2, published as Standards Track documents, will have similar means of providing XDR that describes those extensions. Once XDR for all desired extensions is also extracted, it can be appended to the XDR description file extracted from this document to produce a consolidated XDR description file reflecting all extensions selected for an RPC-over-RDMA implementation.
Alternatively, the XDR descriptions can be compiled separately. In this case, the combination of common.x and baseops.x serves to define the base transport, while the XDR description for each extension consists of the XDR from the document defining that extension together with the file common.x obtained from this document.
<CODE BEGINS>

/// /*******************************************************************
///  *    Transport Header Prefixes
///  ******************************************************************/
///
/// struct rpcrdma_common {
///         uint32         rdma_xid;
///         uint32         rdma_vers;
///         uint32         rdma_credit;
///         uint32         rdma_htype;
/// };
///
/// const RPCRDMA2_F_RESPONSE = 0x00000001;
///
/// struct rpcrdma2_hdr_prefix {
///         struct rpcrdma_common       rdma_start;
///         uint32                      rdma_flags;
/// };
///
/// /*******************************************************************
///  *    Chunks and Chunk Lists
///  ******************************************************************/
///
/// struct rpcrdma2_segment {
///         uint32 rdma_handle;
///         uint32 rdma_length;
///         uint64 rdma_offset;
/// };
///
/// struct rpcrdma2_read_segment {
///         uint32                  rdma_position;
///         struct rpcrdma2_segment rdma_target;
/// };
///
/// struct rpcrdma2_read_list {
///         struct rpcrdma2_read_segment rdma_entry;
///         struct rpcrdma2_read_list    *rdma_next;
/// };
///
/// struct rpcrdma2_write_chunk {
///         struct rpcrdma2_segment rdma_target<>;
/// };
///
/// struct rpcrdma2_write_list {
///         struct rpcrdma2_write_chunk rdma_entry;
///         struct rpcrdma2_write_list  *rdma_next;
/// };
///
/// struct rpcrdma2_chunk_lists {
///         uint32                      rdma_inv_handle;
///         struct rpcrdma2_read_list   *rdma_reads;
///         struct rpcrdma2_write_list  *rdma_writes;
///         struct rpcrdma2_write_chunk *rdma_reply;
/// };
///
/// /*******************************************************************
///  *    Transport Properties
///  ******************************************************************/
///
/// /*
///  * Types for transport properties model
///  */
/// typedef uint32 rpcrdma2_propid;
///
/// struct rpcrdma2_propval {
///         rpcrdma2_propid rdma_which;
///         opaque          rdma_data<>;
/// };
///
/// typedef rpcrdma2_propval rpcrdma2_propset<>;
/// typedef uint32 rpcrdma2_propsubset<>;
///
/// /*
///  * Transport propid values for basic properties
///  */
/// const uint32 RDMA2_PROPID_RBSIZ = 1;
/// const uint32 RDMA2_PROPID_BRS = 2;
///
/// /*
///  * Types specific to particular properties
///  */
/// typedef uint32 rpcrdma2_prop_rbsiz;
/// typedef rpcrdma2_rvreqsup rpcrdma2_prop_brs;
///
/// enum rpcrdma2_rvreqsup {
///         RDMA2_RVREQSUP_NONE = 0,
///         RDMA2_RVREQSUP_INLINE = 1,
///         RDMA2_RVREQSUP_GENL = 2
/// };
///
/// /* FILE ENDS: common.x; */

<CODE ENDS>
<CODE BEGINS>

/// /*******************************************************************
///  *    Descriptions of RPC-over-RDMA Header Types
///  ******************************************************************/
///
/// /*
///  * Header Type Codes.
///  */
/// const rpcrdma2_proc RDMA2_MSG = 0;
/// const rpcrdma2_proc RDMA2_NOMSG = 1;
/// const rpcrdma2_proc RDMA2_ERROR = 4;
/// const rpcrdma2_proc RDMA2_CONNPROP = 5;
///
/// /*
///  * Header Types to Convey RPC Messages.
///  */
/// struct rpcrdma2_msg {
///         struct rpcrdma2_chunk_lists rdma_chunks;
///
///         /* The rpc message starts here and continues
///          * through the end of the transmission. */
///         uint32                      rdma_rpc_first_word;
/// };
///
/// struct rpcrdma2_nomsg {
///         struct rpcrdma2_chunk_lists rdma_chunks;
/// };
///
/// /*
///  * Header Type to Report Errors.
///  */
/// const uint32 RDMA2_ERR_VERS = 1;
/// const uint32 RDMA2_ERR_BAD_XDR = 2;
/// const uint32 RDMA2_ERR_INVAL_HTYPE = 3;
/// const uint32 RDMA2_ERR_READ_CHUNKS = 4;
/// const uint32 RDMA2_ERR_WRITE_CHUNKS = 5;
/// const uint32 RDMA2_ERR_SEGMENTS = 6;
/// const uint32 RDMA2_ERR_WRITE_RESOURCE = 7;
/// const uint32 RDMA2_ERR_REPLY_RESOURCE = 8;
/// const uint32 RDMA2_ERR_SYSTEM = 9;
///
/// struct rpcrdma2_err_vers {
///         uint32 rdma_vers_low;
///         uint32 rdma_vers_high;
/// };
///
/// struct rpcrdma2_err_write {
///         uint32 rdma_chunk_index;
///         uint32 rdma_length_needed;
/// };
///
/// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
///         case RDMA2_ERR_VERS:
///           rpcrdma2_err_vers rdma_vrange;
///         case RDMA2_ERR_READ_CHUNKS:
///           uint32 rdma_max_chunks;
///         case RDMA2_ERR_WRITE_CHUNKS:
///           uint32 rdma_max_chunks;
///         case RDMA2_ERR_SEGMENTS:
///           uint32 rdma_max_segments;
///         case RDMA2_ERR_WRITE_RESOURCE:
///           rpcrdma2_err_write rdma_writeres;
///         case RDMA2_ERR_REPLY_RESOURCE:
///           uint32 rdma_length_needed;
///         default:
///           void;
/// };
///
/// /*
///  * Header Type to Exchange Transport Properties.
///  */
/// struct rpcrdma2_connprop {
///         rpcrdma2_propset rdma_props;
/// };
///
/// /* FILE ENDS: baseops.x; */

<CODE ENDS>
The two files common.x and baseops.x, when combined with the XDR descriptions for extensions defined later, produce a human-readable and compilable description of the RPC-over-RDMA version 2 protocol with the included extensions.
Although this XDR description can be useful in generating code to encode and decode the transport and payload streams, there are elements of the structure of RPC-over-RDMA version 2 which are not expressible within the XDR language as currently defined. This requires implementations that use the output of the XDR processor to provide additional code to bridge the gaps.
To summarize, the role of XDR in this specification is more limited than for protocols which are themselves XDR programs, where the totality of the protocol is expressible within the XDR paradigm established for that purpose. This more limited role reflects the fact that XDR lacks facilities to represent the embedding of transported material within the transport framework. In addition, the need to cleanly accommodate extensions has meant that those using rpcgen in their applications need to take a more active role in providing the facilities that cannot be expressed within XDR.
When an RPC-over-RDMA version 2 client establishes a connection to a server, its first order of business is to determine the server's highest supported protocol version.
As with RPC-over-RDMA version 1, upon connection establishment a client MUST NOT send more than a single RPC-over-RDMA message at a time until it receives a valid non-error RPC-over-RDMA message from the server that grants client credits.
The second word of each transport header is used to convey the transport protocol version. In the interest of simplicity, we refer to that word as rdma_vers even though in the RPC-over-RDMA version 2 XDR definition it is described as rdma_start.rdma_vers.
First, the client sends a single valid RPC-over-RDMA message with the value two (2) in the rdma_vers field. Because the server might support only RPC-over-RDMA version 1, this initial message can be no larger than the version 1 default inline threshold of 1024 bytes.
If the server does support RPC-over-RDMA version 2, it sends RPC-over-RDMA messages back to the client with the value two (2) in the rdma_vers field. Both peers may use the default inline threshold value for RPC-over-RDMA version 2 connections (4096 bytes).
If the server does not support RPC-over-RDMA version 2, it MUST send an RPC-over-RDMA message to the client with the same XID, with RDMA2_ERROR in the rdma_start.rdma_htype field, and with the error code RDMA2_ERR_VERS. This message also reports a range of protocol versions that the server supports. To continue operation, the client selects a protocol version in the range of server-supported versions for subsequent messages on this connection.
If the connection is lost immediately after an RDMA2_ERROR / RDMA2_ERR_VERS message is received, a client can avoid a possible version negotiation loop when re-establishing another connection by assuming that particular server does not support RPC-over-RDMA version 2. A client can assume the same situation (no server support for RPC-over-RDMA version 2) if the initial negotiation message is lost or dropped. Once the negotiation exchange is complete, both peers may use the default inline threshold value for the transport protocol version that has been selected.
If the server supports the RPC-over-RDMA protocol version used in Call messages from a client, it MUST send Replies with the same RPC-over-RDMA protocol version that the client uses to send its Calls. The client MUST NOT change the version for the duration of the connection.
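The client's side of this negotiation can be sketched as follows. This is an illustration only, not part of the protocol definition: the function names and the representation of the server's reported range as two integers (low, high) are assumptions, and choosing the highest mutually supported version is one possible policy (the text requires only that the client select a version within the server's range).

```python
from typing import Iterable, Optional

# Default inline thresholds in bytes, by protocol version, as given in the text.
DEFAULT_INLINE_THRESHOLD = {1: 1024, 2: 4096}


def initial_message_limit() -> int:
    """Size limit for the first message on a new connection.

    The server might support only version 1, so the first message must
    fit within the version 1 default inline threshold.
    """
    return DEFAULT_INLINE_THRESHOLD[1]


def select_version(client_versions: Iterable[int],
                   low: int, high: int) -> Optional[int]:
    """Pick a version after receiving RDMA2_ERROR / RDMA2_ERR_VERS.

    low and high are the bounds of the range reported by the server.
    Returns the highest client-supported version within that range, or
    None when there is no overlap (the connection cannot be used).
    """
    usable = [v for v in client_versions if low <= v <= high]
    return max(usable) if usable else None
```

For example, a client that supports versions 1 and 2 and is told that the server supports only version 1 continues with version 1 and its 1024-byte default inline threshold.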
This section describes the substantive changes made in RPC-over-RDMA version 2, as opposed to the structural changes to enable extensibility, which are discussed in Section 10.1.
RPC-over-RDMA version 2 provides a mechanism for exchanging the transport's operational properties. This mechanism allows connection endpoints to communicate the properties of their implementation at connection setup. The mechanism could be expanded to enable an endpoint to request changes in properties of the other endpoint and to notify peer endpoints of changes to properties that occur during operation. Transport properties are described in Section 4.
RPC-over-RDMA transports employ credit-based flow control to ensure that a requester does not emit more RDMA Sends than the responder is prepared to receive. Section 3.3.1 of [RFC8166] explains the purpose and operation of RPC-over-RDMA version 1 credit management in detail.
In the RPC-over-RDMA version 1 design, each RDMA Send from a requester contains an RPC Call with a credit request, and each RDMA Send from a responder contains an RPC Reply with a credit grant. The credit grant implies that enough Receives have been posted on the responder to handle the credit grant minus the number of pending RPC transactions (the number of remaining Receive buffers might be zero).
In other words, each RPC Reply acts as an implicit ACK for a previous RPC Call from the requester, indicating that the responder has posted a Receive to replace the Receive consumed by the requester's RDMA Send. Without an RPC Reply message, the requester has no way to know that the responder is properly prepared for subsequent RPC Calls.
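The version 1 accounting described above can be sketched from the requester's point of view. The class and method names here are hypothetical, and the model is deliberately simplified: it tracks only the most recent credit grant and the number of Calls awaiting a Reply.

```python
class RequesterCredits:
    """Simplified requester-side view of version 1 style credits."""

    def __init__(self) -> None:
        # Until the first grant arrives, only one message may be outstanding.
        self.granted = 1
        self.outstanding = 0

    def can_send(self) -> bool:
        """True if the responder is known to have a Receive available."""
        return self.outstanding < self.granted

    def on_call_sent(self) -> None:
        """Account for one RDMA Send carrying an RPC Call."""
        if not self.can_send():
            raise RuntimeError("no credits available; Call must wait")
        self.outstanding += 1

    def on_reply_received(self, credit_grant: int) -> None:
        """Account for one RPC Reply, which carries a new credit grant.

        The Reply is the implicit ACK for one earlier Call: it shows that
        the responder has replaced the Receive that Call consumed.
        """
        if self.outstanding == 0:
            raise RuntimeError("Reply received with no Call outstanding")
        self.outstanding -= 1
        self.granted = credit_grant
```

The sketch makes the dependency visible: the only event that ever lowers the outstanding count is the arrival of an RPC Reply.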
Aside from being a bit of a layering violation, there are basic (but rare) cases where this arrangement is inadequate:
Typically, the connection must be replaced in these cases. This resets the credit accounting mechanism but has an undesirable impact on other ongoing RPC transactions on that connection.
Because credit management accompanies each RPC message, there is a strict one-to-one ratio between RDMA Sends and RPC messages. There are interesting use cases that might be enabled if this relationship were more flexible:
Bi-directional RPC operation also introduces an ambiguity. If the RPC-over-RDMA message does not carry an RPC message, then it is not possible to determine whether the sender is a requester or a responder, and thus whether the rdma_credit field contains a credit request or a credit grant.
A more sophisticated credit accounting mechanism is provided in RPC-over-RDMA version 2 in an attempt to address some of these shortcomings. This new mechanism is detailed in Section TBD.
The term "inline threshold" is defined in Section 3.3.2 of [RFC8166]. An "inline threshold" value is the largest message size (in octets) that can be conveyed on an RDMA connection using only RDMA Send and Receive. Each connection has two inline threshold values: one for messages flowing from client-to-server (referred to as the "client-to-server inline threshold") and one for messages flowing from server-to-client (referred to as the "server-to-client inline threshold"). Note that [RFC8166] uses somewhat different terminology. This is because it was written with only forward-direction RPC transactions in mind.
A connection's inline thresholds determine when RDMA Read or Write operations are required because the RPC message to be sent cannot be conveyed via a single RDMA Send and Receive pair. When such a message does not contain DDP-eligible data items, a requester prepares a Long Call or Reply to convey the whole RPC message using RDMA Read or Write operations.
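A minimal sketch of this decision follows. The function name, its parameters, and the three result labels are assumptions made for illustration. The middle case (moving DDP-eligible data items into chunks so that the remainder fits inline) follows the chunk mechanism of [RFC8166], and the sketch ignores details such as XDR padding and transport header overhead.

```python
def choose_transfer(message_len: int, ddp_eligible_len: int,
                    inline_threshold: int) -> str:
    """Classify how an RPC message of message_len bytes is conveyed.

    ddp_eligible_len is the number of bytes in the message that belong
    to DDP-eligible data items (zero if there are none).
    """
    if message_len <= inline_threshold:
        # Fits in a single RDMA Send and Receive pair.
        return "inline"
    if (ddp_eligible_len > 0
            and message_len - ddp_eligible_len <= inline_threshold):
        # DDP-eligible items move via RDMA Read or Write; the rest is sent inline.
        return "chunks"
    # The whole RPC message is conveyed via RDMA Read or Write.
    return "long"
```

A 2000-byte message with no DDP-eligible data items is classified as "long" with a 1024-byte threshold and as "inline" with a 4096-byte threshold.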
RDMA Read and Write operations require that each data payload resides in a region of memory that is registered with the RNIC. When an RPC is complete, that region is invalidated, fencing it from the responder. Memory registration and invalidation typically have a latency cost that is insignificant compared to data handling costs. When a data payload is small, however, the cost of registering and invalidating the memory where the payload resides becomes a relatively significant part of total RPC latency. Therefore the most efficient operation of RPC-over-RDMA occurs when explicit RDMA Read and Write operations are used for large payloads, and are avoided for small payloads.
When RPC-over-RDMA version 1 was conceived, the typical size of RPC messages that did not involve a significant data payload was under 500 bytes. A 1024-byte inline threshold adequately minimized the frequency of inefficient Long Calls and Replies.
With NFS version 4.1 [RFC5661], the increased size of NFS COMPOUND operations resulted in RPC messages that are on average larger and more complex than those of previous versions of NFS. With 1024-byte inline thresholds, RDMA Read or Write operations are needed for frequent operations that do not bear a data payload, such as GETATTR and LOOKUP, reducing the efficiency of the transport.
To reduce the need to use Long Calls and Replies, RPC-over-RDMA version 2 increases the default size of inline thresholds. This also increases the maximum size of reverse-direction RPC messages.
An STag that is registered using the FRWR mechanism in a privileged execution context or is registered via a Memory Window in an unprivileged context may be invalidated remotely [RFC5040]. These mechanisms are available when a requester's RNIC supports MEM_MGT_EXTENSIONS.
For the purposes of this discussion, there are two classes of STags. Dynamically-registered STags are used in a single RPC, then invalidated. Persistently-registered STags live longer than one RPC. They may persist for the life of an RPC-over-RDMA connection, or longer.
An RPC-over-RDMA requester may provide more than one STag in one transport header. It may provide a combination of dynamically- and persistently-registered STags in one RPC message, or any combination of these in a series of RPCs on the same connection. Only dynamically-registered STags using Memory Windows or FRWR (i.e., registered via MEM_MGT_EXTENSIONS) may be invalidated remotely.
There is no transport-level mechanism by which a responder can determine how a requester-provided STag was registered, nor whether it is eligible to be invalidated remotely. A requester that mixes persistently- and dynamically-registered STags in one RPC, or mixes them across RPCs on the same connection, must therefore indicate which handles may be invalidated via a mechanism provided in the Upper Layer Protocol. RPC-over-RDMA version 2 provides such a mechanism.
The RDMA Send With Invalidate operation is used to invalidate an STag on a remote system. It is available only when a responder's RNIC supports MEM_MGT_EXTENSIONS, and must be utilized only when a requester's RNIC supports MEM_MGT_EXTENSIONS (can receive and recognize an IETH).
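The conditions in the preceding paragraphs can be summarized in a short sketch. All names are hypothetical. The logic is split in two because the responder cannot inspect how an STag was registered: the requester decides whether to mark an STag as eligible, and the responder acts only on that indication and on the capabilities of both RNICs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class STag:
    handle: int
    dynamic: bool    # True if registered for a single RPC, then invalidated
    mechanism: str   # e.g. "frwr" or "memory_window"; other values are ineligible


def requester_marks_eligible(stag: STag, requester_mme: bool) -> bool:
    """Requester side: may this STag be offered for remote invalidation?

    requester_mme is True when the requester's RNIC supports
    MEM_MGT_EXTENSIONS (it can receive and recognize an IETH).
    """
    return bool(requester_mme
                and stag.dynamic
                and stag.mechanism in ("frwr", "memory_window"))


def responder_may_send_with_invalidate(marked_eligible: bool,
                                       responder_mme: bool,
                                       requester_mme: bool) -> bool:
    """Responder side: may RDMA Send With Invalidate be used for this STag?"""
    return bool(marked_eligible and responder_mme and requester_mme)
```

Under this sketch, a persistently-registered STag is never marked eligible, so a responder that honors the requester's indication never invalidates it.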
Existing RPC-over-RDMA transport protocol specifications [RFC8166] [RFC8167] do not forbid direct data placement in the reverse direction, even though there is currently no Upper Layer Protocol that makes data items in reverse direction operations eligible for direct data placement.
When chunks are present in a reverse direction RPC request, Remote Invalidation allows the responder to trigger invalidation of a requester's STags as part of sending a reply, the same way as is done in the forward direction.
However, in the reverse direction, the server acts as the requester, and the client is the responder. The server's RNIC, therefore, must support receiving an IETH, and the server must have registered the STags with an appropriate registration mechanism.
RPC-over-RDMA version 2 expands the repertoire of errors that may be reported by connection endpoints. This change, which is structured to enable extensibility, allows a peer to report overruns of specific resources and to avoid requester retries when an error is permanent.
RPC-over-RDMA version 2 is designed to be extensible in a way that enables the addition of OPTIONAL features that may subsequently be converted to REQUIRED status in a future protocol version. The protocol may be extended by Standards Track documents in a way analogous to that provided for Network File System Version 4 as described in [RFC8178].
This form of extensibility enables limited extensions to the base RPC-over-RDMA version 2 protocol presented in this document so that new optional capabilities can be introduced without a protocol version change, while maintaining robust interoperability with existing RPC-over-RDMA version 2 implementations. The design allows extensions to be defined, including the definition of new protocol elements, without requiring modification or recompilation of the existing XDR.
A Standards Track document introduces each set of such protocol elements. Together these elements are considered an OPTIONAL feature. Each implementation is either aware of all the protocol elements introduced by that feature or is aware of none of them.
Documents describing extensions to RPC-over-RDMA version 2 should contain:
Implementers combine the XDR descriptions of the new features they intend to use with the XDR description of the base protocol in this document. This may be necessary to create a valid XDR input file because extensions are free to use XDR types defined in the base protocol, and later extensions may use types defined by earlier extensions.
The XDR description for the RPC-over-RDMA version 2 base protocol combined with that for any selected extensions should provide an adequate human-readable description of the extended protocol.
The base protocol specified in this document may be extended within RPC-over-RDMA version 2 in two ways:
The following sorts of ancillary protocol elements may be added to the protocol to support the addition of new transport properties and header types.
New capabilities can be proposed and developed independently of each other, and implementers can choose among them. This makes it straightforward to create and document experimental features and then bring them through the standards process.
New transport header types are to be defined in a manner similar to the way existing ones are described in Sections 5.3.1 through 5.3.4. Specifically, what is needed is:
In addition, further documentation is needed because of the OPTIONAL status of new transport header types.
The set of transport properties is designed to be extensible. As a result, once new properties are defined in standards track documents, the operations defined in this document may reference these new transport properties, as well as the ones described in this document.
A standards track document defining a new transport property should include the following information paralleling that provided in this document for the transport properties defined herein.
Transport property structures are defined so that unique values are easy to assign. There is no requirement that a contiguous set of values be used, and implementations should not rely on all such values being small integers. A unique value should be selected when the defining document is first published as an Internet-Draft. When the document becomes a Standards Track document, the working group should ensure that:
Documents defining new properties fall into a number of categories.
When additional transport properties are proposed, the review of the associated standards track document should deal with possible security issues raised by those new transport properties.
New error codes to be returned when using new header types may be introduced in the same Standards Track document that defines the new header type. [ cel: what about adding a new error code that is returned for an existing header type? ]
For error codes that do not require that additional error information be returned with them, the existing RDMA2_ERROR header type can be used to report the new error. The new error code is set as the value of rdma_err, with the result that the default switch arm of the rpcrdma2_error union (i.e., void) is selected.
For error codes that do require the return of additional error-related information together with the error, a new header type should be defined for the purpose of returning the error together with needed additional information. It should be documented just like any other new header type.
When a new header type is sent, the sender needs to be prepared to accept header types necessary to report associated errors.
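The effect of the default (void) switch arm can be illustrated with a small decoder. This is a sketch under stated assumptions: the numeric value given for RDMA2_ERR_VERS is a placeholder, and the version range is assumed to be encoded as two 32-bit integers (low, then high). The authoritative values and layout are those in the XDR definition. Only the general XDR rule is relied upon: a union is encoded as a 4-byte big-endian discriminant followed by the selected arm.

```python
import struct

RDMA2_ERR_VERS = 1  # placeholder value (assumption); see the XDR definition


def decode_error_body(body: bytes):
    """Decode rdma_err and any arm-specific data that follows it.

    Returns (error_code, details), where details is None for the void arm.
    """
    (err,) = struct.unpack_from(">I", body, 0)
    if err == RDMA2_ERR_VERS:
        # Assumed layout: lowest and highest supported versions.
        low, high = struct.unpack_from(">II", body, 4)
        return err, {"low": low, "high": high}
    # Default arm is void: nothing follows the code, so a receiver can
    # still report an error code introduced by an extension it lacks.
    return err, None
```

Because the default arm carries no data, an error code defined by a later extension decodes cleanly on a receiver that predates that extension.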
There are currently thirty-one flags available for later assignment. One possible use for such flags would be in a later protocol version, should that version retain the same general header structure as version 2.
In addition, it is possible to assign unused flags within extensions made to version 2, as long as the following practices are adhered to:
In addition to the substantive protocol changes discussed in Section 8, there are a number of structural XDR changes whose goal is to enable within-version protocol extensibility.
The RPC-over-RDMA version 1 transport header is defined as a single XDR object, with an RPC message proper potentially following it. In RPC-over-RDMA version 2, as described in Section 5.1, there are separate XDR definitions of the transport header prefix (see Section 3.2), which specifies the transport header type to be used, and the specific transport header, defined within one of the subsections of Section 5. This is similar to the way that an RPC message consists of an RPC header (defined in [RFC5531]) and an RPC request or reply, defined by the Upper Layer Protocol being conveyed.
As a new version of the RPC-over-RDMA transport protocol, RPC-over-RDMA version 2 exists within the versioning rules defined in [RFC8166]. In particular, it maintains the first four words of the protocol header as sent and received, as specified in Section 4.2 of [RFC8166], even though, as explained in Section 3.1 of this document, the XDR definition of those words is structured differently.
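Because the first four words are fixed across versions, a receiver can examine them before it knows which version's XDR applies. The following sketch shows one way to do so. The type and function names are hypothetical, and the fourth word is named generically because version 1 and version 2 give it different names.

```python
import struct
from typing import NamedTuple


class FixedHeader(NamedTuple):
    xid: int      # first word: transaction ID
    vers: int     # second word: transport protocol version (rdma_vers)
    credit: int   # third word: credit request or grant
    fourth: int   # fourth word: rdma_proc in version 1, header type in version 2


def parse_fixed_header(buf: bytes) -> FixedHeader:
    """Extract the four version-independent 32-bit words (XDR is big-endian)."""
    if len(buf) < 16:
        raise ValueError("transport header is shorter than four XDR words")
    return FixedHeader(*struct.unpack_from(">4I", buf, 0))
```

A receiver would typically check the vers field first and only then hand the remainder of the buffer to the decoder for that version.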
Although each of the first four words retains its semantic function, there are important differences of field interpretation, besides the fact that the words have different names and different roles within the XDR construct of which they are parts.
Beyond conforming to the restrictions specified in [RFC8166], RPC-over-RDMA version 2 tightly limits the scope of the changes made in order to ensure interoperability. It makes no major structural changes to the protocol, and all existing transport header types used in version 1 (as defined in [RFC8166]) are retained in version 2. Chunks are expressed using the same on-the-wire format and are used in the same way in both versions.
Subsequent RPC-over-RDMA versions are free to change the protocol in any way they choose as long as they maintain the first four header words as currently specified by [RFC8166].
Such changes might involve deletion or major re-organization of existing transport headers. However, the need for interoperability between adjacent versions will often limit the scope of changes that can be made in a single version.
In some cases it may prove desirable to transition to a new version by using the extension features described for use with RPC-over-RDMA version 2, by continuing the same basic extension model but allowing header types and properties that were OPTIONAL in one version to become REQUIRED in the subsequent version.
The security considerations for RPC-over-RDMA version 2 are the same as those for RPC-over-RDMA version 1.
Like other fields that appear in each RPC-over-RDMA header, property information is sent in the clear on the fabric with no integrity protection, making it vulnerable to man-in-the-middle attacks.
For example, if a man-in-the-middle were to change the value of the Receive buffer size or the Requester Remote Invalidation boolean, it could reduce connection performance or trigger loss of connection. Repeated connection loss can impact performance or even prevent a new connection from being established. Recourse is to deploy on a private network or use link-layer encryption.
This document does not require actions by IANA.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC4506] Eisler, M., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2006.
[RFC5531] Thurlow, R., "RPC: Remote Procedure Call Protocol Specification Version 2", RFC 5531, DOI 10.17487/RFC5531, May 2009.
[RFC8166] Lever, C., Simpson, W., and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call Version 1", RFC 8166, DOI 10.17487/RFC8166, June 2017.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.
The authors gratefully acknowledge the work of Brent Callaghan and Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC 5666). The authors also wish to thank Bill Baker, Greg Marsden, and Matt Benjamin for their support of this work.
The XDR extraction conventions were first described by the authors of the NFS version 4.1 XDR specification [RFC5662]. Herbert van den Bergh suggested the replacement sed script used in this document.
Special thanks go to Transport Area Director Spencer Dawkins, NFSV4 Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4 Working Group Secretary Thomas Haynes for their support.