Network File System Version 4                              C. Lever, Ed.
Internet-Draft                                                     Oracle
Intended status: Standards Track                               D. Noveck
Expires: May 20, 2020                                              NetApp
                                                        November 17, 2019

                     RPC-over-RDMA Version 2 Protocol
                 draft-ietf-nfsv4-rpcrdma-version-two-00
This document specifies the second version of a protocol that conveys Remote Procedure Call (RPC) messages on transports capable of Remote Direct Memory Access (RDMA). This version of the protocol is extensible.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 20, 2020.
Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English.
Remote Direct Memory Access (RDMA) [RFC5040] [RFC5041] [IBA] is a technique for moving data efficiently between network nodes. By directing data into destination buffers as it is sent on a network and placing it using direct memory access implemented by hardware, the complementary benefits of faster transfers and reduced host overhead are obtained.
Open Network Computing Remote Procedure Call (ONC RPC, often shortened in NFSv4 documents to RPC) [RFC5531] is a Remote Procedure Call protocol that runs over a variety of transports. Most RPC implementations today use UDP [RFC0768] or TCP [RFC0793]. On UDP, RPC messages are encapsulated inside datagrams, while on a TCP byte stream, RPC messages are delineated by a record marking protocol. An RDMA transport also conveys RPC messages in a specific fashion that must be fully described if RPC implementations are to interoperate when using RDMA to transport RPC transactions.
RDMA transports present semantics that differ from either UDP or TCP. They retain message delineations like UDP but provide reliable and sequenced data transfer like TCP. They also provide an offloaded bulk transfer service not provided by UDP or TCP. RDMA transports are therefore appropriately treated as a new transport type by RPC.
Although the RDMA transport described herein can provide relatively transparent support for any RPC application, this document also describes mechanisms that enable further optimization of data transfer, when RPC applications are structured to exploit awareness of a transport's RDMA capability. In this context, the Network File System (NFS) protocols, as described in [RFC1094], [RFC1813], [RFC7530], [RFC5661], and subsequent NFSv4 minor versions, are all potential beneficiaries of RDMA transports. A complete problem statement is presented in [RFC5532].
The RPC-over-RDMA version 1 protocol specified in [RFC8166] is deployed and in use, although there are known shortcomings to this protocol.
To address these issues in a way that enables interoperation with existing RPC-over-RDMA version 1 deployments, a second version of the RPC-over-RDMA transport protocol is presented in this document.
Version 2 of RPC-over-RDMA is extensible, enabling OPTIONAL extensions to be added without impacting existing implementations. To enable protocol extension, the XDR definition for RPC-over-RDMA version 2 is organized differently from the definition of version 1. These changes, which are discussed in Appendix C.1, do not alter the on-the-wire format.
In addition, RPC-over-RDMA version 2 contains a set of incremental changes that relieve certain performance constraints and enable recovery from abnormal corner cases. These changes are outlined in Appendix C and include a larger default inline threshold, the ability to convey a single RPC message using multiple RDMA Send operations, support for authentication of connection peers, richer error reporting, an improved credit-based flow control mechanism, and support for Remote Invalidation.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
This section highlights key elements of the RPC protocol [RFC5531] and the External Data Representation (XDR) [RFC4506] used by it. RPC-over-RDMA version 2 enables the transmission of RPC messages built using XDR and also uses XDR internally to describe its own header formats. An understanding of RPC and its use of XDR is assumed in this document.
RPCs are an abstraction used to implement the operations of an Upper-Layer Protocol (ULP). "ULP" refers to an RPC Program and Version tuple, which is a versioned set of procedure calls that comprise a single well-defined API. One example of a ULP is the Network File System Version 4.0 [RFC7530].
In this document, the term "RPC consumer" refers to an implementation of a ULP running on an RPC client.
Like a local procedure call, every RPC procedure has a set of "arguments" and a set of "results". A calling context invokes a procedure, passing arguments to it, and the procedure subsequently returns a set of results. Unlike a local procedure call, the called procedure is executed remotely rather than in the local application's execution context.
The RPC protocol as described in [RFC5531] is fundamentally a message-passing protocol between one or more clients, where RPC consumers are running, and a server, where a remote execution context is available to process RPC transactions on behalf of those consumers.
ONC RPC transactions are made up of two types of messages:
Each RPC client endpoint acts as a "Requester". It serializes the procedure's arguments and conveys them to a server endpoint via an RPC Call message. This message contains an RPC protocol header, a header describing the requested upper-layer operation, and all arguments.
An RPC server endpoint acts as a "Responder". It deserializes the arguments and processes the requested operation. It then serializes the operation's results into another byte stream. This byte stream is conveyed back to the Requester via an RPC Reply message. This message contains an RPC protocol header, a header describing the upper-layer reply, and all results.
The Requester deserializes the results and allows the RPC consumer to proceed. At this point, the RPC transaction designated by the XID in the RPC Call message is complete, and the XID is retired.
In summary, Requesters send RPC Call messages to Responders to initiate RPC transactions. Responders send RPC Reply messages to Requesters to complete the processing on an RPC transaction.
The role of an "RPC transport" is to mediate the exchange of RPC messages between Requesters and Responders. An RPC transport bridges the gap between the RPC message abstraction and the native operations of a particular network transport.
RPC-over-RDMA is a connection-oriented RPC transport. When a connection-oriented transport is used, clients initiate transport connections, while servers wait passively to accept incoming connection requests.
Most commonly, the client end of the connection acts in the role of Requester, and the server end of the connection acts as a Responder. However, RPC transactions can also be sent in the reverse direction. In this case, the server end of the connection acts as a Requester while the client end acts as a Responder.
One cannot assume that all Requesters and Responders represent data objects the same way internally. RPC uses External Data Representation (XDR) to translate native data types and serialize arguments and results [RFC4506].
The XDR protocol encodes data independently of the endianness or size of host-native data types, enabling unambiguous decoding of data by the receiver. RPC Programs are specified by writing an XDR definition of their procedures, argument data types, and result data types.
XDR assumes only that the number of bits in a byte (octet) and their order are the same on both endpoints and on the physical network. The smallest indivisible unit of XDR encoding is a group of four octets. XDR can also flatten lists, arrays, and other complex data types so they can be conveyed as a stream of bytes.
A serialized stream of bytes that is the result of XDR encoding is referred to as an "XDR stream". A sending endpoint encodes native data into an XDR stream and then transmits that stream to a receiver. A receiving endpoint decodes incoming XDR byte streams into its native data representation format.
Sometimes, a data item is to be transferred as is: without encoding or decoding. The contents of such a data item are referred to as "opaque data". XDR encoding places the content of opaque data items directly into an XDR stream without altering it in any way. ULPs or applications perform any needed data translation in this case. Examples of opaque data items include the content of files or generic byte strings.
The number of octets in a variable-length data item precedes that item in an XDR stream. If the size of an encoded data item is not a multiple of four octets, octets containing zero are added after the end of the item. This is the case so that the next encoded data item in the XDR stream always starts on a four-octet boundary. The encoded size of the item is not changed by the addition of the extra octets. These extra octets are never exposed to ULPs.
This technique is referred to as "XDR roundup", and the extra octets are referred to as "XDR roundup padding".
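As a non-normative illustration of XDR roundup, the following C sketch computes how many zero octets follow a variable-length item so that the next item starts on a four-octet boundary. The function name is chosen for this sketch only.

<CODE BEGINS>
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: number of zero octets XDR roundup adds after a
 * variable-length item.  The item's encoded length is unchanged. */
static uint32_t xdr_roundup_padding(uint32_t item_length)
{
        return (4 - (item_length & 3)) & 3;
}

int main(void)
{
        /* A 5-octet opaque item is followed by 3 padding octets. */
        printf("%u\n", xdr_roundup_padding(5));  /* prints 3 */
        printf("%u\n", xdr_roundup_padding(8));  /* prints 0 */
        return 0;
}
<CODE ENDS>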
RPC Requesters and Responders can be made more efficient if large RPC messages are transferred by a third party, such as intelligent network-interface hardware (data movement offload), and placed in the receiver's memory so that no additional adjustment of data alignment has to be made (direct data placement or "DDP"). RDMA transports enable both optimizations.
In the current document, "RDMA" refers to the physical mechanism an RDMA transport utilizes when moving data.
Typically, RPC implementations copy the contents of RPC messages into a buffer before sending them. An efficient RPC implementation sends bulk data without copying it into a separate send buffer first.
However, socket-based RPC implementations are often unable to receive data directly into its final place in memory. Receivers often need to copy incoming data to finish an RPC operation: sometimes, only to adjust data alignment.
Although it may not be efficient, before an RDMA transfer, a sender may copy data into an intermediate buffer. After an RDMA transfer, a receiver may copy that data again to its final destination. In this document, the term "DDP" refers to any optimized data transfer where it is unnecessary for a receiving host's CPU to copy transferred data to another location after it has been received.
RPC-over-RDMA version 2 enables the use of RDMA Read and Write operations to achieve both data movement offload and DDP. However, not all RDMA-based data transfer qualifies as DDP, and DDP can be achieved using non-RDMA mechanisms.
To achieve good performance during receive operations, RDMA transports require that RDMA consumers provision resources in advance in order to receive incoming messages.
An RDMA consumer might provide Receive buffers in advance by posting an RDMA Receive Work Request for every expected RDMA Send from a remote peer. These buffers are provided before the remote peer posts RDMA Send Work Requests. Thus this is often referred to as "pre-posting" buffers.
An RDMA Receive Work Request remains outstanding until hardware matches it to an inbound Send operation. The resources associated with that Receive must be retained in host memory, or "pinned", until the Receive completes.
Given these basic tenets of RDMA transport operation, the RPC-over-RDMA version 2 protocol assumes each transport provides the following abstract operations. A more complete discussion of these operations can be found in [RFC5040].
Memory registration assigns a steering tag to a region of memory, permitting the RDMA provider to perform data-transfer operations. The RPC-over-RDMA version 2 protocol assumes that each registered memory region is identified with a steering tag of no more than 32 bits and memory addresses of up to 64 bits in length.
The RDMA provider supports an RDMA Send operation, with completion signaled on the receiving peer after data has been placed in a pre-posted buffer. Sends complete at the receiver in the order they were issued at the sender. The amount of data transferred by a single RDMA Send operation is limited by the size of the remote peer's pre-posted buffers.
The RDMA provider supports an RDMA Receive operation to receive data conveyed by incoming RDMA Send operations. To reduce the amount of memory that must remain pinned awaiting incoming Sends, the amount of pre-posted memory is limited. Flow control to prevent overrunning receiver resources is provided by the RDMA consumer (in this case, the RPC-over-RDMA version 2 protocol).
The RDMA provider supports an RDMA Write operation to place data directly into a remote memory region. The local host initiates an RDMA Write, and completion is signaled there. No completion is signaled on the remote peer. The local host provides a steering tag, memory address, and the length of the remote peer's memory region.
RDMA Writes are not ordered with respect to one another, but are ordered with respect to RDMA Sends. A subsequent RDMA Send completion obtained at the write initiator guarantees that prior RDMA Write data has been successfully placed in the remote peer's memory.
The RDMA provider supports an RDMA Read operation to place peer source data directly into the read initiator's memory. The local host initiates an RDMA Read, and completion is signaled there. No completion is signaled on the remote peer. The local host provides steering tags, memory addresses, and a length for the remote source and local destination memory region.
The local host signals Read completion to the remote peer as part of a subsequent RDMA Send message. The remote peer can then invalidate steering tags and subsequently free associated source memory regions.
A "transfer model" designates which endpoint exposes its memory and which is responsible for initiating the transfer of data. To enable RDMA Read and Write operations, for example, an endpoint first exposes regions of its memory to a remote endpoint, which initiates these operations against the exposed memory.
In RPC-over-RDMA version 2, Requesters expose their memory to the Responder, but the Responder does not expose its memory. The Responder pulls RPC arguments or whole RPC calls from each Requester. The Responder pushes RPC results or whole RPC replies to each Requester.
Each RPC-over-RDMA version 2 message consists of at most two XDR streams: a Transport stream, which conveys the RPC-over-RDMA transport header, and a Payload stream, which conveys an RPC message payload.
In its simplest form, an RPC-over-RDMA version 2 message conveying an RPC message payload consists of a Transport stream followed immediately by a Payload stream transmitted together via a single RDMA Send.
RPC-over-RDMA framing replaces all other RPC framing (such as TCP record marking) when used atop an RPC-over-RDMA association, even when the underlying RDMA protocol may itself be layered atop a transport with a defined RPC framing (such as TCP).
However, it is possible for RPC-over-RDMA to be dynamically enabled on a connection in the course of negotiating the use of RDMA via a ULP exchange. Because RPC framing delimits an entire RPC request or reply, the resulting shift in framing must occur between distinct RPC messages, and in concert with the underlying transport.
The longevity of an RDMA connection mandates that sending endpoints respect the resource limits of peer receivers. To ensure messages can be sent and received reliably, there are two operational parameters for each connection. It is critical to provide RDMA Send flow control for an RDMA connection. If any pre-posted Receive buffer on the connection is not large enough to accept an incoming RDMA Send, or if a pre-posted Receive buffer is not available to accept an incoming RDMA Send, the RDMA connection can be terminated.
Because RPC-over-RDMA requires reliable and in-order delivery of data payloads, RPC-over-RDMA transports MUST use the RDMA RC (Reliable Connected) Queue Pair (QP) type, which ensures in-transit data integrity and handles recovery from packet loss or misordering.
However, RPC-over-RDMA transports provide their own flow control mechanism to prevent a sender from overwhelming receiver resources. RPC-over-RDMA transports employ an end-to-end credit-based flow control mechanism for this purpose [CBFC]. Credit-based flow control was chosen because it is relatively simple, operates robustly in the face of bursty traffic, automates the management of receive buffer allocation, and provides excellent buffer utilization.
An RPC-over-RDMA version 2 credit is the capability to receive one RPC-over-RDMA version 2 message. This enables RPC-over-RDMA version 2 to support asymmetrical operation, where a message in one direction might be matched by zero, one, or multiple messages in the other direction.
To achieve this, credits are assigned to each connection peer's posted Receive buffers. Each Requester has a set of Receive credits, and each Responder has a set of Receive credits. These credit values are managed independently of one another.
Section 7 of [RFC8166] requires that the 32-bit field containing the credit grant is the third word in the transport header. To conform with that requirement, the two independent credit values are encoded into a single 32-bit field in the fixed portion of the transport header. After the field is XDR decoded, the receiver takes the low-order two bytes as the number of credits that are newly granted by the sender, and the high-order two bytes as the maximum number of credits that can be outstanding at the sender.
In this approach, then, there are requester credits, sent in messages from the requester to the responder; and responder credits, sent in messages from the responder to the requester.
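A minimal, non-normative sketch in C of how a receiver might split the credit word described above into its two independent values, assuming the 32-bit field has already been XDR decoded into host byte order. The struct and function names are assumptions of this sketch.

<CODE BEGINS>
#include <stdint.h>

/* Illustrative decode of the credit word after XDR decoding:
 *   low-order 16 bits:  credits newly granted by the sender
 *   high-order 16 bits: maximum credits outstanding at the sender */
struct credit_values {
        uint16_t granted;          /* newly granted to the receiver */
        uint16_t max_outstanding;  /* limit advertised by the sender */
};

static struct credit_values decode_credits(uint32_t rdma_credit)
{
        struct credit_values cv;

        cv.granted = (uint16_t)(rdma_credit & 0xffff);
        cv.max_outstanding = (uint16_t)(rdma_credit >> 16);
        return cv;
}

static uint32_t encode_credits(uint16_t granted, uint16_t max_outstanding)
{
        return ((uint32_t)max_outstanding << 16) | granted;
}
<CODE ENDS>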
A sender MUST NOT send RDMA messages in excess of the receiver's granted credit limit. If the granted value is exceeded, the RDMA layer may signal an error, possibly terminating the connection. The granted value MUST NOT be zero, since such a value would result in deadlock.
The granted credit values MAY be adjusted to match the needs or policies in effect on either peer. For instance, a peer may reduce its granted credit value to accommodate the available resources in a Shared Receive Queue.
Certain RDMA implementations may impose additional flow-control restrictions, such as limits on RDMA Read operations in progress at the Responder. Accommodation of such restrictions is considered the responsibility of each RPC-over-RDMA version 2 implementation.
A protocol convention is provided to enable one peer to refresh its credit grant to the other peer without sending a data payload. Messages of this type can also act as a keep-alive ping. See Section 6.4.2 for information about this convention.
To prevent transport deadlock, receivers MUST always be in a position to receive one such credit grant update message, in addition to payload-bearing messages. One way a receiver can do this is to post one extra Receive more than the credit value it granted.
An "inline threshold" value is the largest message size (in octets) that can be conveyed in one direction between peer implementations using RDMA Send and Receive operations. The inline threshold value is effectively the smaller of the largest number of bytes the sender can post via a single RDMA Send operation and the largest number of bytes the receiver can accept via a single RDMA Receive operation. Each connection has two inline threshold values: one for messages flowing from Requester-to-Responder, referred to as the "call inline threshold", and one for messages flowing from Responder-to-Requester, referred to as the "reply inline threshold". Inline threshold values can be advertised to peers via Transport Properties.
Receiver implementations MUST support inline thresholds of 4096 bytes. In the absence of an exchange of Transport Properties, senders and receivers MUST assume both connection inline thresholds are 4096 bytes.
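As a non-normative illustration, the usable inline threshold in one direction can be thought of as the minimum of what the sender can Send and what the receiver can Receive. The sketch below assumes that these values come from exchanged Transport Properties and that a value of zero means "not advertised", in which case the 4096-byte default applies; these conventions are assumptions of the sketch.

<CODE BEGINS>
#include <stdint.h>

#define RPCRDMA2_DEF_INLINE 4096  /* default inline threshold, in octets */

/* Illustrative only: the effective inline threshold in one direction
 * is the smaller of the sender's maximum Send size and the receiver's
 * Receive buffer size. */
static uint32_t effective_inline_threshold(uint32_t sender_max_send,
                                           uint32_t receiver_buf_size)
{
        uint32_t send = sender_max_send ? sender_max_send
                                        : RPCRDMA2_DEF_INLINE;
        uint32_t recv = receiver_buf_size ? receiver_buf_size
                                          : RPCRDMA2_DEF_INLINE;

        return send < recv ? send : recv;
}
<CODE ENDS>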
When an RPC-over-RDMA version 2 client establishes a connection to a server, its first order of business is to determine the server's highest supported protocol version.
Upon connection establishment a client MUST NOT send more than a single RPC-over-RDMA message at a time until it receives a valid non-error RPC-over-RDMA message from the server that grants client credits.
The second word of each transport header is used to convey the transport protocol version. In the interest of simplicity, we refer to that word as rdma_vers even though in the RPC-over-RDMA version 2 XDR definition it is described as rdma_start.rdma_vers.
First, the client sends a single valid RPC-over-RDMA message with the value two (2) in the rdma_vers field. Because the server might support only RPC-over-RDMA version 1, this initial message MUST NOT be larger than the version 1 default inline threshold of 1024 bytes.
If the server does support RPC-over-RDMA version 2, it sends RPC-over-RDMA messages back to the client with the value two (2) in the rdma_vers field. Both peers may use the default inline threshold value for RPC-over-RDMA version 2 connections (4096 bytes).
If the server does not support RPC-over-RDMA version 2, it MUST send an RPC-over-RDMA message to the client with the same XID, with RDMA2_ERROR in the rdma_start.rdma_htype field, and with the error code RDMA2_ERR_VERS. This message also reports a range of protocol versions that the server supports. To continue operation, the client selects a protocol version in the range of server-supported versions for subsequent messages on this connection.
If the connection is lost immediately after an RDMA2_ERROR / RDMA2_ERR_VERS message is received, a client can avoid a possible version negotiation loop when re-establishing another connection by assuming that particular server does not support RPC-over-RDMA version 2. A client can assume the same situation (no server support for RPC-over-RDMA version 2) if the initial negotiation message is lost or dropped. Once the negotiation exchange is complete, both peers may use the default inline threshold value for the transport protocol version that has been selected.
If the server supports the RPC-over-RDMA protocol version used in the first RPC-over-RDMA message received from a client, it MUST use that protocol version in all subsequent messages it sends on that connection. The client MUST NOT change the protocol version for the duration of the connection.
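The following non-normative sketch illustrates the client-side version selection described above, assuming that an RDMA2_ERR_VERS response has already been decoded into the version range shown and that this client implements versions 1 and 2. All names are chosen for this sketch only.

<CODE BEGINS>
#include <stdint.h>
#include <stdbool.h>

/* Version range reported by the server in RDMA2_ERR_VERS. */
struct version_range {
        uint32_t vers_low;
        uint32_t vers_high;
};

/* Returns the protocol version to use for the remainder of the
 * connection, or 0 if no mutually supported version exists. */
static uint32_t select_protocol_version(bool got_err_vers,
                                        const struct version_range *range)
{
        if (!got_err_vers)
                return 2;  /* server answered with rdma_vers == 2 */

        /* Prefer the highest version both peers support. */
        if (range->vers_low <= 2 && range->vers_high >= 2)
                return 2;
        if (range->vers_low <= 1 && range->vers_high >= 1)
                return 1;
        return 0;  /* no overlap: give up on this transport */
}
<CODE ENDS>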
When a DDP capability is available, the transport places the contents of one or more XDR data items directly into the receiver's memory, separately from the transfer of other parts of the containing XDR stream.
RPC-over-RDMA version 2 provides a mechanism for moving part of an RPC message via a data transfer distinct from an RDMA Send/Receive pair. The sender removes one or more XDR data items from the Payload stream. These items are conveyed via other mechanisms, such as one or more RDMA Read or Write operations. As the receiver decodes an incoming message, it skips over directly placed data items.
The portion of an XDR stream that is split out and moved separately is referred to as a "chunk". In some contexts, data in an RPC-over-RDMA header that describes these split out regions of memory may also be referred to as a "chunk".
A Payload stream after chunks have been removed is referred to as a "reduced" Payload stream. Likewise, a data item that has been removed from a Payload stream to be transferred separately is referred to as a "reduced" data item.
Not all XDR data items benefit from DDP. For example, small data items or data items that require XDR unmarshaling by the receiver do not benefit from DDP. In addition, it is impractical for receivers to prepare for every possible XDR data item in a protocol to be transferred in a chunk.
To maintain practical interoperability on an RPC-over-RDMA transport, a determination must be made of which few XDR data items in each ULP are allowed to use DDP.
This is done in additional specifications that describe how ULPs employ DDP. A "ULB specification" identifies which specific individual XDR data items in a ULP MAY be transferred via DDP. Such data items are referred to as "DDP-eligible". All other XDR data items MUST NOT be reduced. Detailed requirements for ULBs are provided in Appendix A.
When encoding a Payload stream that contains a DDP-eligible data item, a sender may choose to reduce that data item. When it chooses to do so, the sender does not place the item into the Payload stream. Instead, the sender records in the RPC-over-RDMA Transport header the location and size of the memory region containing that data item.
The Requester provides location information for DDP-eligible data items in both RPC Call and Reply messages. The Responder uses this information to retrieve arguments contained in the specified region of the Requester's memory or place results in that memory region.
An "RDMA segment", or "plain segment", is an RPC-over-RDMA Transport header data object that contains the precise coordinates of a contiguous memory region that is to be conveyed separately from the Payload stream. Plain segments contain the following information: [RFC5040] for further discussion.
See
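For illustration only, a plain segment might be represented in C as follows. The field names follow the version 1 convention and are an assumption of this sketch, not a normative definition.

<CODE BEGINS>
#include <stdint.h>

/* Illustrative representation of a plain segment: a steering tag of at
 * most 32 bits, a length in octets, and a memory address of up to
 * 64 bits. */
struct plain_segment {
        uint32_t rdma_handle;  /* steering tag of the registered region */
        uint32_t rdma_length;  /* length of the region, in octets */
        uint64_t rdma_offset;  /* address of the region's first octet */
};
<CODE ENDS>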
In RPC-over-RDMA version 2, a "chunk" refers to a portion of the Payload stream that is moved independently of the RPC-over-RDMA Transport header and Payload stream. Chunk data is removed from the sender's Payload stream, transferred via separate operations, and then reinserted into the receiver's Payload stream to form a complete RPC message.
Each chunk is comprised of RDMA segments. Each RDMA segment represents a single contiguous piece of that chunk. A Requester MAY divide a chunk into RDMA segments using any boundaries that are convenient. The length of a chunk is exactly the sum of the lengths of the RDMA segments that comprise it.
The RPC-over-RDMA version 2 transport protocol does not place a limit on chunk size. However, each ULP may cap the amount of data that can be transferred by a single RPC transaction. For example, NFS has "rsize" and "wsize", which restrict the payload size of NFS READ and WRITE operations. The Responder can use such limits to sanity check chunk sizes before using them in RDMA operations.
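A non-normative sketch of the chunk-length rule and the sanity check described above: a chunk's length is exactly the sum of its segment lengths, and a responder might reject a chunk whose total exceeds the ULP's limit before issuing any RDMA operations. The struct and function names are assumptions of this sketch.

<CODE BEGINS>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct plain_segment {
        uint32_t rdma_handle;
        uint32_t rdma_length;
        uint64_t rdma_offset;
};

/* Illustrative only: sum the segment lengths and compare the total
 * against a ULP-imposed cap (for example, the NFS "rsize" or "wsize"). */
static bool chunk_length_ok(const struct plain_segment *segs, size_t nsegs,
                            uint64_t ulp_limit, uint64_t *total_out)
{
        uint64_t total = 0;
        size_t i;

        for (i = 0; i < nsegs; i++)
                total += segs[i].rdma_length;

        *total_out = total;
        return total <= ulp_limit;
}
<CODE ENDS>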
If a chunk contains a counted array data type, the count of array elements MUST remain in the Payload stream, while the array elements MUST be moved to the chunk. For example, when encoding an opaque byte array as a chunk, the count of bytes stays in the Payload stream, while the bytes in the array are removed from the Payload stream and transferred within the chunk.
Individual array elements appear in a chunk in their entirety. For example, when encoding an array of arrays as a chunk, the count of items in the enclosing array stays in the Payload stream, but each enclosed array, including its item count, is transferred as part of the chunk.
If a chunk contains an optional-data data type, the "is present" field MUST remain in the Payload stream, while the data, if present, MUST be moved to the chunk.
A union data type MUST NOT be made DDP-eligible, but one or more of its arms MAY be DDP-eligible, subject to the other requirements in this section.
Except in special cases (covered in Section 4.5.4), a chunk MUST contain exactly one XDR data item. This makes it straightforward to reduce variable-length data items without affecting the XDR alignment of data items in the Payload stream.
When a variable-length XDR data item is reduced, the sender MUST remove XDR roundup padding for that data item from the Payload stream so that data items remaining in the Payload stream begin on four-byte alignment.
A "Read chunk" represents an XDR data item that is to be pulled from the Requester to the Responder. A Read chunk is a list of one or more RDMA read segments. Each RDMA read segment consists of a Position field followed by a plain segment.
While constructing an RPC Call message, a Requester registers memory regions that contain data to be transferred via RDMA Read operations. It advertises the coordinates of these regions in the RPC-over-RDMA Transport header of the RPC Call message.
After receiving an RPC Call message sent via an RDMA Send operation, a Responder transfers the chunk data from the Requester using RDMA Read operations. The Responder reconstructs the transferred chunk data by concatenating the contents of each RDMA segment in list order into the received Payload stream at the Position value recorded in that RDMA segment.
Put another way, the Responder inserts the first RDMA segment in a Read chunk into the Payload stream at the byte offset indicated by its Position field. RDMA segments whose Position field value matches this offset are concatenated afterwards, until there are no more RDMA segments at that Position value.
The Position field in a read segment indicates where the containing Read chunk starts in the Payload stream. The value in this field MUST be a multiple of four. All segments in the same Read chunk share the same Position value, even if one or more of the RDMA segments have a non-four-byte-aligned length.
While decoding a received Payload stream, whenever the XDR offset in the Payload stream matches that of a Read chunk, the Responder initiates an RDMA Read to pull the chunk's data content into registered local memory.
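The following non-normative sketch illustrates how a responder might place the pulled contents of a Read chunk into its copy of the Payload stream: segments sharing a Position are concatenated in list order starting at that offset. It assumes the chunk data has already been transferred via RDMA Read into local memory; buffer management and bounds checking are omitted, and all names are assumptions of this sketch.

<CODE BEGINS>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct read_segment_data {
        uint32_t position;     /* XDR offset shared by the whole chunk */
        const uint8_t *data;   /* bytes pulled from the requester */
        uint32_t length;
};

/* Returns the stream offset at which decoding resumes. */
static size_t place_read_chunk(uint8_t *payload,
                               const struct read_segment_data *segs,
                               size_t nsegs)
{
        size_t offset;
        size_t i;

        if (nsegs == 0)
                return 0;

        offset = segs[0].position;
        for (i = 0; i < nsegs; i++) {
                memcpy(payload + offset, segs[i].data, segs[i].length);
                offset += segs[i].length;
        }
        return offset;
}
<CODE ENDS>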
The Responder acknowledges its completion of use of Read chunk source buffers when it sends an RPC Reply message to the Requester. The Requester may then release Read chunks advertised in the request.
When reducing a variable-length argument data item, the Requester MUST NOT include the data item's XDR roundup padding in the chunk itself. The chunk's total length MUST be the same as the encoded length of the data item.
While constructing an RPC Call message, a Requester prepares memory regions in which to receive DDP-eligible result data items. A "Write chunk" represents an XDR data item that is to be pushed from a Responder to a Requester. It is made up of an array of zero or more plain segments.
Write chunks are provisioned by a Requester long before the Responder has prepared the reply Payload stream. A Requester often does not know the actual length of the result data items to be returned, since the result does not yet exist. Thus, it MUST register Write chunks long enough to accommodate the maximum possible size of each returned data item.
In addition, the XDR position of DDP-eligible data items in the reply's Payload stream is not predictable when a Requester constructs an RPC Call message. Therefore, RDMA segments in a Write chunk do not have a Position field.
For each Write chunk provided by a Requester, the Responder pushes one data item to the Requester, filling the chunk contiguously and in segment array order until that data item has been completely written to the Requester. The Responder MUST copy the segment count and all segments from the Requester-provided Write chunk into the RPC Reply message's Transport header. As it does so, the Responder updates each segment length field to reflect the actual amount of data that is being returned in that segment. The Responder then sends the RPC Reply message via an RDMA Send operation.
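A non-normative sketch of the Write chunk handling described above: the responder fills the requester-provided segments contiguously in array order and records the actual number of octets placed in each segment of the copy returned in the Reply's transport header. The RDMA Write itself is represented by a placeholder function, and all names are assumptions of this sketch.

<CODE BEGINS>
#include <stdint.h>
#include <stddef.h>

struct plain_segment {
        uint32_t rdma_handle;
        uint32_t rdma_length;
        uint64_t rdma_offset;
};

/* Placeholder standing in for posting an RDMA Write Work Request. */
extern int post_rdma_write(uint32_t handle, uint64_t offset,
                           const uint8_t *src, uint32_t len);

static int fill_write_chunk(struct plain_segment *reply_segs, size_t nsegs,
                            const uint8_t *result, uint32_t result_len)
{
        uint32_t remaining = result_len;
        size_t i;

        for (i = 0; i < nsegs; i++) {
                uint32_t n = reply_segs[i].rdma_length;

                if (n > remaining)
                        n = remaining;
                if (n != 0 &&
                    post_rdma_write(reply_segs[i].rdma_handle,
                                    reply_segs[i].rdma_offset, result, n))
                        return -1;

                reply_segs[i].rdma_length = n;  /* actual octets returned */
                result += n;
                remaining -= n;
        }
        return remaining == 0 ? 0 : -1;  /* -1: chunk was too small */
}
<CODE ENDS>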
An "empty Write chunk" is a Write chunk with a zero segment count. By definition, the length of an empty Write chunk is zero. An "unused Write chunk" has a non-zero segment count, but all of its segments are empty segments.
After receiving the RPC Reply message, the Requester reconstructs the transferred data by concatenating the contents of each segment in array order into the RPC Reply message's XDR stream at the known XDR position of the associated DDP-eligible result data item.
When provisioning a Write chunk for a variable-length result data item, the Requester MUST NOT include additional space for XDR roundup padding. A Responder MUST NOT write XDR roundup padding into a Write chunk, even if the result is shorter than the available space in the chunk. Therefore, when returning a single variable-length result data item, a returned Write chunk's total length MUST be the same as the encoded length of the result data item.
A receiver of RDMA Send operations is required to have previously posted one or more adequately sized buffers. Memory savings are achieved on both Requesters and Responders by posting small Receive buffers. However, not all RPC messages are small. RPC-over-RDMA version 2 provides several mechanisms that enable RPC message payloads of any size to be conveyed efficiently.
RPC message payloads are often smaller than typical inline thresholds. For example, an NFS version 3 GETATTR operation is only 56 octets: 20 octets of RPC header, a 32-octet file handle argument, and 4 octets for its length. The reply to this common request is about 100 octets.
Since all RPC messages conveyed via RPC-over-RDMA version 2 require at least one RDMA Send operation, the most efficient way to send an RPC message that is smaller than the inline threshold is to append the Payload stream directly to the Transport stream. An RPC-over-RDMA header with a small RPC Call or Reply message immediately following is transferred using a single RDMA Send operation. No other operations are needed.
An RPC-over-RDMA transaction using a Short Message:

      Requester                              Responder
          |        RDMA Send (RDMA_MSG)          |
    Call  |   ------------------------------>    |
          |                                      |
          |                                      |  Processing
          |                                      |
          |        RDMA Send (RDMA_MSG)          |
          |   <------------------------------    |  Reply
If an RPC message is larger than the inline threshold, the sender can choose to split that message over multiple RPC-over-RDMA messages. The Payload stream of each RPC-over-RDMA message contains a part of the RPC message. The receiver reconstitutes the RPC message by concatenating the Payload streams of the sequence of RPC-over-RDMA messages together.
Though the purpose of a Continued Message is to handle large RPC messages, senders MAY use a Continued Message at any time to convey an RPC message, and MAY split the RPC message payload on any convenient boundary.
An RPC-over-RDMA transaction using a Continued Message:

      Requester                              Responder
          |        RDMA Send (RDMA_MSG)          |
    Call  |   ------------------------------>    |
          |        RDMA Send (RDMA_MSG)          |
          |   ------------------------------>    |
          |        RDMA Send (RDMA_MSG)          |
          |   ------------------------------>    |
          |                                      |
          |                                      |  Processing
          |                                      |
          |        RDMA Send (RDMA_MSG)          |
          |   <------------------------------    |  Reply
If DDP-eligible data items are present in a Payload stream, a sender MAY reduce some or all of these items by removing them from the Payload stream. The sender then uses a separate mechanism to transfer the reduced data items. The Transport stream with the reduced Payload stream immediately following is then transferred using a single RDMA Send operation.
After receiving the Transport and Payload streams of an RPC Call message accompanied by Read chunks, the Responder uses RDMA Read operations to move reduced data items in Read chunks. Before sending the Transport and Payload streams of an RPC Reply message containing Write chunks, the Responder uses RDMA Write operations to move reduced data items in Write and Reply chunks.
An RPC-over-RDMA transaction with a Read chunk:

      Requester                              Responder
          |        RDMA Send (RDMA_MSG)          |
    Call  |   ------------------------------>    |
          |        RDMA Read                     |
          |   <------------------------------    |
          |        RDMA Response (arg data)      |
          |   ------------------------------>    |
          |                                      |
          |                                      |  Processing
          |                                      |
          |        RDMA Send (RDMA_MSG)          |
          |   <------------------------------    |  Reply
An RPC-over-RDMA transaction with a Write chunk:

      Requester                              Responder
          |        RDMA Send (RDMA_MSG)          |
    Call  |   ------------------------------>    |
          |                                      |
          |                                      |  Processing
          |                                      |
          |        RDMA Write (result data)      |
          |   <------------------------------    |
          |        RDMA Send (RDMA_MSG)          |
          |   <------------------------------    |  Reply
Chunking and Message Continuation can be combined. After reduction, the sender MAY split the reduced RPC message into multiple Payload streams and then send it via a Continued Message.
When a Payload stream is larger than the receiver's inline threshold, the Payload stream is reduced by removing DDP-eligible data items and placing them in chunks to be moved separately. If there are no DDP-eligible data items in the Payload stream, or the Payload stream is still too large after it has been reduced, the sender uses either Message Continuation, or it can use RDMA Read or Write operations to convey the entire RPC message. The latter mechanism is referred to as a "Long Message".
To transmit a Long Message, the sender conveys only the Transport stream with an RDMA Send operation. The Payload stream is not included in the Send buffer in this instance. Instead, the Requester provides chunks that the Responder uses to move the Payload stream.
Though the purpose of a Long Message is to handle large RPC messages, Requesters MAY use a Long Message at any time to convey an RPC Call message.
A Responder chooses which form of reply to use based on the chunks provided by the Requester. If Write chunks were provided and the Responder has a DDP-eligible result, it first reduces the reply Payload stream. If a Reply chunk was provided and the reduced Payload stream is larger than the reply inline threshold, the Responder MUST use the Requester-provided Reply chunk for the reply.
XDR data items may appear in these special chunks without regard to their DDP-eligibility. As these chunks contain a Payload stream, such chunks MUST include appropriate XDR roundup padding to maintain proper XDR alignment of their contents.
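The following non-normative sketch illustrates how a responder might choose among the reply forms described above, once any DDP-eligible results have been reduced into provided Write chunks. The enum and parameter names are assumptions of this sketch.

<CODE BEGINS>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative decision logic only.  "reduced_len" is the length of
 * the reply Payload stream after reduction. */
enum reply_form {
        REPLY_SHORT,      /* payload appended to the Send (RDMA2_MSG) */
        REPLY_LONG,       /* payload written into the Reply chunk     */
        REPLY_CONTINUED,  /* payload split across several Sends       */
};

static enum reply_form choose_reply_form(uint32_t reduced_len,
                                         uint32_t reply_inline_threshold,
                                         bool reply_chunk_provided)
{
        if (reduced_len <= reply_inline_threshold)
                return REPLY_SHORT;
        if (reply_chunk_provided)
                return REPLY_LONG;
        return REPLY_CONTINUED;
}
<CODE ENDS>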
An RPC-over-RDMA transaction using a Long Call:

      Requester                              Responder
          |        RDMA Send (RDMA_NOMSG)        |
    Call  |   ------------------------------>    |
          |        RDMA Read                     |
          |   <------------------------------    |
          |        RDMA Response (RPC call)      |
          |   ------------------------------>    |
          |                                      |
          |                                      |  Processing
          |                                      |
          |        RDMA Send (RDMA_MSG)          |
          |   <------------------------------    |  Reply
An RPC-over-RDMA transaction using a Long Reply:

      Requester                              Responder
          |        RDMA Send (RDMA_MSG)          |
    Call  |   ------------------------------>    |
          |                                      |
          |                                      |  Processing
          |                                      |
          |        RDMA Write (RPC reply)        |
          |   <------------------------------    |
          |        RDMA Send (RDMA_NOMSG)        |
          |   <------------------------------    |  Reply
RPC-over-RDMA version 2 provides a mechanism for connection endpoints to communicate information about implementation properties, enabling compatible endpoints to optimize data transfer. Initially only a small set of transport properties are defined and a single operation is provided to exchange transport properties (see Section 6.4.4).
Both the set of transport properties and the operations used to communicate them may be extended. Within RPC-over-RDMA version 2, all such extensions are OPTIONAL. For information about existing transport properties, see Sections 5.1 through 5.2. For discussion of extensions to the set of transport properties, see Appendix B.3.
A basic set of receiver and sender properties is specified in this document. An extensible approach is used, allowing new properties to be defined in future Standards Track documents.
Such properties are specified using:
<CODE BEGINS>
typedef uint32 rpcrdma2_propid;

struct rpcrdma2_propval {
        rpcrdma2_propid rdma_which;
        opaque          rdma_data<>;
};

typedef rpcrdma2_propval rpcrdma2_propset<>;
typedef uint32           rpcrdma2_propsubset<>;
<CODE ENDS>
The following XDR types are used by operations that deal with transport properties:
An rpcrdma2_propid specifies a particular transport property. In order to facilitate XDR extension of the set of properties by concatenating XDR definition files, specific properties are defined as const values rather than as elements in an enum.
An rpcrdma2_propval specifies a value of a particular transport property with the particular property identified by rdma_which, while the associated value of that property is contained within rdma_data.
An rdma_data field of zero length is interpreted as indicating the default value of the property indicated by rdma_which.
While rdma_data is defined as opaque within the XDR, the contents are interpreted (except when of length zero) using the XDR typedef associated with the property specified by rdma_which. As a result, when an rpcrdma2_propval does not conform to that typedef, the receiver is REQUIRED to return the error RDMA2_ERR_BAD_XDR using the header type RDMA2_ERROR, as described in Section 6.4.3. For example, the receiver of a message containing an rpcrdma2_propval returns this error if the length of rdma_data is such that it extends beyond the bounds of the message being transferred.
In cases in which the rpcrdma2_propid specified by rdma_which is understood by the receiver, the receiver also MUST report the error RDMA2_ERR_BAD_XDR if either of the following occurs:
Note that no error is to be reported if rdma_which is unknown to the receiver. In that case, that rpcrdma2_propval is not processed and processing continues using the next rpcrdma2_propval, if any.
A rpcrdma2_propset specifies a set of transport properties. No particular ordering of the rpcrdma2_propval items within it is imposed.
A rpcrdma2_propsubset identifies a subset of the properties in a previously specified rpcrdma2_propset. Each bit in the mask denotes a particular element in a previously specified rpcrdma2_propset. If a particular rpcrdma2_propval is at position N in the array, then bit number N mod 32 in word N div 32 specifies whether that particular rpcrdma2_propval is included in the defined subset. Words beyond the last one specified are treated as containing zero.
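A minimal, non-normative sketch in C of the bit-indexing rule just described: the property at position N is included in the subset when bit (N mod 32) of word (N div 32) is set, with words beyond the end of the mask treated as zero. The function name is an assumption of this sketch.

<CODE BEGINS>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative test of whether property number "n" of a previously
 * specified rpcrdma2_propset is included in a rpcrdma2_propsubset. */
static bool propsubset_includes(const uint32_t *mask, size_t mask_words,
                                size_t n)
{
        size_t word = n / 32;

        if (word >= mask_words)
                return false;   /* missing words are treated as zero */
        return (mask[word] >> (n % 32)) & 1;
}
<CODE ENDS>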
Although the set of transport properties may be extended, a basic set of transport properties is defined in Table 1.
In that table, the columns contain the following information:
   Property                   | Code | XDR type | Default | Sec
   ---------------------------+------+----------+---------+-------
   Maximum Send Size          |    1 | uint32   |    4096 | 5.2.1
   Receive Buffer Size        |    2 | uint32   |    4096 | 5.2.2
   Maximum RDMA Segment Size  |    3 | uint32   | 1048576 | 5.2.3
   Maximum RDMA Segment Count |    4 | uint32   |      16 | 5.2.4
   Reverse Request Support    |    5 | uint32   |       1 | 5.2.5
   Host Auth Message          |    6 | opaque<> |     N/A | 5.2.6
<CODE BEGINS>
const uint32 RDMA2_PROPID_SBSIZ = 1;
typedef uint32 rpcrdma2_prop_sbsiz;
<CODE ENDS>
The Maximum Send Size specifies the maximum size, in octets, of Send payloads. The endpoint sending this value ensures that it will not transmit a Send WR payload larger than this size, allowing the endpoint receiving this value to size its Receive buffers appropriately.
<CODE BEGINS>
const uint32 RDMA2_PROPID_RBSIZ = 2;
typedef uint32 rpcrdma2_prop_rbsiz;
<CODE ENDS>
The Receive Buffer Size specifies the minimum size, in octets, of pre-posted receive buffers. It is the responsibility of the endpoint sending this value to ensure that its pre-posted receive buffers are at least the size specified, allowing the endpoint receiving this value to send messages that are of this size.
A sender may use its knowledge of the receiver's buffer size to determine when a message to be sent will fit in the Receive buffers that the receiver has pre-posted.
<CODE BEGINS>
const uint32 RDMA2_PROPID_RSSIZ = 3;
typedef uint32 rpcrdma2_prop_rssiz;
<CODE ENDS>
The Maximum RDMA Segment Size specifies the maximum size, in octets, of an RDMA segment this endpoint is prepared to send or receive.
<CODE BEGINS>
const uint32 RDMA2_PROPID_RCSIZ = 4;
typedef uint32 rpcrdma2_prop_rcsiz;
<CODE ENDS>
The Maximum RDMA Segment Count specifies the maximum number of RDMA segments that can appear in a requester's transport header.
<CODE BEGINS>
const uint32 RDMA_RVREQSUP_NONE   = 0;
const uint32 RDMA_RVREQSUP_INLINE = 1;
const uint32 RDMA_RVREQSUP_GENL   = 2;

const uint32 RDMA2_PROPID_BRS = 5;
typedef uint32 rpcrdma2_prop_brs;
<CODE ENDS>
The value of this property is used to indicate a client implementation's readiness to accept and process messages that are part of reverse direction RPC requests.
Multiple levels of support are distinguished: RDMA_RVREQSUP_NONE indicates that reverse direction requests are not accepted; RDMA_RVREQSUP_INLINE indicates that reverse direction requests and their replies are accepted only when sent inline (without Read chunks, Write chunks, or Reply chunks); and RDMA_RVREQSUP_GENL indicates general support for reverse direction requests, including those that use chunks.
When information about this property is not provided, the support level of servers can be inferred from the reverse direction requests that they issue, assuming that issuing a request implicitly indicates support for receiving the corresponding reply. On this basis, support for receiving inline replies can be assumed when requests without Read chunks, Write chunks, or Reply chunks are issued, while requests with any of these elements allow the client to assume that general support for reverse direction replies is present on the server.
<CODE BEGINS>
const uint32 RDMA2_PROPID_HOSTAUTH = 6;
typedef opaque rpcrdma2_prop_hostauth<>;
<CODE ENDS>
The value of this transport property is used as part of an exchange of host authentication material. This property can accommodate authentication handshakes that require multiple challenge-response interactions, and potentially large amounts of material.
When this property is not provided, the peer(s) remain unauthenticated. Local security policy on each peer determines whether the connection is permitted to continue.
Each transport message consists of multiple sections: a transport header prefix containing fields common to all header types, a portion specific to the message's header type, and, for some header types, an associated RPC message payload.
This organization differs from that presented in the definition of RPC-over-RDMA version 1 [RFC8166], which presented the first and second of the items above as a single XDR item. The new organization is more in keeping with RPC-over-RDMA version 2's extensibility model in that new header types can be defined without modifying the existing set of header types.
The new header types within RPC-over-RDMA version 2 are set forth in Table 2. In that table, the columns contain the following information:
   Operation                        | Code | XDR type          | Msg | Sec
   ---------------------------------+------+-------------------+-----+-------
   Convey Appended RPC Message      |    0 | rpcrdma2_msg      | Yes | 6.4.1
   Convey External RPC Message      |    1 | rpcrdma2_nomsg    | No  | 6.4.2
   Report Transport Error           |    4 | rpcrdma2_err      | No  | 6.4.3
   Specify Properties at Connection |    5 | rpcrdma2_connprop | No  | 6.4.4
Support for the operations in Table 2 is REQUIRED. Support for additional operations will be OPTIONAL. RPC-over-RDMA version 2 implementations that receive an OPTIONAL operation that they do not support MUST respond with an RDMA2_ERROR message with an error code of RDMA2_ERR_INVAL_HTYPE.
Most RPC-over-RDMA version 2 data structures are derived from corresponding structures in RPC-over-RDMA version 1. As is typical for new versions of an existing protocol, the XDR data structures have new names and there are a few small changes in content. In some cases, there have been structural re-organizations to enable protocol extensibility.
<CODE BEGINS>
struct rpcrdma_common {
        uint32 rdma_xid;
        uint32 rdma_vers;
        uint32 rdma_credit;
        uint32 rdma_htype;
};
<CODE ENDS>
The rpcrdma_common prefix describes the first part of each RPC-over-RDMA transport header for version 2 and subsequent versions.
RPC-over-RDMA version 2's use of these first four words matches that of version 1, as required by [RFC8166]. However, there are important structural differences in the way that these words are described by the respective XDR descriptions.
These changes are part of a larger structural change in the XDR description of RPC-over-RDMA version 2 that enables a cleaner treatment of protocol extension. The XDR appearing in Section 7 reflects these changes, which are discussed in further detail in Appendix C.1.
The following prefix structure appears at the start of any RPC-over-RDMA version 2 transport header.

<CODE BEGINS>
const RPCRDMA2_F_RESPONSE = 0x00000001;
const RPCRDMA2_F_MORE     = 0x00000002;

struct rpcrdma2_hdr_prefix {
        struct rpcrdma_common rdma_start;
        uint32                rdma_flags;
};
<CODE ENDS>
The rdma_flags field is new to RPC-over-RDMA version 2. Currently, the only flags defined within this word are the RPCRDMA2_F_RESPONSE flag and the RPCRDMA2_F_MORE flag. The other bits are reserved for future use, as described in Appendix B.2. The sender MUST set these reserved bits to zero.
The RPCRDMA2_F_RESPONSE flag qualifies the value contained in the transport header's rdma_start.rdma_xid field. The RPCRDMA2_F_RESPONSE flag enables a receiver to reliably avoid performing an XID lookup on incoming reverse direction Call messages.
In general, when a message carries an XID that was generated by the message's receiver (that is, the receiver is acting as a requester), the message's sender sets the RPCRDMA2_F_RESPONSE flag. Otherwise, that flag is clear.
The RPCRDMA2_F_MORE flag signifies that the RPC-over-RDMA message payload continues in the next message. This is referred to as Message Continuation, or Send chaining.
When the RPCRDMA2_F_MORE flag is asserted, the receiver is to concatenate the data payload of the next received message to the end of the data payload of the current received message. The sender clears the RPCRDMA2_F_MORE flag in the final message in the sequence.
All RPC-over-RDMA messages in such a sequence MUST have the same values in the rdma_start.rdma_xid and rdma_start.rdma_htype fields. If this constraint is not met, the receiver MUST respond with an RDMA2_ERROR message with the rdma_err field set to RDMA2_ERR_INVAL_FLAG.
If a peer receives an RPC-over-RDMA message where the RPCRDMA2_F_MORE flag is set and the rdma_start.rdma_htype field does not contain RDMA2_MSG or RDMA2_CONNPROP, the receiver MUST respond with an RDMA2_ERROR message with the rdma_err field set to RDMA2_ERR_INVAL_FLAG.
[ dnoveck: Both the above and your error in the existing third paragraph raise issues since they could be sent by a responder. Will need to fix RDMA2_ERROR so that this can be done when appropriate. ]
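A non-normative sketch in C of the Message Continuation checks described above: every message in a continued sequence must carry the same rdma_xid and rdma_htype, and only the RDMA2_MSG and RDMA2_CONNPROP header types may set RPCRDMA2_F_MORE. The constants follow this section; the flattened struct layout and function name are assumptions of this sketch.

<CODE BEGINS>
#include <stdint.h>
#include <stdbool.h>

#define RPCRDMA2_F_MORE 0x00000002
#define RDMA2_MSG       0
#define RDMA2_CONNPROP  5

struct hdr_prefix {
        uint32_t rdma_xid;
        uint32_t rdma_vers;
        uint32_t rdma_credit;
        uint32_t rdma_htype;
        uint32_t rdma_flags;
};

/* Returns true if the continued message is acceptable; otherwise the
 * receiver responds with RDMA2_ERROR / RDMA2_ERR_INVAL_FLAG. */
static bool continuation_ok(const struct hdr_prefix *first,
                            const struct hdr_prefix *next)
{
        if ((next->rdma_flags & RPCRDMA2_F_MORE) &&
            next->rdma_htype != RDMA2_MSG &&
            next->rdma_htype != RDMA2_CONNPROP)
                return false;
        return next->rdma_xid == first->rdma_xid &&
               next->rdma_htype == first->rdma_htype;
}
<CODE ENDS>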
When the RPCRDMA2_F_MORE flag is set in an individual message, that message's chunk lists MUST be empty. Chunks for a chained message may be conveyed in the final message in the sequence, whose RPCRDMA2_F_MORE flag is clear.
There is no protocol-defined limit on the number of concatenated messages in a sequence. If the sender exhausts the receiver's credit grant before the final message is sent, the sender MUST wait for a further credit grant from the receiver before continuing to send messages.
Credit exhaustion can occur at the receiver in the middle of a sequence of continued messages. To enable the sender to continue sending the remaining messages in the sequence, the receiver can grant more credits by sending an RPC message payload or an out-of-band credit grant (see Section 4.3.1.2).
<CODE BEGINS>
struct rpcrdma2_chunk_lists {
        uint32 rdma_inv_handle;
        struct rpcrdma2_read_list   *rdma_reads;
        struct rpcrdma2_write_list  *rdma_writes;
        struct rpcrdma2_write_chunk *rdma_reply;
};
<CODE ENDS>
The rpcrdma2_chunk_lists structure specifies how an RPC message is conveyed using explicit RDMA operations.
For the most part, this structure parallels its RPC-over-RDMA version 1 equivalent. That is, the rdma_reads, rdma_writes, and rdma_reply fields provide, respectively, descriptions of the chunks used to read a Long message or directly placed data from the requester, to write directly placed response data into the requester's memory, and to write a long reply into the requester's memory.
The chunks and chunk list structures follow the same rules as in Section 3.4 of [RFC8166], with these exceptions:
An important addition relative to the corresponding RPC-over-RDMA version 1 rdma_header structures is the rdma_inv_handle field. This field supports remote invalidation of requester memory registrations via the RDMA Send With Invalidate operation.
To request Remote Invalidation, a requester sets the value of the rdma_inv_handle field in an RPC Call's transport header to a non-zero value that matches one of the rdma_handle fields in that header. If none of the rdma_handle values in the header conveying the Call may be invalidated by the responder, the requester sets the RPC Call's rdma_inv_handle field to the value zero.
If the responder chooses not to use remote invalidation for this particular RPC Reply, or the RPC Call's rdma_inv_handle field contains the value zero, the responder uses RDMA Send to transmit the matching RPC reply.
If a requester has provided a non-zero value in the RPC Call's rdma_inv_handle field and the responder chooses to use Remote Invalidation for the matching RPC Reply, the responder uses RDMA Send With Invalidate to transmit that RPC reply, and uses the value in the corresponding Call's rdma_inv_handle field to construct the Send With Invalidate Work Request.
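The following non-normative sketch illustrates the responder's choice between RDMA Send and RDMA Send With Invalidate described above. The two helpers are placeholders standing in for posting the respective Work Requests, and all names are assumptions of this sketch.

<CODE BEGINS>
#include <stdint.h>
#include <stdbool.h>

/* Placeholders for posting the respective Work Requests. */
extern int post_send(const void *reply, uint32_t len);
extern int post_send_with_invalidate(const void *reply, uint32_t len,
                                     uint32_t invalidate_handle);

static int send_reply(const void *reply, uint32_t len,
                      uint32_t call_rdma_inv_handle, bool use_remote_inval)
{
        /* A zero rdma_inv_handle means the requester offered nothing to
         * invalidate; the responder may also simply decline. */
        if (call_rdma_inv_handle == 0 || !use_remote_inval)
                return post_send(reply, len);

        return post_send_with_invalidate(reply, len, call_rdma_inv_handle);
}
<CODE ENDS>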
The header types defined and used in RPC-over-RDMA version 1 are all carried over into RPC-over-RDMA version 2, although there may be limited changes in the definition of existing header types.
In comparison with the header types of RPC-over-RDMA version 1, the changes can be summarized as follows:
<CODE BEGINS>
const rpcrdma2_proc RDMA2_MSG = 0;

struct rpcrdma2_msg {
        struct rpcrdma2_chunk_lists rdma_chunks;

        /* The rpc message starts here and continues
         * through the end of the transmission. */
        uint32 rdma_rpc_first_word;
};
<CODE ENDS>
RDMA2_MSG is used to convey an RPC message that immediately follows the Transport Header in the Send buffer. This is either an RPC request that has no Position Zero Read chunk or an RPC reply that is not sent using a Reply chunk.
RDMA2_NOMSG can convey an entire RPC message payload using explicit RDMA operations. When an RPC message payload is present, this message type is also known as a Long message. In particular, it is a Long call when the responder reads the RPC payload from a memory area specified by a Position Zero Read chunk, and it is a Long reply when the responder writes the RPC payload into a memory area specified by a Reply chunk. In both of these cases, the rdma_xid field is set to the same value as the xid of the RPC message payload.
<CODE BEGINS>
const rpcrdma2_proc RDMA2_NOMSG = 1;

struct rpcrdma2_nomsg {
        struct rpcrdma2_chunk_lists rdma_chunks;
};
<CODE ENDS>
If all the chunk lists are empty (i.e., three 32-bit zeroes in the chunk list fields), the message conveys a credit grant refresh. The header prefix of this message contains a credit grant refresh in the rdma_credit field. In this case, the sender MUST set the rdma_xid field to zero.
In RPC-over-RDMA version 2, an alternative to using a Long message is to use Message Continuation.
<CODE BEGINS>
const rpcrdma2_proc RDMA2_ERROR = 4;

struct rpcrdma2_err_vers {
        uint32 rdma_vers_low;
        uint32 rdma_vers_high;
};

struct rpcrdma2_err_write {
        uint32 rdma_chunk_index;
        uint32 rdma_length_needed;
};

union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
        case RDMA2_ERR_VERS:
                rpcrdma2_err_vers rdma_vrange;
        case RDMA2_ERR_READ_CHUNKS:
                uint32 rdma_max_chunks;
        case RDMA2_ERR_WRITE_CHUNKS:
                uint32 rdma_max_chunks;
        case RDMA2_ERR_SEGMENTS:
                uint32 rdma_max_segments;
        case RDMA2_ERR_WRITE_RESOURCE:
                rpcrdma2_err_write rdma_writeres;
        case RDMA2_ERR_REPLY_RESOURCE:
                uint32 rdma_length_needed;
        default:
                void;
};
<CODE ENDS>
RDMA2_ERROR provides a way of reporting the occurrence of transport errors on a previous transmission. This header type MUST NOT be transmitted by a requester.
Error reporting is addressed in RPC-over-RDMA version 2 in a fashion similar to RPC-over-RDMA version 1, although several new error codes have been added, and error messages never flow from requester to responder. RPC-over-RDMA version 1 error reporting is described in Section 5 of [RFC8166].
Unless otherwise specified, in all cases below, the responder copies the values of the rdma_start.rdma_xid and rdma_start.rdma_vers fields from the incoming transport header that generated the error to the transport header of the error response. The responder sets the rdma_start.rdma_htype field of the transport header prefix to RDMA2_ERROR and sets the rdma_start.rdma_credit field to the credit grant value for this connection. The receiver of this header type MUST ignore the value of the rdma_start.rdma_credit field.
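A non-normative C sketch of these copy rules follows. The structure mirrors rpcrdma2_hdr_prefix from Section 7; byte-order conversion and the error body itself are omitted.

<CODE BEGINS>
   #include <stdint.h>

   #define RDMA2_ERROR 4

   /* Mirrors rpcrdma2_hdr_prefix from Section 7 (host byte order). */
   struct hdr_prefix {
           uint32_t rdma_xid;
           uint32_t rdma_vers;
           uint32_t rdma_credit;
           uint32_t rdma_htype;
           uint32_t rdma_flags;
   };

   static void
   fill_error_prefix(struct hdr_prefix *err,
                     const struct hdr_prefix *offender,
                     uint32_t credit_grant)
   {
           err->rdma_xid    = offender->rdma_xid;   /* copied from offender */
           err->rdma_vers   = offender->rdma_vers;  /* copied from offender */
           err->rdma_credit = credit_grant;         /* receiver MUST ignore */
           err->rdma_htype  = RDMA2_ERROR;
           err->rdma_flags  = 0;
   }
<CODE ENDS>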
The RDMA2_CONNPROP message type allows an RPC-over-RDMA endpoint, whether client or server, to indicate to its partner relevant transport properties that the partner might need to be aware of.
The message definition for this operation is as follows:

<CODE BEGINS>
   struct rpcrdma2_connprop {
      rpcrdma2_propset rdma_props;
   };
<CODE ENDS>
All relevant transport properties that the sender is aware of should be included in rdma_props. Since support of each of the properties is OPTIONAL, the sender cannot assume that the receiver will necessarily take note of these properties. The sender should be prepared for cases in which the receiver continues to assume that the default value for a particular property is still in effect.
Generally, a participant will send an RDMA2_CONNPROP message as the first message after a connection is established. Given that fact, the sender should make sure that the message can be received by peers that use the default Receive buffer size. The connection's initial Receive buffer size is typically 1KB, but it depends on the initial connection state of the RPC-over-RDMA version in use.
Properties not included in rdma_props are to be treated by the peer endpoint as having the default value and are not allowed to change subsequently. The peer should not request changes in such properties.
Those receiving an RDMA2_CONNPROP may encounter properties that they do not support or are unaware of. In such cases, these properties are simply ignored without any error response being generated.
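The following non-normative C sketch shows a receiver walking rdma_props and silently skipping unrecognized properties. The decoding helpers and the handle_known_prop() hook are assumptions made for the example; a real implementation would normally rely on its XDR routines.

<CODE BEGINS>
   #include <arpa/inet.h>
   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   #define RDMA2_PROPID_SBSIZ    1
   #define RDMA2_PROPID_HOSTAUTH 6

   /* Hypothetical consumer of recognized properties. */
   extern void handle_known_prop(uint32_t propid,
                                 const uint8_t *data, uint32_t len);

   static uint32_t
   get_u32(const uint8_t *p)
   {
           uint32_t v;

           memcpy(&v, p, sizeof(v));
           return ntohl(v);
   }

   /* Decode an rpcrdma2_propset: a count, then one propval per entry
    * (uint32 propid followed by a variable-length opaque, padded to a
    * 4-byte boundary).  Unknown properties are ignored. */
   static const uint8_t *
   decode_propset(const uint8_t *p, const uint8_t *end)
   {
           uint32_t i, count;

           if (end - p < 4)
                   return NULL;
           count = get_u32(p);
           p += 4;

           for (i = 0; i < count; i++) {
                   uint32_t propid, len, padded;

                   if (end - p < 8)
                           return NULL;
                   propid = get_u32(p);
                   len = get_u32(p + 4);
                   p += 8;
                   if ((size_t)(end - p) < len)
                           return NULL;
                   padded = len + ((4 - (len & 3)) & 3);
                   if ((size_t)(end - p) < padded)
                           return NULL;
                   if (propid >= RDMA2_PROPID_SBSIZ &&
                       propid <= RDMA2_PROPID_HOSTAUTH)
                           handle_known_prop(propid, p, len);
                   /* else: unknown property, silently ignored */
                   p += padded;
           }
           return p;
   }
<CODE ENDS>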
A requester provides any necessary registered memory resources for both an RPC Call message and its matching RPC Reply message. A requester forms each RPC Call itself, thus it can compute the exact memory resources needed to send every Call. However, the requester must allocate memory resources to receive the corresponding Reply before the responder has formed it. In some cases it is difficult for the requester to know in advance precisely what resources will be needed to receive the Reply.
In RPC-over-RDMA version 2, a requester MAY provide a Reply chunk at any time. The responder MAY use the provided Reply chunk or decide to use another means to convey the RPC Reply. If the combination of the provided Write chunk list and Reply chunk is not adequate to convey a Reply, the responder SHOULD use Message Continuation (see Section 6.3.2.2) to send that Reply.
If even that is not possible, the responder sends an RDMA2_ERROR message to the requester, as described in Section 6.4.3.
When receiving such errors, the requester SHOULD retry the ULP call using larger reply resources. In cases where retrying the ULP request is not possible, the requester terminates the RPC request and presents an error to the RPC consumer.
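A non-normative C sketch of this requester behavior appears below. The error code values come from the XDR in Section 7; retry_with_reply_chunk() and terminate_rpc() are hypothetical upper-layer hooks.

<CODE BEGINS>
   #include <errno.h>
   #include <stdint.h>

   #define RDMA2_ERR_WRITE_RESOURCE 8
   #define RDMA2_ERR_REPLY_RESOURCE 9

   /* Hypothetical upper-layer hooks. */
   extern int  retry_with_reply_chunk(uint32_t xid,
                                      uint32_t min_reply_bytes);
   extern void terminate_rpc(uint32_t xid, int error);

   static void
   handle_resource_error(uint32_t xid, uint32_t rdma_err,
                         uint32_t rdma_length_needed)
   {
           if (rdma_err == RDMA2_ERR_WRITE_RESOURCE ||
               rdma_err == RDMA2_ERR_REPLY_RESOURCE) {
                   /* Retry the ULP call with at least the indicated
                    * amount of registered reply space. */
                   if (retry_with_reply_chunk(xid, rdma_length_needed))
                           return;
           }
           terminate_rpc(xid, EIO);
   }
<CODE ENDS>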
This section contains a description of the core features of the RPC-over-RDMA version 2 protocol expressed in the XDR language [RFC4506].
Because of the need to provide for protocol extensibility without modifying an existing XDR definition, this description has some important structural differences from the corresponding XDR description for RPC-over-RDMA version 1, which appears in [RFC8166].
This description is divided into three parts:
This description is provided in a way that makes it simple to extract into ready-to-compile form. To enable the combination of this description with the descriptions of subsequent extensions to RPC-over-RDMA version 2, the extracted description can be combined with similar descriptions published later, or those descriptions can be compiled separately. Refer to Section 7.2 for details.
Code components extracted from this document must include the following license text. When the extracted XDR code is combined with other complementary XDR code which itself has an identical license, only a single copy of the license text need be preserved.

<CODE BEGINS>
/// /*
/// * Copyright (c) 2010-2018 IETF Trust and the persons
/// * identified as authors of the code. All rights reserved.
/// *
/// * The authors of the code are:
/// * B. Callaghan, T. Talpey, C. Lever, and D. Noveck.
/// *
/// * Redistribution and use in source and binary forms, with
/// * or without modification, are permitted provided that the
/// * following conditions are met:
/// *
/// * - Redistributions of source code must retain the above
/// * copyright notice, this list of conditions and the
/// * following disclaimer.
/// *
/// * - Redistributions in binary form must reproduce the above
/// * copyright notice, this list of conditions and the
/// * following disclaimer in the documentation and/or other
/// * materials provided with the distribution.
/// *
/// * - Neither the name of Internet Society, IETF or IETF
/// * Trust, nor the names of specific contributors, may be
/// * used to endorse or promote products derived from this
/// * software without specific prior written permission.
/// *
/// * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
/// * AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
/// * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
/// * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
/// * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
/// * EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
/// * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
/// * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
/// * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
/// * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
/// * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
/// * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
/// * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
/// * IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
/// * ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
/// */
///
<CODE ENDS>
The reader can apply the following sed script to this document to produce a machine-readable XDR description of the RPC-over-RDMA version 2 protocol without any OPTIONAL extensions.

<CODE BEGINS>
   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p'
<CODE ENDS>

That is, if this document is in a file called "spec.txt", then the reader can do the following to extract an XDR description file and store it in the file rpcrdma-v2.x:

<CODE BEGINS>
   sed -n -e 's:^ */// ::p' -e 's:^ *///$::p' \
       < spec.txt > rpcrdma-v2.x
<CODE ENDS>
Although this file is a usable description of the base protocol, when extensions are to be supported, it may be desirable to divide the description into multiple files. The following script can be used for that purpose:

<CODE BEGINS>
   #!/usr/local/bin/perl
   open(IN,"rpcrdma-v2.x");
   open(OUT,">temp.x");
   while(<IN>)
   {
     if (m/FILE ENDS: (.*)$/) {
       close(OUT);
       rename("temp.x", $1);
       open(OUT,">temp.x");
     }
     else {
       print OUT $_;
     }
   }
   close(IN);
   close(OUT);
<CODE ENDS>
Running the above script will result in two files:
Optional extensions to RPC-over-RDMA version 2, published as Standards Track documents, will have similar means of providing XDR that describes those extensions. Once XDR for all desired extensions is also extracted, it can be appended to the XDR description file extracted from this document to produce a consolidated XDR description file reflecting all extensions selected for an RPC-over-RDMA implementation.
Alternatively, the XDR descriptions can be compiled separately. In that case, the combination of common.x and baseops.x defines the base transport, while the XDR description for each extension consists of the XDR from the document defining that extension together with the file common.x obtained from this document.
<CODE BEGINS>
/// /*******************************************************************
/// * Transport Header Prefixes
/// ******************************************************************/
///
/// struct rpcrdma_common {
///    uint32 rdma_xid;
///    uint32 rdma_vers;
///    uint32 rdma_credit;
///    uint32 rdma_htype;
/// };
///
/// const RPCRDMA2_F_RESPONSE = 0x00000001;
/// const RPCRDMA2_F_MORE     = 0x00000002;
///
/// struct rpcrdma2_hdr_prefix {
///    struct rpcrdma_common rdma_start;
///    uint32                rdma_flags;
/// };
///
/// /*******************************************************************
/// * Chunks and Chunk Lists
/// ******************************************************************/
///
/// struct rpcrdma2_segment {
///    uint32 rdma_handle;
///    uint32 rdma_length;
///    uint64 rdma_offset;
/// };
///
/// struct rpcrdma2_read_segment {
///    uint32                  rdma_position;
///    struct rpcrdma2_segment rdma_target;
/// };
///
/// struct rpcrdma2_read_list {
///    struct rpcrdma2_read_segment rdma_entry;
///    struct rpcrdma2_read_list    *rdma_next;
/// };
///
/// struct rpcrdma2_write_chunk {
///    struct rpcrdma2_segment rdma_target<>;
/// };
///
/// struct rpcrdma2_write_list {
///    struct rpcrdma2_write_chunk rdma_entry;
///    struct rpcrdma2_write_list  *rdma_next;
/// };
///
/// struct rpcrdma2_chunk_lists {
///    uint32                      rdma_inv_handle;
///    struct rpcrdma2_read_list   *rdma_reads;
///    struct rpcrdma2_write_list  *rdma_writes;
///    struct rpcrdma2_write_chunk *rdma_reply;
/// };
///
/// /*******************************************************************
/// * Transport Properties
/// ******************************************************************/
///
/// /*
/// * Types for transport properties model
/// */
/// typedef uint32 rpcrdma2_propid;
///
/// struct rpcrdma2_propval {
///    rpcrdma2_propid rdma_which;
///    opaque          rdma_data<>;
/// };
///
/// typedef rpcrdma2_propval rpcrdma2_propset<>;
/// typedef uint32           rpcrdma2_propsubset<>;
///
/// /*
/// * Transport propid values for basic properties
/// */
/// const uint32 RDMA2_PROPID_SBSIZ = 1;
/// const uint32 RDMA2_PROPID_RBSIZ = 2;
/// const uint32 RDMA2_PROPID_RSSIZ = 3;
/// const uint32 RDMA2_PROPID_RCSIZ = 4;
/// const uint32 RDMA2_PROPID_BRS = 5;
/// const uint32 RDMA2_PROPID_HOSTAUTH = 6;
///
/// /*
/// * Types specific to particular properties
/// */
/// typedef uint32 rpcrdma2_prop_sbsiz;
/// typedef uint32 rpcrdma2_prop_rbsiz;
/// typedef uint32 rpcrdma2_prop_rssiz;
/// typedef uint32 rpcrdma2_prop_rcsiz;
/// typedef uint32 rpcrdma2_prop_brs;
/// typedef opaque rpcrdma2_prop_hostauth<>;
///
/// const uint32 RDMA_RVREQSUP_NONE = 0;
/// const uint32 RDMA_RVREQSUP_INLINE = 1;
/// const uint32 RDMA_RVREQSUP_GENL = 2;
///
/// /* FILE ENDS: common.x; */
<CODE ENDS>
<CODE BEGINS>
/// /*******************************************************************
/// * Descriptions of RPC-over-RDMA Header Types
/// ******************************************************************/
///
/// /*
/// * Header Type Codes.
/// */
/// const rpcrdma2_proc RDMA2_MSG = 0;
/// const rpcrdma2_proc RDMA2_NOMSG = 1;
/// const rpcrdma2_proc RDMA2_ERROR = 4;
/// const rpcrdma2_proc RDMA2_CONNPROP = 5;
///
/// /*
/// * Header Types to Convey RPC Messages.
/// */
/// struct rpcrdma2_msg {
///    struct rpcrdma2_chunk_lists rdma_chunks;
///
///    /* The rpc message starts here and continues
///     * through the end of the transmission. */
///    uint32 rdma_rpc_first_word;
/// };
///
/// struct rpcrdma2_nomsg {
///    struct rpcrdma2_chunk_lists rdma_chunks;
/// };
///
/// /*
/// * Header Type to Report Errors.
/// */
/// const uint32 RDMA2_ERR_VERS = 1;
/// const uint32 RDMA2_ERR_BAD_XDR = 2;
/// const uint32 RDMA2_ERR_INVAL_HTYPE = 3;
/// const uint32 RDMA2_ERR_INVAL_FLAG = 4;
/// const uint32 RDMA2_ERR_READ_CHUNKS = 5;
/// const uint32 RDMA2_ERR_WRITE_CHUNKS = 6;
/// const uint32 RDMA2_ERR_SEGMENTS = 7;
/// const uint32 RDMA2_ERR_WRITE_RESOURCE = 8;
/// const uint32 RDMA2_ERR_REPLY_RESOURCE = 9;
/// const uint32 RDMA2_ERR_SYSTEM = 10;
///
/// struct rpcrdma2_err_vers {
///    uint32 rdma_vers_low;
///    uint32 rdma_vers_high;
/// };
///
/// struct rpcrdma2_err_write {
///    uint32 rdma_chunk_index;
///    uint32 rdma_length_needed;
/// };
///
/// union rpcrdma2_error switch (rpcrdma2_errcode rdma_err) {
///    case RDMA2_ERR_VERS:
///       rpcrdma2_err_vers rdma_vrange;
///    case RDMA2_ERR_READ_CHUNKS:
///       uint32 rdma_max_chunks;
///    case RDMA2_ERR_WRITE_CHUNKS:
///       uint32 rdma_max_chunks;
///    case RDMA2_ERR_SEGMENTS:
///       uint32 rdma_max_segments;
///    case RDMA2_ERR_WRITE_RESOURCE:
///       rpcrdma2_err_write rdma_writeres;
///    case RDMA2_ERR_REPLY_RESOURCE:
///       uint32 rdma_length_needed;
///    default:
///       void;
/// };
///
/// /*
/// * Header Type to Exchange Transport Properties.
/// */
/// struct rpcrdma2_connprop {
///    rpcrdma2_propset rdma_props;
/// };
///
/// /* FILE ENDS: baseops.x; */
<CODE ENDS>
The two files common.x and baseops.x, when combined with the XDR descriptions for extensions defined later, produce a human-readable and compilable description of the RPC-over-RDMA version 2 protocol with the included extensions.
Although this XDR description can be useful in generating code to encode and decode the transport and payload streams, there are elements of the structure of RPC-over-RDMA version 2 which are not expressible within the XDR language as currently defined. This requires implementations that use the output of the XDR processor to provide additional code to bridge the gaps.
To summarize, the role of XDR in this specification is more limited than for protocols which are themselves XDR programs, where the totality of the protocol is expressible within the XDR paradigm established for that purpose. This more limited role reflects the fact that XDR lacks facilities to represent the embedding of transported material within the transport framework. In addition, the need to cleanly accommodate extensions has meant that those using rpcgen in their applications need to take a more active role in providing the facilities that cannot be expressed within XDR.
In setting up a new RDMA connection, the first action by an RPC client is to obtain a transport address for the RPC server. The means used to obtain this address and to open an RDMA connection is dependent on the type of RDMA transport, and is the responsibility of each RPC protocol binding and its local implementation.
RPC services normally register with a portmap or rpcbind service [RFC1833], which associates an RPC Program number with a service address. This policy is no different with RDMA transports. However, a different and distinct service address (port number) might sometimes be required for ULP operation with RPC-over-RDMA.
When mapped atop the iWARP transport [RFC5040] [RFC5041], which uses IP port addressing due to its layering on TCP and/or SCTP, port mapping is trivial and consists merely of issuing the port in the connection process. The NFS/RDMA protocol service address has been assigned port 20049 by IANA, for both iWARP/TCP and iWARP/SCTP [RFC8267].
When mapped atop InfiniBand [IBA], which uses a service endpoint naming scheme based on a Group Identifier (GID), a translation MUST be employed. One such translation is described in Annexes A3 (Application Specific Identifiers), A4 (Sockets Direct Protocol (SDP)), and A11 (RDMA IP CM Service) of [IBA], which is appropriate for translating IP port addressing to the InfiniBand network. Therefore, in this case, IP port addressing may be readily employed by the upper layer.
When a mapping standard or convention exists for IP ports on an RDMA interconnect, there are several possibilities for each upper layer to consider:
Historically, different RPC protocols have taken different approaches to their port assignment. Therefore, the specific method is left to each RPC-over-RDMA-enabled ULB and is not addressed in this document.
[RFC8166] defines two new netid values to be used for registration of upper layers atop iWARP [RFC5040] [RFC5041] and (when a suitable port translation service is available) InfiniBand [IBA]. Additional RDMA-capable networks MAY define their own netids, or if they provide a port translation, they MAY share the one defined in [RFC8166].
This section records the status of known implementations of the protocol defined by this specification at the time of posting of this Internet-Draft, and is based on a proposal described in [RFC7942]. The description of implementations in this section is intended to assist the IETF in its decision processes in progressing drafts to RFCs.
Please note that the listing of any individual implementation here does not imply endorsement by the IETF. Furthermore, no effort has been spent to verify the information presented here that was supplied by IETF contributors. This is not intended as, and must not be construed to be, a catalog of available implementations or their features. Readers are advised to note that other implementations may exist.
At this time, no known implementations of the protocol described in this document exist.
A primary consideration is the protection of the integrity and confidentiality of host memory by an RPC-over-RDMA transport. The use of an RPC-over-RDMA transport protocol MUST NOT introduce vulnerabilities to system memory contents nor to memory owned by user processes.
It is REQUIRED that any RDMA provider used for RPC transport be conformant to the requirements of [RFC5042] in order to satisfy these protections. These protections are provided by the RDMA layer specifications, and in particular, their security models.
The use of Protection Domains to limit the exposure of memory regions to a single connection is critical. Any attempt by an endpoint not participating in that connection to reuse memory handles needs to result in immediate failure of that connection. Because ULP security mechanisms rely on this aspect of Reliable Connection behavior, strong authentication of remote endpoints is recommended.
Unpredictable memory handles should be used for any operation requiring advertised memory regions. Advertising a continuously registered memory region allows a remote host to read or write to that region even when an RPC involving that memory is not under way. Therefore, implementations should avoid advertising persistently registered memory.
Requesters should register memory regions for remote access only when they are about to be the target of an RPC operation that involves an RDMA Read or Write.
Registered memory regions should be invalidated as soon as related RPC operations are complete. Invalidation and DMA unmapping of memory regions should be complete before message integrity checking is done and before the RPC consumer is allowed to continue execution and use or alter the contents of a memory region.
An RPC transaction on a Requester might be terminated before a reply arrives if the RPC consumer exits unexpectedly (for example, it is signaled or a segmentation fault occurs). When an RPC terminates abnormally, memory regions associated with that RPC should be invalidated appropriately before the regions are released to be reused for other purposes on the Requester.
A detailed discussion of denial-of-service exposures that can result from the use of an RDMA transport is found in Section 6.4 of [RFC5042].
A Responder is not obliged to pull Read chunks that are unreasonably large. The Responder can use an RDMA2_ERROR response to terminate RPCs with unreadable Read chunks. If a Responder transmits more data than a Requester is prepared to receive in a Write or Reply chunk, the RDMA Network Interface Cards (RNICs) typically terminate the connection. For further discussion, see Section 6.4.3. Such repeated chunk errors can deny service to other users sharing the connection from the errant Requester.
An RPC-over-RDMA transport implementation is not responsible for throttling the RPC request rate, other than to keep the number of concurrent RPC transactions at or under the number of credits granted per connection. This is explained in Section 4.3.1. A sender can trigger a self denial of service by exceeding the credit grant repeatedly.
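The following non-normative C sketch shows the send-side check implied by this rule; the structure and field names are illustrative.

<CODE BEGINS>
   #include <stdint.h>

   struct credit_state {
           uint32_t granted;    /* most recent credit grant from the peer */
           uint32_t in_flight;  /* Sends posted but not yet answered */
   };

   /*
    * Returns 1 if another RPC may be sent now, 0 if the caller must
    * queue the request until a reply retires an outstanding transaction.
    */
   static int
   may_send(struct credit_state *cs)
   {
           if (cs->in_flight >= cs->granted)
                   return 0;    /* exceeding the grant risks a
                                 * self-inflicted denial of service */
           cs->in_flight++;
           return 1;
   }
<CODE ENDS>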
When an RPC has been canceled due to a signal or premature exit of an application process, a Requester typically invalidates the RPC's Write and Reply chunks. Invalidation prevents the subsequent arrival of the Responder's reply from altering the memory regions associated with those chunks after the memory has been reused.
On the Requester, a malfunctioning application or a malicious user can create a situation where RPCs are continuously initiated and then aborted, resulting in Responder replies that terminate the underlying RPC-over-RDMA connection repeatedly. Such situations can deny service to other users sharing the connection from that Requester.
ONC RPC provides cryptographic security via the RPCSEC_GSS framework [RFC7861]. RPCSEC_GSS implements message authentication (rpc_gss_svc_none), per-message integrity checking (rpc_gss_svc_integrity), and per-message confidentiality (rpc_gss_svc_privacy) in the layer above the RPC-over-RDMA transport. The latter two services require significant computation and movement of data on each endpoint host. Some performance benefits enabled by RDMA transports can be lost.
For any RPC transport, utilizing RPCSEC_GSS integrity or privacy services has performance implications. Protection below the RPC transport is often more appropriate in performance-sensitive deployments, especially if it, too, can be offloaded. Certain configurations of IPsec can be co-located in RDMA hardware, for example, without change to RDMA consumers and little loss of data movement efficiency. Such arrangements can also provide a higher degree of privacy by hiding endpoint identity or altering the frequency at which messages are exchanged, at a performance cost.
The use of protection in a lower layer MAY be negotiated through the use of an RPCSEC_GSS security flavor defined in [RFC7861] in conjunction with the Channel Binding mechanism [RFC5056] and IPsec Channel Connection Latching [RFC5660]. Use of such mechanisms is REQUIRED where integrity or confidentiality is desired and where efficiency is required.
Not all RDMA devices and fabrics support the above protection mechanisms. Also, per-message authentication is still required on NFS clients where multiple users access NFS files. In these cases, RPCSEC_GSS can protect NFS traffic conveyed on RPC-over-RDMA connections.
RPCSEC_GSS extends the ONC RPC protocol without changing the format of RPC messages. By observing the conventions described in this section, an RPC-over-RDMA transport can convey RPCSEC_GSS-protected RPC messages interoperably.
As part of the ONC RPC protocol, protocol elements of RPCSEC_GSS that appear in the Payload stream of an RPC-over-RDMA message (such as control messages exchanged as part of establishing or destroying a security context or data items that are part of RPCSEC_GSS authentication material) MUST NOT be reduced.
Some NFS client implementations use a separate connection to establish a Generic Security Service (GSS) context for NFS operation. Such clients use TCP and the standard NFS port (2049) for context establishment. To enable the use of RPCSEC_GSS with NFS/RDMA, an NFS server MUST also provide a TCP-based NFS service on port 2049.
The RPCSEC_GSS authentication service has no impact on the DDP-eligibility of data items in a ULP.
However, RPCSEC_GSS authentication material appearing in an RPC message header can be larger than, say, an AUTH_SYS authenticator. In particular, when an RPCSEC_GSS pseudoflavor is in use, a Requester needs to accommodate a larger RPC credential when marshaling RPC Call messages and needs to provide for a maximum size RPCSEC_GSS verifier when allocating reply buffers and Reply chunks.
RPC messages, and thus Payload streams, are made larger as a result. ULP operations that fit in a Short Message when a simpler form of authentication is in use might need to be reduced or conveyed via a Long Message when RPCSEC_GSS authentication is in use. It is more likely that a Requester provides both a Read list and a Reply chunk in the same RPC-over-RDMA Transport header to convey a Long Call and provision a receptacle for a Long Reply.
In addition to this cost, the XDR encoding and decoding of each RPC message using RPCSEC_GSS authentication requires host compute resources to construct the GSS verifier.
The RPCSEC_GSS integrity service enables endpoints to detect modification of RPC messages in flight. The RPCSEC_GSS privacy service prevents all but the intended recipient from viewing the cleartext content of RPC arguments and results. RPCSEC_GSS integrity and privacy services are end-to-end. They protect RPC arguments and results from application to server endpoint, and back.
The RPCSEC_GSS integrity and encryption services operate on whole RPC messages after they have been XDR encoded for transmit, and before they have been XDR decoded after receipt. Both sender and receiver endpoints use intermediate buffers to prevent exposure of encrypted data or unverified cleartext data to RPC consumers. After verification, encryption, and message wrapping has been performed, the transport layer MAY use RDMA data transfer between these intermediate buffers.
The process of reducing a DDP-eligible data item removes the data item and its XDR padding from the encoded Payload stream. XDR padding of a reduced data item is not transferred in a normal RPC-over-RDMA message. After reduction, the Payload stream contains fewer octets than the whole XDR stream did beforehand. XDR padding octets are often zero bytes, but they don't have to be. Thus, reducing DDP-eligible items affects the result of message integrity verification or encryption.
Therefore, a sender MUST NOT reduce a Payload stream when RPCSEC_GSS integrity or encryption services are in use. Effectively, no data item is DDP-eligible in this situation, and Chunked Messages cannot be used. In this mode, an RPC-over-RDMA transport operates in the same manner as a transport that does not support DDP.
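A non-normative C sketch of this sender-side gate follows; the enumerators are illustrative stand-ins for an implementation's own RPCSEC_GSS state.

<CODE BEGINS>
   #include <stdbool.h>

   enum gss_service { GSS_SVC_NONE, GSS_SVC_INTEGRITY, GSS_SVC_PRIVACY };

   /*
    * When RPCSEC_GSS integrity or privacy protects the RPC message, no
    * data item is treated as DDP-eligible, so the Payload stream is
    * never reduced.
    */
   static bool
   may_reduce_data_item(bool ddp_eligible_per_ulb, enum gss_service svc)
   {
           if (svc == GSS_SVC_INTEGRITY || svc == GSS_SVC_PRIVACY)
                   return false;   /* MUST NOT reduce */
           return ddp_eligible_per_ulb;
   }
<CODE ENDS>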
When an RPCSEC_GSS integrity or privacy service is in use, a Requester provides both a Read list and a Reply chunk in the same RPC-over-RDMA header to convey a Long Call and provision a receptacle for a Long Reply.
Like the base fields in an ONC RPC message (XID, call direction, and so on), the contents of an RPC-over-RDMA message's Transport stream are not protected by RPCSEC_GSS. This exposes XIDs, connection credit limits, and chunk lists (but not the content of the data items they refer to) to malicious behavior, which could redirect data that is transferred by the RPC-over-RDMA message, result in spurious retransmits, or trigger connection loss.
In particular, if an attacker alters the information contained in the chunk lists of an RPC-over-RDMA Transport header, data contained in those chunks can be redirected to other registered memory regions on Requesters. An attacker might alter the arguments of RDMA Read and RDMA Write operations on the wire to similar effect. If such alterations occur, the use of RPCSEC_GSS integrity or privacy services enable a Requester to detect unexpected material in a received RPC message.
Encryption at lower layers, as described in Section 10.2.1, protects the content of the Transport stream. To address attacks on RDMA protocols themselves, RDMA transport implementations should conform to [RFC5042].
Like other fields that appear in each RPC-over-RDMA header, property information is sent in the clear on the fabric with no integrity protection, making it vulnerable to man-in-the-middle attacks.
For example, if a man-in-the-middle were to change the value of the Receive buffer size or the Requester Remote Invalidation boolean, it could reduce connection performance or trigger loss of connection. Repeated connection loss can impact performance or even prevent a new connection from being established. Recourse is to deploy on a private network or use link-layer encryption.
This section uses the relevant sections of [RFC3552] to analyze the addition of host authentication to this RPC-over-RDMA transport.
The authors refer readers to Appendix C of [RFC8446] for information on how to design and test a secure authentication handshake implementation.
The RPC-over-RDMA family of transports have been assigned RPC netids by [RFC8166]. A netid is an rpcbind [RFC1833] string used to identify the underlying protocol in order for RPC to select appropriate transport framing and the format of the service addresses and ports.
NC_RDMA "rdma" NC_RDMA6 "rdma6"
The following netid registry strings are already defined for this purpose:
The "rdma" netid is to be used when IPv4 addressing is employed by the underlying transport, and "rdma6" when IPv6 addressing is employed. The netid assignment policy and registry are defined in [RFC5665]. The current document does not alter these netid assignments.
These netids MAY be used for any RDMA network that satisfies the requirements of Section 3.2.2 and that is able to identify service endpoints using IP port addressing, possibly through use of a translation service as described in Section 8.
An Upper-Layer Protocol (ULP) is typically defined independently of any particular RPC transport. An Upper-Layer Binding (ULB) specification provides guidance that helps the ULP interoperate correctly and efficiently over a particular transport. For RPC-over-RDMA version 2, a ULB may provide:
Each RPC Program and Version tuple that utilizes RPC-over-RDMA version 2 needs to have a ULB specification.
A ULB designates some XDR data items as eligible for DDP. As an RPC-over-RDMA message is formed, DDP-eligible data items can be removed from the Payload stream and placed directly in the receiver's memory. An XDR data item should be considered for DDP-eligibility if there is a clear benefit to moving the contents of the item directly from the sender's memory to the receiver's memory.
Criteria for DDP-eligibility include:
In addition to defining the set of data items that are DDP-eligible, a ULB may also limit the use of chunks to particular upper-layer procedures. If more than one data item in a procedure is DDP-eligible, the ULB may also limit the number of chunks that a requester can provide for a particular upper-layer procedure.
Senders MUST NOT reduce data items that are not DDP-eligible. Such data items MAY, however, be moved as part of a Position Zero Read chunk or a Reply chunk.
The programming interface by which an upper-layer implementation indicates the DDP-eligibility of a data item to the RPC transport is not described by this specification. The only requirements are that the receiver can re-assemble the transmitted RPC-over-RDMA message into a valid XDR stream, and that DDP-eligibility rules specified by the ULB are respected.
There is no provision to express DDP-eligibility within the XDR language. The only definitive specification of DDP-eligibility is a ULB.
In general, a DDP-eligibility violation occurs when:
When expecting small and moderately-sized Replies, a requester should typically rely on Message Continuation rather than provisioning a Reply chunk. For each ULP procedure where there is no clear Reply size maximum and the maximum can be large, the ULB should specify a dependable means for determining the maximum Reply size.
There may be other details provided in a ULB.
Each ULB needs to be designed to allow correct interoperation without regard to the transport parameters actually in use. Furthermore, implementations of ULPs must be designed to interoperate correctly regardless of the connection parameters in effect on a connection.
An RPC Program and Version tuple may be extensible. For instance, there may be a minor versioning scheme that is not reflected in the RPC version number, or the ULP may allow additional features to be specified after the original RPC Program specification was ratified. ULBs are provided for interoperable RPC Programs and Versions by extending existing ULBs to reflect the changes made necessary by each addition to the existing XDR.
This Appendix is not addressed to protocol implementers, but rather to authors of documents that intend to extend the protocol described earlier in this document.
Subsequent RPC-over-RDMA versions are free to change the protocol in any way they choose as long as they leave unchanged those fields identified as "fixed for all versions" in Section 4.2.1 of [RFC8166].
Such changes might involve deletion or major re-organization of existing transport headers. However, the need for interoperability between adjacent versions will often limit the scope of changes that can be made in a single version.
In some cases it may prove desirable to transition to a new version by using the extension features described for use with RPC-over-RDMA version 2, by continuing the same basic extension model but allowing header types and properties that were OPTIONAL in one version to become REQUIRED in the subsequent version.
RPC-over-RDMA version 2 is designed to be extensible in a way that enables the addition of OPTIONAL features that may subsequently be converted to REQUIRED status in a future protocol version. The protocol may be extended by Standards Track documents in a way analogous to that provided for Network File System Version 4 as described in [RFC8178].
This form of extensibility enables limited extensions to the base RPC-over-RDMA version 2 protocol presented in this document so that new optional capabilities can be introduced without a protocol version change, while maintaining robust interoperability with existing RPC-over-RDMA version 2 implementations. The design allows extensions to be defined, including the definition of new protocol elements, without requiring modification or recompilation of the existing XDR.
A Standards Track document introduces each set of such protocol elements. Together these elements are considered an OPTIONAL feature. Each implementation is either aware of all the protocol elements introduced by that feature or is aware of none of them.
Documents describing extensions to RPC-over-RDMA version 2 should contain:
Implementers combine the XDR descriptions of the new features they intend to use with the XDR description of the base protocol in this document. This may be necessary to create a valid XDR input file because extensions are free to use XDR types defined in the base protocol, and later extensions may use types defined by earlier extensions.
The XDR description for the RPC-over-RDMA version 2 base protocol combined with that for any selected extensions should provide an adequate human-readable description of the extended protocol.
The base protocol specified in this document may be extended within RPC-over-RDMA version 2 in two ways:
The following sorts of ancillary protocol elements may be added to the protocol to support the addition of new transport properties and header types.
New capabilities can be proposed and developed independently of each other, and implementers can choose among them. This makes it straightforward to create and document experimental features and then bring them through the standards process.
New transport header types are to be defined in a manner similar to the way existing ones are described in Sections 6.4.1 through 6.4.4. Specifically, what is needed is:
In addition, there needs to be additional documentation that is made necessary due to the Optional status of new transport header types.
New flag bits are to be defined in a manner similar to the way existing ones are described in Sections 6.3.2.1 and 6.3.2.2. Each new flag definition should include:
In addition, there needs to be additional documentation that is made necessary due to the Optional status of new flag bits.
The set of transport properties is designed to be extensible. As a result, once new properties are defined in standards track documents, the operations defined in this document may reference these new transport properties, as well as the ones described in this document.
A standards track document defining a new transport property should include the following information paralleling that provided in this document for the transport properties defined herein.
The transport property structures are defined so that unique values are easy to assign. There is no requirement that a continuous set of values be used, and implementations should not rely on all such values being small integers. A unique value should be selected when the defining document is first published as an Internet-Draft. When the document becomes a Standards Track document, the working group should ensure that:
Documents defining new properties fall into a number of categories.
When additional transport properties are proposed, the review of the associated standards track document should deal with possible security issues raised by those new transport properties.
New error codes to be returned when using new header types may be introduced in the same Standards Track document that defines the new header type. Cases in which a new error code is to be returned by an existing header type can be accommodated by defining the new error code in the same Standards Track document that defines the new transport property.
For error codes that do not require additional error information to be returned with them, the existing RDMA2_ERROR header type can be used to report the new error. The new error code is set as the value of rdma_err, with the result that the default switch arm of rpcrdma2_error (i.e., void) is selected.
For error codes that do require the return of additional error-related information together with the error, a new header type should be defined for the purpose of returning the error together with needed additional information. It should be documented just like any other new header type.
When a new header type is sent, the sender needs to be prepared to accept header types necessary to report associated errors.
This section describes the substantive changes made in RPC-over-RDMA version 2.
There are a number of structural XDR changes whose goal is to enable within-version protocol extensibility.
The RPC-over-RDMA version 1 transport header is defined as a single XDR object, with an RPC message proper potentially following it. In RPC-over-RDMA version 2, as described in Section 6.1, there are separate XDR definitions of the transport header prefix (see Section 6.3.2), which specifies the transport header type to be used, and of the specific transport header, which is defined within one of the subsections of Section 6. This is similar to the way that an RPC message consists of an RPC header (defined in [RFC5531]) and an RPC request or reply, defined by the Upper-Layer Protocol being conveyed.
As a new version of the RPC-over-RDMA transport protocol, RPC-over-RDMA version 2 exists within the versioning rules defined in [RFC8166]. In particular, it maintains the first four words of the protocol header as sent and received, as specified in Section 4.2 of [RFC8166], even though, as explained in Section 6.3.1 of this document, the XDR definition of those words is structured differently.
Although each of the first four words retains its semantic function, there are important differences of field interpretation, beyond the fact that the words have different names and different roles within the XDR constructs of which they are part.
Beyond conforming to the restrictions specified in [RFC8166], RPC-over-RDMA version 2 tightly limits the scope of the changes made in order to ensure interoperability. It makes no major structural changes to the protocol, and all existing transport header types used in version 1 (as defined in [RFC8166]) are retained in version 2. Chunks are expressed using the same on-the-wire format and are used in the same way in both versions.
RPC-over-RDMA version 2 provides a mechanism for exchanging the transport's operational properties. This mechanism allows connection endpoints to communicate the properties of their implementation at connection setup. The mechanism could be expanded to enable an endpoint to request changes in properties of the other endpoint and to notify peer endpoints of changes to properties that occur during operation. Transport properties are described in Section 5.
RPC-over-RDMA transports employ credit-based flow control to ensure that a requester does not emit more RDMA Sends than the responder is prepared to receive. Section 3.3.1 of [RFC8166] explains the purpose and operation of RPC-over-RDMA version 1 credit management in detail.
In the RPC-over-RDMA version 1 design, each RDMA Send from a requester contains an RPC Call with a credit request, and each RDMA Send from a responder contains an RPC Reply with a credit grant. The credit grant implies that enough Receives have been posted on the responder to handle the credit grant minus the number of pending RPC transactions (the number of remaining Receive buffers might be zero).
In other words, each RPC Reply acts as an implicit ACK for a previous RPC Call from the requester, indicating that the responder has posted a Receive to replace the Receive consumed by the requester's RDMA Send. Without an RPC Reply message, the requester has no way to know that the responder is properly prepared for subsequent RPC Calls.
Aside from being a bit of a layering violation, there are basic (but rare) cases where this arrangement is inadequate:
Typically, the connection must be replaced in these cases. This resets the credit accounting mechanism but has an undesirable impact on other ongoing RPC transactions on that connection.
Because credit management accompanies each RPC message, there is a strict one-to-one ratio between RDMA Send and RPC message. There are interesting use cases that might be enabled if this relationship were more flexible:
Bi-directional RPC operation also introduces an ambiguity. If the RPC-over-RDMA message does not carry an RPC message, then it is not possible to determine whether the sender is a requester or a responder, and thus whether the rdma_credit field contains a credit request or a credit grant.
A more sophisticated credit accounting mechanism is provided in RPC-over-RDMA version 2 in an attempt to address some of these shortcomings. This new mechanism is detailed in Section 4.3.1.
The term "inline threshold" is defined in Section 3.3.2 of [RFC8166]. An "inline threshold" value is the largest message size (in octets) that can be conveyed on an RDMA connection using only RDMA Send and Receive. Each connection has two inline threshold values: one for messages flowing from client-to-server (referred to as the "client-to-server inline threshold") and one for messages flowing from server-to-client (referred to as the "server-to-client inline threshold"). Note that [RFC8166] uses somewhat different terminology. This is because it was written with only forward-direction RPC transactions in mind.
A connection's inline thresholds determine when RDMA Read or Write operations are required because the RPC message to be sent cannot be conveyed via a single RDMA Send and Receive pair. When an RPC message does not contain DDP-eligible data items, a requester can prepare a Long Call or Reply to convey the whole RPC message using RDMA Read or Write operations.
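A non-normative C sketch of this dispatch decision, for a message with no DDP-eligible data items, is shown below; the names are illustrative.

<CODE BEGINS>
   #include <stddef.h>

   enum xmit_method { XMIT_SHORT, XMIT_LONG_OR_CONTINUED };

   /*
    * Use a single Send when the transport header plus the RPC message
    * fit under the relevant inline threshold; otherwise fall back to a
    * Long message or, in version 2, Message Continuation.
    */
   static enum xmit_method
   choose_xmit(size_t hdr_len, size_t rpc_len, size_t inline_threshold)
   {
           if (hdr_len + rpc_len <= inline_threshold)
                   return XMIT_SHORT;
           return XMIT_LONG_OR_CONTINUED;
   }
<CODE ENDS>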
RDMA Read and Write operations require that each data payload resides in a region of memory that is registered with the RNIC. When an RPC is complete, that region is invalidated, fencing it from the responder. Memory registration and invalidation typically have a latency cost that is insignificant compared to data handling costs. When a data payload is small, however, the cost of registering and invalidating the memory where the payload resides becomes a relatively significant part of total RPC latency. Therefore the most efficient operation of RPC-over-RDMA occurs when explicit RDMA Read and Write operations are used for large payloads, and are avoided for small payloads.
When RPC-over-RDMA version 1 was conceived, the typical size of RPC messages that did not involve a significant data payload was under 500 bytes. A 1024-byte inline threshold adequately minimized the frequency of inefficient Long messages.
With NFS version 4.1 [RFC5661], the increased size of NFS COMPOUND operations resulted in RPC messages that are on average larger and more complex than previous versions of NFS. With 1024-byte inline thresholds, RDMA Read or Write operations are needed for frequent operations that do not bear a data payload, such as GETATTR and LOOKUP, reducing the efficiency of the transport.
To reduce the need to use Long messages, RPC-over-RDMA version 2 increases the default size of inline thresholds. This also increases the maximum size of reverse-direction RPC messages.
In addition to a larger default inline threshold, RPC-over-RDMA version 2 introduces Message Continuation. Message Continuation is a mechanism that enables the transmission of a data payload using more than one RDMA Send. The purpose of Message Continuation is to provide relief in several important cases:
For general operation of NFS on open networks, we eventually intend to rely on RPC-on-TLS [citation needed] to provide cryptographic authentication of the two ends of each connection. In turn, this will improve the trustworthiness of AUTH_SYS-style user identities that flow on TCP, which are not cryptographic. We do not have a similar solution for RPC-over-RDMA, however.
Here, the RDMA transport layer already provides a strong guarantee of message integrity. On some network fabrics, IPsec can be used to protect the privacy of in-transit data, or TLS itself could be used for transporting raw RDMA operations. However, this is not the case for all fabrics (e.g., InfiniBand [IBA]).
Thus, it is sensible to add a mechanism in the RPC-over-RDMA transport itself for authenticating the connection peers. This mechanism is described in Section 5.2.6. And like GSS channel binding, there should also be a way to determine when the use of host authentication is superfluous and can be avoided.
An STag that is registered using the FRWR mechanism in a privileged execution context or is registered via a Memory Window in an unprivileged context may be invalidated remotely [RFC5040]. These mechanisms are available when a requester's RNIC supports MEM_MGT_EXTENSIONS.
For the purposes of this discussion, there are two classes of STags. Dynamically-registered STags are used in a single RPC, then invalidated. Persistently-registered STags live longer than one RPC. They may persist for the life of an RPC-over-RDMA connection, or longer.
An RPC-over-RDMA requester may provide more than one STag in one transport header. It may provide a combination of dynamically- and persistently-registered STags in one RPC message, or any combination of these in a series of RPCs on the same connection. Only dynamically-registered STags using Memory Windows or FRWR (i.e., registered via MEM_MGT_EXTENSIONS) may be invalidated remotely.
There is no transport-level mechanism by which a responder can determine how a requester-provided STag was registered, nor whether it is eligible to be invalidated remotely. A requester that mixes persistently- and dynamically-registered STags in one RPC, or mixes them across RPCs on the same connection, must therefore indicate which handles may be invalidated via a mechanism provided in the Upper-Layer Protocol. RPC-over-RDMA version 2 provides such a mechanism.
The RDMA Send With Invalidate operation is used to invalidate an STag on a remote system. It is available only when a responder's RNIC supports MEM_MGT_EXTENSIONS, and must be utilized only when a requester's RNIC supports MEM_MGT_EXTENSIONS (can receive and recognize an IETH).
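As a non-normative illustration, the following C fragment uses the libibverbs API to discover whether the local device implements the memory management extensions before Remote Invalidation is advertised or used.

<CODE BEGINS>
   #include <stdbool.h>
   #include <infiniband/verbs.h>

   /*
    * Report whether the local RNIC implements the memory management
    * extensions (FRWR and Send With Invalidate).
    */
   static bool
   rnic_supports_mem_mgt_ext(struct ibv_context *ctx)
   {
           struct ibv_device_attr attr;

           if (ibv_query_device(ctx, &attr))
                   return false;
           return (attr.device_cap_flags &
                   IBV_DEVICE_MEM_MGT_EXTENSIONS) != 0;
   }
<CODE ENDS>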
Existing RPC-over-RDMA transport protocol specifications [RFC8166] [RFC8167] do not forbid direct data placement in the reverse direction, even though there is currently no Upper-Layer Protocol that makes data items in reverse-direction operations eligible for direct data placement.
When chunks are present in a reverse direction RPC request, Remote Invalidation allows the responder to trigger invalidation of a requester's STags as part of sending a reply, the same way as is done in the forward direction.
However, in the reverse direction, the server acts as the requester, and the client is the responder. The server's RNIC, therefore, must support receiving an IETH, and the server must have registered the STags with an appropriate registration mechanism.
RPC-over-RDMA version 2 expands the repertoire of errors that may be reported by connection endpoints. This change, which is structured to enable extensibility, allows a peer to report overruns of specific resources and to avoid requester retries when an error is permanent.
The authors gratefully acknowledge the work of Brent Callaghan and Tom Talpey on the original RPC-over-RDMA version 1 specification (RFC 5666). The authors also wish to thank Bill Baker, Greg Marsden, and Matt Benjamin for their support of this work.
The XDR extraction conventions were first described by the authors of the NFS version 4.1 XDR specification [RFC5662]. Herbert van den Bergh suggested the replacement sed script used in this document.
Special thanks go to Transport Area Director Magnus Westerlund, NFSV4 Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4 Working Group Secretary Thomas Haynes for their support.