Network File System Version 4 D. Noveck
Internet-Draft HPE
Intended status: Standards Track December 3, 2016
Expires: June 6, 2017

RPC-over-RDMA Extensions to Reduce Internode Round-trips
draft-dnoveck-nfsv4-rpcrdma-rtrext-01

Abstract

It is expected that a future version of the RPC-over-RDMA transport will allow protocol extensions to be defined. This would provide for the specification of OPTIONAL features allowing participants who implement such features to cooperate as specified by that extension, while still interoperating with participants who do not support that extension.

A particular extension is described herein, whose purpose is to reduce the latency due to inter-node round-trips needed to effect operations which involve direct data placement or which transfer RPC messages longer than the fixed inline buffer size limit.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on June 6, 2017.

Copyright Notice

Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Preliminaries

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1.2. Introduction

This document describes a potential extension to the RPC-over-RDMA protocol, which would allow participating implementations to have more flexibility in how they use RDMA sends and receives to effect necessary transmission of RPC requests and replies.

In contrast to existing facilities defined in RPC-over-RDMA Version One in which the mapping between RPC messages and RPC-over-RDMA messages is strictly one-to-one and DDP is effected only through use of explicit RDMA operations, the following features are made available through this extension:

1.3. Prerequisites

This document is written assuming that certain underlying facilities will be made available to build upon, in the context of a future version of RPC-over-RDMA. It is most likely that such facilities will be first available in Version Two of RPC-over-RDMA.

As the document referred to above is currently a personal Internet Draft, and subject to change, adjustments to this document are expected to be necessary when and if the needed facilities are defined in one or more working group documents.

1.4. Role Terminology

A number of different terms are used regarding the roles of the two participants in an RPC-over-RDMA connection. Some of these roles last for the duration of a connection while others vary from request to request or from message to message.

The roles of the client and server are fixed for the lifetime of the connection, with the client defined as the endpoint which initiated the connection.

The roles of requester and responder often parallel those of client and server, although this is not always the case. Most requests are made in the forward direction, in which the client is the requester and the server is the responder. However, backward direction requests are possible, in which case the server is the requester and the client is the responder. As a result clients and servers may both act as requesters and responders for different requests issued on the same connection.

The roles of sender and receiver vary from message to message. With regard to the messages described in this document, the sender may act as a requester by sending RPC requests, as a responder by sending RPC replies, or as both at the same time by sending a mix of the two.

2. Extension Overview

This extension is intended to function as part of RPC-over-RDMA and implementations should successfully interoperate with existing RPC-over-RDMA Version One implementations. Nevertheless, this extension seeks to take a somewhat different approach to high-performance RPC operation than has been used previously in that it seeks to de-emphasize the use of explicit RDMA operations. It does this in two ways:

While use of explicit RDMA operations allows the cost of the actual data transfer to be offloaded from the client and server CPUs to the RNIC, there are ancillary costs in setting up the transfer that cannot be ignored. As a result, send-based functions are often preferable, since the RNIC also uses DMA to effect these operations. In addition, the cost of the additional inter-node round trips required by explicit RDMA operations can be an issue, which becomes increasingly troublesome as inter-node distances increase. Once one moves from in-machine-room to campus-wide or metropolitan-area distances, the additional round-trip delay of 16 microseconds per mile becomes an issue impeding use of explicit RDMA operations.
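The effect of distance on the cost of extra round trips can be illustrated with a little arithmetic. The following is only a sketch: the 16 microseconds-per-mile figure is taken from the paragraph above, while the function name and the sample distances are assumptions chosen for illustration.

```python
# Sketch only: the per-mile figure comes from the text above; the function
# name and the sample distances below are illustrative assumptions.
ROUND_TRIP_US_PER_MILE = 16


def added_latency_us(distance_miles, extra_round_trips):
    """Propagation delay, in microseconds, added by extra round trips."""
    return ROUND_TRIP_US_PER_MILE * distance_miles * extra_round_trips


if __name__ == "__main__":
    # Machine room, campus, and metropolitan-area distances (assumed values).
    for miles in (0.05, 2, 30):
        delay = added_latency_us(miles, 1)
        print(miles, "miles:", delay, "us per extra round trip")
```

At machine-room distances the added delay is well under a microsecond per round trip, while at metropolitan-area distances each additional round trip costs hundreds of microseconds.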

3. Direct Data Placement Features

3.1. Current Situation

Although explicit RDMA operations are used in the existing RPC-over-RDMA protocol for purposes unrelated to Direct Data Placement, all DDP is effected using explicit RDMA operations.

As a result, many operations involving Direct Data Placement involve multiple internode round trips.

3.2. RDMA_MSGP

Although this was not stated explicitly, it appears that RDMA_MSGP (defined in [RFC5666], removed from RPC-over-RDMA Version One by [rfc5666bis]) was an early attempt to effect correct placement of bulk data within a single RPC-over-RDMA transmission.

As things turned out, the fields within the RDMA_MSGP header were not described in [RFC5666] in a way that allowed this message type to be implemented.

In attempting to provide DDP functionality, we have to keep in mind and avoid the problems that led to failure of RDMA_MSGP. It appears that the problems go deeper than neglecting to write a few relevant sentences. It is helpful to note that:

To summarize, RDMA_MSGP was an attempt to properly place data which was thought of as a local optimization and insufficient attention was given to it to make it successful. As a result, as RPC-over-RDMA Version One was developed, Direct Data Placement was identified with the use of explicit RDMA operations, and the possibility of Data Placement within sends was not recognized.

3.3. Send-based DDP

In this extension we will describe a more complete way to provide send-based data placement, as follows:

3.4. Other DDP-Related Extensions

In order to support send-based DDP, new DDP-related data structures have been defined, as described in Sections 7.3 and 7.4.

These new data structures support both send-based and RDMA-operation-based DDP. In addition, because of the restructuring described in Section 7.1, a number of additional facilities are made available:

These additional facilities will be available to implementations that do not support send-based DDP, as long as both parties support the OPTIONAL header types that include these new structures. For more information about the relationships among the new transport properties, operations, and features, see Section 5.

4. Message Continuation Feature

4.1. Current Situation

Within RPC-over-RDMA Version One [rfc5666bis], each transmission of a request or reply involves sending a single RDMA send message and conversely each message-related transmission involves only a single RPC request or reply.

This strict one-to-one model leads to some potential performance issues.

4.2. Message Continuation Changes

Continuing a single RPC request or reply is addressed by defining separate optional header types to begin and to continue sending a single RPC message. This is instead of creating a header with a continuation bit. In this approach, all of the DDP-related fields, which include support for send-based DDP, appear in the starting header (of types ROPT_XMTREQ and ROPT_XMTRESP) and apply to the RPC message as a whole.

Later RPC-over-RDMA messages (of type ROPT_XMTCONT) may extend the payload stream and/or provide additional buffers to which bulk data can be directed.

In this case, all of the RPC-over-RDMA messages used together are referred to as a transmission group and must be received in order without any intervening message.

In implementations using this optional facility, those decoding RPC messages received using RPC-over-RDMA no longer have the assurance that each RPC message is in a contiguous buffer. As most XDR implementations are built based on the assumption that input will not be contiguous, this will not affect performance in most cases.
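The ordering rule for a transmission group, and the way a receiver might stitch non-contiguous pieces into one payload stream, can be pictured with the following sketch. It is not part of the protocol definition: the class name, the zero-based sequence numbering, and the idea that the receiver learns the group size from the starting header are all assumptions made for illustration.

```python
class GroupAssembler:
    """Concatenates the payload streams of one transmission group.

    Hypothetical helper: 'xid', 'seq' and 'count' stand in for values a
    receiver would take from the starting and continuation headers.
    """

    def __init__(self, xid, count):
        self.xid = xid
        self.count = count    # transmissions expected in the group
        self.next_seq = 0     # assumption: transmissions are numbered from zero
        self.parts = []

    def add(self, xid, seq, payload):
        """Accept the next transmission; return True when the group is complete."""
        # A transmission group must arrive in order with nothing in between.
        if xid != self.xid or seq != self.next_seq or seq >= self.count:
            raise ValueError("out-of-order or intervening transmission")
        self.parts.append(payload)
        self.next_seq += 1
        return self.next_seq == self.count

    def payload_stream(self):
        """Return the concatenated payload stream received so far."""
        return b"".join(self.parts)
```

A real XDR decoder would more likely walk the list of pieces in place rather than copy them into one buffer; the join is used here only to keep the sketch short.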

4.3. Message Continuation and Credits

Using multiple transmissions to send a single request or response can complicate credit management. In the case of the message continuation feature, deadlocks can be avoided because use of message continuation is not obligatory. The requester or responder can use explicit RDMA operations if sufficient credits to use message continuation are not available.

A requester is well positioned to make this choice with regard to the sending of requests. The requester must know, before sending a request, how long it will be, and therefore, how many credits it would require to send the request using message continuation. If these are not available, it can avoid message continuation by either creating read chunks sufficient to make the payload stream fit in a single transmission or by creating a position-zero read chunk.
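The requester's decision can be sketched as follows. This is an illustration under stated assumptions, not a normative algorithm: it assumes that each transmission in a group consumes one credit, that `per_send_capacity` is the number of payload-stream bytes that fit in one send after headers, and that at least one credit is available; the function names are hypothetical.

```python
import math


def transmissions_needed(payload_len, per_send_capacity):
    """Number of sends needed to carry payload_len bytes of payload stream."""
    return max(1, math.ceil(payload_len / per_send_capacity))


def choose_request_method(payload_len, per_send_capacity, credits_available):
    """Pick a way to send a request (assumes one credit per transmission)."""
    needed = transmissions_needed(payload_len, per_send_capacity)
    if needed == 1:
        return "single transmission"
    if needed <= credits_available:
        return "message continuation"
    # Not enough credits: fall back to explicit RDMA, as the text allows,
    # using read chunks or a position-zero read chunk.
    return "read chunk(s)"
```

Because the fallback is always available, a shortage of credits never forces the requester to wait for credits that might not arrive.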

With regard to the response, the requester is not in position to know exactly how long the response will be. However, the ULB will allow the maximum response length to be determined based on the request. This value can be used:

The requester can avoid doing the second of these if the responder has indicated it can use message continuation to send the response. In this case, it makes sure that the buffers will be available and indicates to the responder how many additional buffers (in the form of pre-posted receives) have been made available to accommodate continuation transmissions.

When the responder processes the request, those additional receive buffers may be used or not, or used only in part. This may be because the response is shorter than the maximum possible response, or because a reply chunk was used to transmit the response.

After the first or only transmission associated with the response is received by the requester, it can be determined how many of the additional buffers were used for the response. Any unused buffers can be made available for other uses, such as expanding the pool of receive buffers available for the initial transmissions of responses or for receiving opposite-direction requests. Alternatively, they can be kept in reserve for future uses, such as being made available to future requests which have potentially long responses.

5. Using Protocol Additions

In using existing RPC-over-RDMA facilities for protocol extension, interoperability with existing implementations needs to be assured. Because this document describes support for multiple features, we need to clearly specify the various possible extensions and how peers can determine whether certain facilities are supported by both ends of the connection.

5.1. New Operation Support

Note that most of the new operations defined in this extension are not tightly tied to a specific feature. ROPT_XMTREQ and ROPT_XMTRESP are designed to support implementations that support either or both of send-based DDP and message continuation. However, the converse is not the case and these header types can be implemented by those not supporting either of these features. For example, implementations may only need support for the facilities described in Section 3.4.

Implementations may determine whether a peer implementation supports ROPT_XMTREQ, ROPT_XMTRESP, or ROPT_XMTCONT by attempting these operations. An alternative is to interrogate the RTR Support Property for information about which operations are supported.

5.2. Message Continuation Support

Implementations may determine and act based on the level of peer implementation of support for message continuation as follows:

5.3. Send-based DDP Support

Implementations may determine and adapt to the level of peer implementation support for send-based DDP as described below. Note that an implementation may be able to send messages containing bulk data items placed using send-based DDP while not being prepared to receive them, or the reverse.

In determining whether bulk data will be placed using send-based DDP or via explicit RDMA operations, the level of support for message continuation will have a role. This is because DDP using explicit RDMA will reduce message size while send-based DDP reduces the size of the payload stream by rearranging the message, leaving the message size the same. As a result, the considerations discussed in Section 4.3 will have to be attended to by the sender in determining which form of DDP is to be used.

5.4. Error Reporting

The more extensive transport layer functionality described in this document requires its own means of reporting errors, to deal with issues that are distinct from:

Beyond the above, the following sorts of errors will have to be dealt with, depending on which of the features of the extension are implemented.

In each of the above cases, the problem will be reported to the sender using the Error Reporting operation which needs to be supported by every endpoint that sends ROPT_XMTREQ, ROPT_XMTRESP, or ROPT_XMTCONT. This includes cases in which the problem is one with a reply. The function of the Error Reporting operation is to aid in diagnosing transport protocol errors and allowing the sender to recover or decide recovery is not possible. Reporting failure to the requesting process is dealt with indirectly. For example,

6. XDR Preliminaries

6.1. Message Continuation Preliminaries

<CODE BEGINS>

typedef uint32  xms_grpxn;
typedef uint32  xms_grpxc;
struct xms_id {
        uint32         xmsi_xid;
        msg_type       xmsi_dir;
        xms_grpxn      xmsi_seq;
};

<CODE ENDS>
          

In order to implement message continuation, we have occasion to refer to particular RPC-over-RDMA transmissions within a transmission group or to characteristics of a later transmission group.

An xms_grpxn designates a particular RPC-over-RDMA transmission within a set of transmissions devoted to sending a single RPC message.

An xms_grpxc specifies the number of RPC-over-RDMA transmissions in a potential group of transmissions devoted to sending a single RPC message.
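As an illustration of how an xms_id appears on the wire, the sketch below encodes and decodes one using standard XDR rules (each field occupies four bytes, in big-endian order). The helper names are hypothetical; the msg_type values CALL = 0 and REPLY = 1 are those defined by the RPC protocol.

```python
import struct

CALL, REPLY = 0, 1   # msg_type values as defined by the RPC protocol


def encode_xms_id(xid, direction, seq):
    """XDR-encode an xms_id: uint32 xid, enum msg_type, uint32 sequence."""
    # XDR enums are encoded as signed 32-bit integers, hence the 'i'.
    return struct.pack(">IiI", xid, direction, seq)


def decode_xms_id(data):
    """Return (xmsi_xid, xmsi_dir, xmsi_seq) from 12 bytes of XDR data."""
    return struct.unpack(">IiI", data[:12])
```

For example, `encode_xms_id(0x12345678, REPLY, 2)` yields the twelve bytes `12 34 56 78 00 00 00 01 00 00 00 02`.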

6.2. Data Placement Preliminaries

<CODE BEGINS>

typedef uint32  xmddp_itemlen;
typedef uint32  xmddp_pldisp;
typedef uint32  xmddp_vsdisp;

typedef uint32  xmddp_tbsn;
 
enum xmddp_type {
        XMDTYPE_EXRW = 1,
        XMDTYPE_TBSN = 2,
        XMDTYPE_CHOICE = 3,
        XMDTYPE_BYSIZE = 4,
        XMDTYPE_TOOSHORT = 5,
        XMDTYPE_NOITEM = 6
};
                
<CODE ENDS>
          

Data structures related to data placement use a number of XDR typedefs to help clarify the meaning of fields in the data structures which use these typedefs.

An xmddp_itemlen specifies the length of an XDR item. Because items excised from the XDR stream are XDR items, lengths of items excised from the XDR stream are denoted by xmddp_itemlens.

An xmddp_pldisp specifies a displacement within the payload stream associated with a single RPC-over-RDMA transmission or a group of such transmissions. Note that when multiple transmissions are used for a single message, all of the payload streams within a transmission group are considered concatenated.

An xmddp_vsdisp specifies a displacement within the virtual XDR stream associated with the set of RPC messages transferred by a single RPC-over-RDMA transmission or a group of such transmissions. The virtual XDR stream includes bulk data excised from the payload stream and so displacements within it reflect those of the corresponding objects in the XDR stream that might be sent and received if no bulk data excision facilities were involved in the RPC transmission.
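The relationship between the two kinds of displacement can be sketched as follows. This is an illustration only: it assumes that each excised item is described by its displacement and length in the virtual XDR stream, that the length covers any XDR padding, and that nothing is left behind in the payload stream where the item was removed. The function name is hypothetical.

```python
def payload_displacement(vsdisp, excised):
    """Map a virtual-stream displacement to a payload-stream displacement.

    'excised' is a list of (disp, length) pairs, one per excised bulk data
    item, expressed in virtual-stream terms. Returns None when vsdisp falls
    inside an excised item, since that byte is not in the payload stream.
    """
    removed = 0
    for disp, length in sorted(excised):
        if vsdisp < disp:
            break                      # remaining items lie beyond vsdisp
        if vsdisp < disp + length:
            return None                # inside excised bulk data
        removed += length              # item lies wholly before vsdisp
    return vsdisp - removed
```

With a single 4096-byte item excised at virtual displacement 100, virtual displacement 50 maps to payload displacement 50, while virtual displacement 4196 (the first byte after the item) maps to payload displacement 100.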

An xmddp_tbsn designates a particular target buffer segment within a (trivial or non-trivial) RPC-over-RDMA transmission group. Each DDP-targetable buffer segment is assigned a number starting with zero and proceeding through all the buffer segments for all the RPC-over-RDMA transmissions in the group. This includes buffer segments not actually used because transmissions are shorter than the maximum size and those in which a DDP-targetable buffer segment is used to hold part of the payload XDR stream rather than bulk data.

An xmddp_type allows a selection between DDP using explicit RDMA operations and that using send-based DDP. It is used in a number of contexts. The specific context governs which subset of the types is valid:

A number of these types are valid in all of these contexts, since they specify use of a specific mode of direct placement which is to be used or has been used.

Another set of types is used to direct the use of specific sets of types but cannot specify an actual choice that has been made.

The following types are used when no actual direct placement has occurred. They are used in responses to indicate ways in which a direction to govern DDP in a reply was responded to without resulting in direct placement.

The following table indicates which of the above types is valid in each of the contexts in which these types may appear. For valid occurrences, it distinguishes those which give sender-generated information about the message, and those that direct reply construction, from those that indicate how those directions governed the construction of a reply. For invalid occurrences, we distinguish between those that result in XDR decode errors and those which are valid from the XDR point of view but are semantically invalid.

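+------------------+----------------------+-------------------------+-----------------------+
| Type             | xmddp_loc in request | xmddp_rsdloc in request | xmddp_loc in response |
+------------------+----------------------+-------------------------+-----------------------+
| XMDTYPE_EXRW     | Valid Info           | Valid Direction         | Valid Result          |
| XMDTYPE_TBSN     | Valid Info           | Valid Direction         | Valid Result          |
| XMDTYPE_BYSIZE   | XDR Invalid          | Valid Direction         | XDR Invalid           |
| XMDTYPE_CHOICE   | XDR Invalid          | Valid Direction         | XDR Invalid           |
| XMDTYPE_TOOSHORT | Sem. Invalid         | XDR Invalid             | Valid Result          |
| XMDTYPE_NOITEM   | Sem. Invalid         | XDR Invalid             | Valid Result          |
+------------------+----------------------+-------------------------+-----------------------+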

7. Data Placement Structures

7.1. Data Placement Overview

To understand the new DDP structure defined here, it is necessary to review the existing DDP structures used in RPC-over-RDMA Version One and look at the corresponding structures in the new message transmission headers defined in this document.

We look first at the existing structures.

Within the DDP structures defined here a different organization is used, even where DDP using explicit RDMA operations is supported.

Both sets of data structures are defined at the granularity of an RPC-over-RDMA transmission group. That is, they describe the placement of data within an RPC message and the scope of description is not limited to a single RPC-over-RDMA transmission.

7.2. Buffer Structure Definition

Buffer structure definition information is used to allow the sender to know how receive buffers are constructed, to allow it to appropriately pad messages being sent so that bulk data will be received into a memory area with the appropriate characteristics.

In this case, Direct Data Placement will not place data in a specific address, picked and registered in advance as is done to effect DDP using explicit RDMA operations. Instead, a message is sent so that when it is matched with one of the preposted receives, the bulk data will be received into a memory area with the appropriate characteristics, including:

<CODE BEGINS>

struct xmrbs_seg {
        uint32          xmrseg_length;
        uint32          xmrseg_align;
        uint32          xmrseg_flags;
};

const uint32    XMRSFLAG_DDP = 0x01;


struct xmrbs_group {
        uint32          xmrgrp_count;
        xmrbs_seg       xmrgrp_info;
};

struct xmrbs_buf {
        uint32          xmrbuf_length;
        xmrbs_group     xmrbuf_groups<>;
};

<CODE ENDS>
          

Buffers can be, and typically are, structured to contain multiple segments. Preposted receives that target a buffer use a scatter list to place received messages in successive buffer segments.

An xmrbs_seg defines a single buffer segment. The fields included are:

The following flag bit is the only one currently defined:

An xmrbs_group designates a set of buffer segments all with the same buffer segment characteristics, as indicated by xmrgrp_info. The buffer segments are contiguous within the buffer although they are likely not to be physically contiguous.

An xmrbs_buf defines a receiver's buffer structure and consists of multiple xmrbs_groups. This buffer structure, when made available as a transport property, allows the sender to structure transmissions so as to place DDP-eligible data in appropriate target buffer segments.
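As a sketch of how a sender might interpret a receiver's buffer structure, the code below expands a list of groups into individual segments and locates the DDP-targetable ones. It is illustrative only: groups are modelled as (xmrgrp_count, (xmrseg_length, xmrseg_align, xmrseg_flags)) tuples, the function names are hypothetical, and the use of a list index as the target buffer segment number applies only to a single transmission.

```python
XMRSFLAG_DDP = 0x01   # value taken from the XDR definition above


def expand_segments(groups):
    """Expand (count, (length, align, flags)) groups into per-segment tuples."""
    segments = []
    for count, seg in groups:
        segments.extend([seg] * count)
    return segments


def ddp_targets(groups):
    """Return (offset, length) of each DDP-targetable segment, in buffer order.

    The offset is the logical position within the receive buffer at which
    the segment begins; the index in the returned list would serve as the
    segment's number within a single transmission (an assumption here).
    """
    targets = []
    offset = 0
    for length, _align, flags in expand_segments(groups):
        if flags & XMRSFLAG_DDP:
            targets.append((offset, length))
        offset += length
    return targets
```

For a buffer made of one 1024-byte non-DDP segment followed by two 4096-byte DDP-targetable segments, the targets begin at offsets 1024 and 5120, so a sender wishing to place bulk data in the first of them would pad its message so that the bulk data starts 1024 bytes into the transmission.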

7.3. Message DDP Structures

<CODE BEGINS>

union xmddp_loc switch(xmddp_type type) {

        case XMDTYPE_EXRW:
                rpcrdma1_segment        xmdl_ex<>;
        case XMDTYPE_TBSN:
                xmddp_itemlen           xmdl_offset;
                xmddp_tbsn              xmdl_bsnum<>;
        case XMDTYPE_TOOSHORT:
        case XMDTYPE_NOITEM:
                void;
};


struct xmddp_mitem {
        xmddp_vsdisp    xmdmi_disp;
        xmddp_itemlen   xmdmi_length;
        xmddp_loc       xmdmi_where;
};

typedef xmddp_mitem     xmddp_grpinfo<>;
         
<CODE ENDS>
          

These data structures show where in the virtual XDR stream for the set of messages, data is to be excised from that XDR stream and where that excised bulk data is to be found instead.

An xmddp_loc shows where a particular piece of bulk data is located. This information exists in multiple forms.

An xmddp_mitem denotes a specific item of bulk data. It consists of:

An xmddp_grpinfo consists of an array of xmddp_mitems describing all of the bulk data excised from all RPC messages sent in a single RPC-over-RDMA transmission group. Some possible cases:

7.4. Response Direction DDP Structures

<CODE BEGINS>

union xmddp_rsdloc switch(xmddp_type type) {

        case XMDTYPE_EXRW:
        case XMDTYPE_CHOICE:
                rpcrdma1_segment        xmdrsdl_ex<>;
        case XMDTYPE_BYSIZE:
                xmddp_itemlen           xmdrsdl_dsdov;
                rpcrdma1_segment        xmdrsdl_bsex<>;
        case XMDTYPE_TBSN:
                void;
};

struct xmddp_rsdrange {
        xmddp_vsdisp    xmdrsdr_begin;
        xmddp_vsdisp    xmdrsdr_end;
};

struct xmddp_rsditem {
        xmddp_itemlen   xmdrsdi_minlen;
        xmddp_rsdloc    xmdrsdi_loc;
};

struct xmddp_rsdset {
        xmddp_rsdrange  xmdrsds_range;
        xmddp_rsditem   xmdrsds_items<>;
};

typedef xmddp_rsdset    xmddp_rsdgroup<>;

<CODE ENDS>
          

These data structures, when sent as part of the request, instruct the responder how to use Direct Data Placement to place response data subject to direct data placement.

An xmddp_rsdloc contains information specifying where bulk data generated as part of a reply is to be placed. This information is defined as a union with the following cases:

In all cases, each xmddp_rsdloc sent as part of a request has a corresponding xmddp_loc in the associated response. The xmddp_type specified in the request will affect the type in the response, but the types are not necessarily the same. The table below describes the valid combinations of request and response xmddp_type values.

In this table, rows correspond to types in requests directing the responder as to the desired placement in the response, while the columns correspond to types in the ensuing response. Invalid combinations are labelled "Inv." while valid combinations are labelled either "NDR", denoting no need to deregister memory, or "DR", to indicate that memory previously registered will need to be deregistered.

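+--------+------+------+----------+--------+
| Type   | EXRW | TBSN | TOOSHORT | NOITEM |
+--------+------+------+----------+--------+
| EXRW   | DR   | Inv. | DR       | DR     |
| TBSN   | Inv. | NDR  | NDR      | NDR    |
| CHOICE | DR   | NDR  | DR       | DR     |
| BYSIZE | DR   | NDR  | DR       | DR     |
+--------+------+------+----------+--------+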
An xmddp_rsdrange denotes a range of positions in the XDR stream associated with a request. Particular directions regarding bulk data in the corresponding response are limited to such ranges, where response XDR stream positions and request XDR stream positions can be reliably tied together.

When the ULP supports multiple individual operations per RPC request (e.g., COMPOUND and CB_COMPOUND in NFSv4), an xmddp_rsdrange can isolate elements of the reply due to particular operations.

An xmddp_rsditem specifies the handling of one potential item of bulk data. The handling specified is qualified by a length range. If the item is smaller than xmdrsdi_minlen, it is not treated as bulk data and the corresponding data item appears in the payload stream, while that particular xmddp_rsditem is considered used up, making the next xmddp_rsditem in the xmddp_rsdset the target of the next DDP-eligible data item in the reply. Note that in the case in which xmdrsdi_loc specifies use of explicit RDMA operations, the area specified is not used and the requester is responsible for deregistering it.
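The way successive xmddp_rsditems are consumed can be sketched as follows. This is a simplified model rather than a definition of responder behavior: it considers only the xmdrsdi_minlen check, ignores the placement mode selected by xmdrsdi_loc, and says nothing about data items for which no xmddp_rsditem remains. The function name is hypothetical.

```python
def direct_reply_items(minlens, item_lengths):
    """Pair each DDP-eligible reply item with the next xmddp_rsditem.

    minlens: the xmdrsdi_minlen of each xmddp_rsditem in the set, in order.
    item_lengths: lengths of the DDP-eligible data items in the reply, in
    order. Returns one outcome per pairing that could be made.
    """
    outcomes = []
    for minlen, length in zip(minlens, item_lengths):
        if length < minlen:
            # Too short: the item stays in the payload stream, and this
            # xmddp_rsditem is nevertheless considered used up.
            outcomes.append("payload stream")
        else:
            outcomes.append("direct placement")
    return outcomes
```

For instance, with two xmddp_rsditems each having a minimum length of 1024, a 100-byte item followed by an 8192-byte item results in the first item remaining in the payload stream and the second being placed directly using the second xmddp_rsditem.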

For each xmddp_rsditem, there will be a corresponding xmddp_mitem.

An xmddp_rsdset contains a set of xmddp_rsditems applicable to a given xmddp_rsdrange in the request.

An xmddp_rsdgroup designates a set of xmddp_rsdsets applicable to a particular RPC-over-RDMA transmission group. The xmdrsds_range fields of successive xmddp_rsdsets must be disjoint and in strictly increasing order.
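A receiver might check the ordering requirement on an xmddp_rsdgroup along the following lines. The sketch assumes that each range is half-open, covering positions from xmdrsdr_begin up to but not including xmdrsdr_end, and that empty ranges are not allowed; neither point is stated in this document. The function name is hypothetical.

```python
def ranges_well_ordered(ranges):
    """Check that successive ranges are disjoint and strictly increasing.

    ranges: list of (begin, end) pairs taken from the xmdrsds_range fields
    of successive xmddp_rsdsets, treated as half-open intervals (assumed).
    """
    prev_end = None
    for begin, end in ranges:
        if end <= begin:
            return False               # empty or inverted range (assumed invalid)
        if prev_end is not None and begin < prev_end:
            return False               # overlaps or precedes the previous range
        prev_end = end
    return True
```

Under these assumptions, the ranges (0, 10) and (10, 20) are acceptable, while (0, 10) followed by (5, 20) overlaps and would be rejected.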

8. Transport Properties

8.1. Property List

In this document we take advantage of the fact that the set of transport properties defined in [rpcrdmav2] is subject to later extension. The additional transport properties are summarized below in Table 3.

In that table the columns have the following values:

  • The column labeled "property" identifies the transport property described by the current row.
  • The column labeled "#" specifies the propid value used to identify this property.
  • The column labeled "XDR type" gives XDR type of the data used to communicate the value of this property. This data overlays the nominally opaque field pv_data in a propval.
  • The column labeled "default" gives the default value for the property which is to be assumed by those who do not receive, or are unable to interpret, information about the actual value of the property.
  • The column labeled "section" indicates the section (within this document) that explains the semantics and use of this transport property.

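+------------------------------------+---+-----------+---------+---------+
| property                           | # | XDR type  | default | section |
+------------------------------------+---+-----------+---------+---------+
| RTR Support                        | 3 | uint32    | 0       | 8.2     |
| Receive Buffer Structure           | 4 | xmrbs_buf | Note 1  | 8.3     |
| Request Transmission Receive Limit | 5 | xms_grpxc | 1       | 8.4     |
| Response Transmission Send Limit   | 6 | xms_grpxc | 1       | 8.5     |
+------------------------------------+---+-----------+---------+---------+

Table 3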

The following notes apply to the above table:

  1. The default value for the Receive Buffer Structure always consists of a single buffer segment, without any alignment restrictions and not targetable for DDP. The length of that buffer segment derives from the Receive Buffer Size Property if available, and from the default receive buffer size otherwise.

8.2. RTR Support Property

<CODE BEGINS>

const uint32           XPROP_RTRSUPP = 3;
typedef uint32         xpr_rtrs;

const uint32           RTRS_XREQ = 1;
const uint32           RTRS_XRESP = 2;
const uint32           RTRS_XCONT = 4;

<CODE ENDS>
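The values 1, 2, and 4 suggest that the RTRS_* constants are bit flags combined in a single xpr_rtrs value, although this document does not say so explicitly. Under that assumption, and the further assumption that each flag corresponds to the similarly named operation, a peer's property value could be interpreted as sketched below. The function name is hypothetical.

```python
RTRS_XREQ = 1
RTRS_XRESP = 2
RTRS_XCONT = 4


def supported_operations(rtrs):
    """List the operations a peer reports supporting (assumed bit flags)."""
    names = []
    if rtrs & RTRS_XREQ:
        names.append("ROPT_XMTREQ")
    if rtrs & RTRS_XRESP:
        names.append("ROPT_XMTRESP")
    if rtrs & RTRS_XCONT:
        names.append("ROPT_XMTCONT")
    return names
```

With the default value of zero, no operations are reported as supported, which matches the behavior expected of a peer that does not implement this extension.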
          

8.3. Receive Buffer Structure Property

<CODE BEGINS>

const uint32           XPROP_RBSTRUCT = 4;
typedef xmrbs_buf      xpr_rbs;

<CODE ENDS>
          

This property defines the structure of the endpoint's receive buffers, in order to give a sender the ability to place bulk data in specific DDP-targetable buffer segments.

Normally, this property, if specified, should be in agreement with the Receive Buffer Size Property. However, the following rules apply.

  • If the value of the Receive Buffer Structure Property is not specified, it is derived from the Receive Buffer Size Property, if known, and the default buffer size otherwise. The buffer is considered to consist of a single non-DDP-targetable segment whose size is the buffer size.
  • If the value of the Receive Buffer Size Property is not specified and the Receive Buffer Structure Property is specified, the value of the former is derived from the latter, by adding up the lengths of all buffer segments specified.
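The two derivation rules above can be sketched as follows. Groups are modelled as (xmrgrp_count, (xmrseg_length, xmrseg_align, xmrseg_flags)) tuples; the function names are hypothetical, and the use of zero to mean "no alignment restriction" is an assumption.

```python
def derive_buffer_size(groups):
    """Derive the receive buffer size by summing all segment lengths."""
    return sum(count * length for count, (length, _align, _flags) in groups)


def default_structure(buffer_size):
    """Derive the structure from a size: one non-DDP-targetable segment."""
    # Alignment 0 is assumed here to mean "no alignment restriction";
    # flags 0 means the segment is not targetable for DDP.
    return [(1, (buffer_size, 0, 0))]
```

The two rules are consistent with one another: deriving a structure from a size and then deriving a size from that structure gives back the original size.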

8.4. Request Transmission Receive Limit Property

<CODE BEGINS>

const uint32           XPROP_REQRXLIM = 5;
typedef uint32         xpr_rqrxl;

<CODE ENDS>
         

This property specifies the length of the longest request message (in terms of number of transmissions) that a responder will accept.

A requester can use this property to determine whether to send long requests by using message continuation or by using a position-zero read chunk.

8.5. Response Transmission Send Limit Property

<CODE BEGINS>

const uint32           XPROP_RESPSXLIM = 6;
typedef uint32         xpr_rssxl;

<CODE ENDS>
          

This property specifies the length of the longest response message (in terms of number of transmissions) that a responder will generate.

9. New Operations

9.1. Operations List

The proposed new operations are set forth in Table 4 below. In that table, the columns have the following values:

  • The column labeled "operation" specifies the particular operation.
  • The column labeled "#" specifies the value of opttype for this operation.
  • The column labeled "XDR type" gives XDR type of the data structure used to describe the information in this new message type. This data overlays the nominally opaque field optinfo in an RDMA_OPTIONAL message.
  • The column labeled "msg" indicates whether this operation is followed (or not) by an RPC message payload (or something else).
  • The column labeled "section" indicates the section (within this document) that explains the semantics and use of this optional operation.

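+-------------------+---+-------------+--------+---------+
| operation         | # | XDR type    | msg    | section |
+-------------------+---+-------------+--------+---------+
| Transmit Request  | 5 | optxmt_req  | Note 1 | 9.2     |
| Transmit Response | 6 | optxmt_resp | Note 1 | 9.3     |
| Transmit Continue | 7 | optxmt_cont | Note 2 | 9.4     |
| Report Error      | 8 | optrept_err | No.    | 9.5     |
+-------------------+---+-------------+--------+---------+

Table 4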

The following notes apply to the above table:

  1. Contains an initial segment of the message payload stream for an RPC message, or the entire payload stream. The optxr[qs]_pslen field indicates the length of the section present.
  2. May contain a part of a message payload stream for an RPC message, although not the entire payload stream. The optxc_pslen field, if non-zero, indicates that this portion is present, and the length of the section.

9.2. Transmit Request Operation

<CODE BEGINS>

const uint32     ROPT_XMTREQ = 1;

struct optxmt_req {
        xmddp_grpinfo   optxrq_ddp;
        xmddp_rsdgroup  optxrq_rsd;
        xms_grpxc       optxrq_count;
        xms_grpxc       optxrq_rsbuf;
        xmddp_pldisp    optxrq_pslen;

};        
 
<CODE ENDS>
          

The message definition for this operation is as follows:

The field optxrq_ddp describes the fields in virtual XDR stream which have been excised in forming the payload stream, and information about where the corresponding bulk data is located.

The field optxrq_rsd consists of information directing the responder as to how to construct the reply, in terms of DDP. of length zero.

The field optxrq_count specifies the count of transmissions in this group of transmissions used to send a request.

The field optrq_repch provides a way to transfer a reply chunk to the responder, so that a reply longer than the inline size limit may be transferred. Although not prohibited by the protocol, it is unlikely to be used in environments in which message continuation is supported.

The field optxrq_pslen gives the length of the payload stream for the RPC transmitted. The payload stream begins right after the end of the optxmt_req and proceeds for optxrq_pslen bytes. This can include crossing buffer segment boundaries.
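
As an illustration of how a receiver might locate the payload stream, the following Python sketch gathers the payload bytes that follow the header, crossing buffer segment boundaries as needed. It is a minimal sketch under stated assumptions and not part of the protocol definition: the names extract_payload, segments, and header_len are hypothetical, the received transmission is assumed to be available as an ordered list of byte strings (one per buffer segment), and header_len is assumed to be the number of bytes up to the end of the encoded optxmt_req.

```python
def extract_payload(segments, header_len, pslen):
    """Return the pslen bytes that follow the first header_len bytes.

    segments   -- ordered list of bytes objects, one per buffer segment
                  (hypothetical representation of a received transmission)
    header_len -- bytes occupied by everything up to the end of optxmt_req
    pslen      -- value of the payload stream length field
    """
    out = bytearray()
    skip = header_len          # header bytes still to be skipped
    need = pslen               # payload bytes still to be collected
    for seg in segments:
        if skip >= len(seg):   # segment lies entirely within the header
            skip -= len(seg)
            continue
        chunk = seg[skip:skip + need]
        skip = 0
        out += chunk
        need -= len(chunk)
        if need == 0:
            break
    if need:
        # The payload area is not a subset of the actual transmission.
        raise ValueError("payload stream extends past the transmission")
    return bytes(out)
```

In this sketch, the ValueError case corresponds to the kind of condition that Section 9.5 describes for OPTRERR_BADPL.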

9.3. Transmit Response Operation

The message definition for this operation is as follows:

<CODE BEGINS>

const uint32     ROPT_XMTRESP = 2;

struct optxmt_resp {
        xmddp_grpinfo   optxrs_ddp;
        xms_grpxn       optxrs_count;
        xmddp_pldisp    optxrs_pslen;
};

<CODE ENDS>

The field optxrs_ddp describes the fields in the virtual XDR stream which have been excised in forming the payload stream, and provides information about where the corresponding bulk data is located.

The field optxrs_count specifies the count of transmissions in this group of transmissions used to send a reply.

The field optxrs_pslen gives the length of the payload stream for the RPC transmitted. The payload stream begins right after the end of the optxmt_resp and proceeds for optxrs_pslen bytes. This can include crossing buffer segment boundaries.

9.4. Transmit Continue Operation

RPC-over-RDMA headers of this type are used to continue RPC messages begun by an RPC-over-RDMA message of type ROPT_XMTREQ or ROPT_XMTRESP. The xid field of this message must match that in the initial transmission.

This operation needs to be supported for the message continuation feature to be used.

The message definition for this operation is as follows:

<CODE BEGINS>

const uint32     ROPT_XMTCONT = 3;

struct optxmt_cont {
        xms_grpxn       optxc_xnum;
        uint32          optxc_itype;
        xmddp_pldisp    optxc_pslen;
};

<CODE ENDS>

The field optxc_xnum indicates the transmission number of this transmission within its transmission group.

The field optxc_pslen gives the length of the section of the payload stream which is located in the current RPC-over-RDMA transmission. It is valid for this length to be zero, indicating that there is no portion of the payload stream in this transmission. Except when the length is zero, the payload stream begins right after the end of the optxmt_cont and proceeds for optxc_pslen bytes. This can include crossing buffer segment boundaries. In any case, the payload streams for all transmissions within the same group are considered concatenated.
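
The encoding of optxmt_cont and the concatenation rule can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than a definitive implementation: the helper names encode_cont, decode_cont, and reassemble are hypothetical, only the optxmt_cont structure and the payload section that follows it are modeled (the enclosing RPC-over-RDMA header is omitted), each transmission is treated as one contiguous byte string, and the payload sections are assumed to be concatenated in increasing order of optxc_xnum. The three uint32 fields are encoded as XDR unsigned integers, that is, four bytes each in big-endian order [RFC4506].

```python
import struct

# optxc_xnum, optxc_itype, optxc_pslen as three XDR uint32 values.
_CONT = struct.Struct(">III")


def encode_cont(xnum, itype, payload):
    """Encode an optxmt_cont followed by its payload section."""
    return _CONT.pack(xnum, itype, len(payload)) + payload


def decode_cont(data):
    """Return (xnum, itype, payload) from an encoded continuation."""
    xnum, itype, pslen = _CONT.unpack_from(data, 0)
    payload = data[_CONT.size:_CONT.size + pslen]
    if len(payload) != pslen:
        raise ValueError("optxc_pslen exceeds the transmission")
    return xnum, itype, payload


def reassemble(initial_payload, continuations):
    """Concatenate the payload sections of a transmission group.

    initial_payload -- payload section of the initial transmission
    continuations   -- encoded continuations, in any arrival order
    """
    parts = sorted((decode_cont(c) for c in continuations),
                   key=lambda part: part[0])
    return initial_payload + b"".join(part[2] for part in parts)
```

A continuation whose optxc_pslen is zero decodes to an empty payload section and contributes nothing to the reassembled stream.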

9.5. Error Reporting Operation

This RPC-over-RDMA message type is used to signal the occurrence of errors that do not involve:

  1. Transmission of a message that violates the rules specified in [rpcrdmav2].
  2. Transmission of a message described in this document which does not conform to the XDR specified here.
  3. The transmission of a message, which, when assembled according to the rules here, cannot be decoded according to the XDR for the ULP.

Such errors can arise if the rules specified in this document are not followed and can be the result of a mismatch between multiple transmissions, each of which is valid when considered on its own.

The preliminary error-related definition is as follows:

<CODE BEGINS>

enum optr_err {
        OPTRERR_BADHMT = 1,
        OPTRERR_BADOMT = 2,
        OPTRERR_BADCONT = 3,
        OPTRERR_BADSEQ = 4,
        OPTRERR_BADXID = 5,
        OPTRERR_BADOFF = 6,
        OPTRERR_BADTBSN = 7,
        OPTRERR_BADPL = 8
};

union optr_info switch(optr_err optre_which) {

  case OPTRERR_BADHMT:
  case OPTRERR_BADOMT:
  case OPTRERR_BADSEQ:
  case OPTRERR_BADXID:
        uint32          optri_expect;
        uint32          optri_current;

  case OPTRERR_BADCONT:
        void;

  case OPTRERR_BADTBSN:
  case OPTRERR_BADOFF:
  case OPTRERR_BADPL:
        uint32          optri_value;
        uint32          optri_min;
        uint32          optri_max;

};

<CODE ENDS>

optr_err enumerates the various error conditions that might be reported.

  • OPTRERR_BADHMT indicates that a header message type other than the one expected was received. In this context, a particular message type can be considered "expected" only because of message or group continuation.
  • OPTRERR_BADOMT indicates that an optional message type other than the one expected was received. In this context, a particular message type can be considered "expected" only because of message or group continuation.
  • OPTRERR_BADCONT indicates that a continuation message was received when there was no reason to expect one.
  • OPTRERR_BADSEQ indicates that a transmission sequence number other than the one expected was received.
  • OPTRERR_BADXID indicates that an xid other than the one expected was received in a continuation context.
  • OPTRERR_BADTBSN indicates that an invalid target buffer sequence number was received.
  • OPTRERR_BADOFF indicates that a bad offset was received as part of an xmddp_loc. This is typically because the offset is larger than the buffer segment size.
  • OPTRERR_BADPL indicates that a bad value was received for the payload length. This is typically because the length would make the area devoted to the payload stream not a subset of the actual transmission.

The optr_info gives information about the specific invalid field being reported. The additional information given depends on the specific error.

  • For the errors OPTRERR_BADHMT, OPTRERR_BADOMT, OPTRERR_BADSEQ, and OPTRERR_BADXID, the expected and actual values of the field are reported.
  • For the error OPTRERR_BADCONT, no additional information is provided.
  • For the errors OPTRERR_BADTBSN, OPTRERR_BADOFF, and OPTRERR_BADPL, the actual value together with a range of valid values is provided. When the actual value is within the valid range, it can be inferred that the actual value is not properly aligned (e.g. not on a 32-bit boundary).
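
The three cases above can be illustrated by a Python sketch of a decoder for optr_info. This is a sketch under stated assumptions, not a definitive implementation: the function name decode_optr_info and the dictionary keys are hypothetical, and the fields listed in each arm of the union are assumed to be encoded consecutively as XDR unsigned integers (four bytes each, big-endian) following the four-byte discriminant.

```python
import struct

# Values of optr_err, copied from the XDR in this section.
OPTRERR_BADHMT, OPTRERR_BADOMT, OPTRERR_BADCONT, OPTRERR_BADSEQ = 1, 2, 3, 4
OPTRERR_BADXID, OPTRERR_BADOFF, OPTRERR_BADTBSN, OPTRERR_BADPL = 5, 6, 7, 8

# Errors reporting expected and actual values.
MISMATCH_ERRORS = {OPTRERR_BADHMT, OPTRERR_BADOMT,
                   OPTRERR_BADSEQ, OPTRERR_BADXID}
# Errors reporting an actual value and a range of valid values.
RANGE_ERRORS = {OPTRERR_BADTBSN, OPTRERR_BADOFF, OPTRERR_BADPL}


def decode_optr_info(data):
    """Decode an encoded optr_info into a dictionary."""
    (which,) = struct.unpack_from(">I", data, 0)
    if which in MISMATCH_ERRORS:
        expect, current = struct.unpack_from(">II", data, 4)
        return {"which": which, "expect": expect, "current": current}
    if which == OPTRERR_BADCONT:
        return {"which": which}          # void arm: nothing follows
    if which in RANGE_ERRORS:
        value, low, high = struct.unpack_from(">III", data, 4)
        return {"which": which, "value": value, "min": low, "max": high,
                # In range yet rejected: infer a misaligned value.
                "misaligned": low <= value <= high}
    raise ValueError("unknown optr_err value: %d" % which)
```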

The message definition for this operation is as follows:

<CODE BEGINS>

const uint32     ROPT_REPTERR = 4;

struct optrept_err {
        xms_id          optre_bad;
        xms_id          *optre_lead;
        optr_info       optre_info;
};

<CODE ENDS>

The field optre_bad is a description of the transmission on which the error was actually detected.

The optional field optre_lead is a description of an earlier transmission that might have led to the error reported.

The field optre_info provides information about the

10. XDR

This section contains an XDR [RFC4506] description of the proposed extension.

This description is provided in a way that makes it simple to extract into ready-to-use form. The reader can apply the following shell script to this document to produce a machine-readable XDR description of the extension, which can be combined with XDR for the base protocol to produce an XDR that includes the base protocol together with the optional extensions.

<CODE BEGINS>

#!/bin/sh
grep '^ *///' | sed 's?^ */// ??' | sed 's?^ *///$??'

<CODE ENDS>

That is, if the above script is stored in a file called "extract.sh" and this document is in a file called "ext.txt" then the reader can do the following to extract an XDR description file for this extension:

<CODE BEGINS>

sh extract.sh < ext.txt > xmitext.x

<CODE ENDS>
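
For readers without a POSIX shell, the extraction performed by extract.sh can also be sketched in Python. The function name extract_xdr is hypothetical, and the sketch assumes the same conventions as the script: only lines whose first non-blank characters are "///" are kept, the marker and one following space are removed, and a bare marker becomes an empty line.

```python
import re


def extract_xdr(text):
    """Return the XDR lines marked with '///' in an Internet-Draft."""
    out = []
    for line in text.splitlines():
        if not re.match(r" *///", line):
            continue                             # not part of the XDR
        line = re.sub(r"^ */// ", "", line)      # marker plus one space
        line = re.sub(r"^ *///$", "", line)      # bare marker: blank line
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")
```

A caller might read the document with open("ext.txt").read() and write the result to "xmitext.x", mirroring the shell usage.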

The XDR description for this extension can be combined with that for other extensions and that for the base protocol. While this is a complete description and can be processed by the XDR compiler, the result might not be usable to process the extended protocol, for a number of reasons:

  • The RPC-over-RDMA transport headers do not constitute an RPC program; version negotiation and message selection are part of the XDR, rather than being external to it.
  • Headers used for requests and replies are not necessarily paired, as they would be in an RPC program.
  • Header types defined as optional extensions overlay existing nominally opaque fields in the base protocol. While this overlay architecture allows code aware of the overlay relationships to have a more complete view of header structure, this overlay relationship cannot be expressed within the XDR language.

10.1. Code Component License


<CODE BEGINS>

/// /*
///  * Copyright (c) 2010, 2016 IETF Trust and the persons
///  * identified as authors of the code.  All rights reserved.
///  *
///  * The author of the code is: D. Noveck.
///  *
///  * Redistribution and use in source and binary forms, with
///  * or without modification, are permitted provided that the
///  * following conditions are met:
///  *
///  * - Redistributions of source code must retain the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer.
///  *
///  * - Redistributions in binary form must reproduce the above
///  *   copyright notice, this list of conditions and the
///  *   following disclaimer in the documentation and/or other
///  *   materials provided with the distribution.
///  *
///  * - Neither the name of Internet Society, IETF or IETF
///  *   Trust, nor the names of specific contributors, may be
///  *   used to endorse or promote products derived from this
///  *   software without specific prior written permission.
///  *
///  *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS
///  *   AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
///  *   WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
///  *   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
///  *   FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO
///  *   EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
///  *   LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
///  *   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
///  *   NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
///  *   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
///  *   INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
///  *   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
///  *   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
///  *   IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
///  *   ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///  */

<CODE ENDS>

          

Code components extracted from this document must include the license text above. When the extracted XDR code is combined with other complementary XDR code which itself has an identical license, only a single copy of the license text need be preserved.

10.2. XDR Proper for Extension



<CODE BEGINS>
/// /*******************************************************************
///  *******************************************************************
///  ** 
///  **  XDR for OPTIONAL protocol extension.
///  ** 
///  **  Includes support for both message continuation and send-based 
///  **  DDP. The latter is supported by a new structure for the 
///  **  specification of data placements which can be used for both 
///  **  send-based DDP and DDP using explicit RDMA operations.
///  ** 
///  **  Extensions include:
///  ** 
///  **     o Four new transport properties.
///  **     o Four new OPTIONAL message types
///  **     
///  *******************************************************************
///  ******************************************************************/
///
/// /*******************************************************************
///  *
///  *                   Core XDR Definitions
///  *
///  ******************************************************************/

/// /* 
///  * General XDR preliminaries for these features.
///  */
/// typedef uint32  xms_grpxn;
/// typedef uint32  xms_grpxc;
/// 
/// /* 
///  * Basic XDR typedefs for the new approach to DDP Specification.
///  */
/// typedef uint32  xmddp_itemlen;
/// typedef uint32  xmddp_pldisp;
/// typedef uint32  xmddp_vsdisp;
/// typedef uint32  xmddp_tbsn;
///  
/// /* 
///  * Define the possible types of DDP items.
///  */
/// enum xmddp_type {
///         XMDTYPE_EXRW = 1,
///         XMDTYPE_TBSN = 2,
///         XMDTYPE_CHOOSE = 3,
///         XMDTYPE_BYSIZE = 4,
///         XMDTYPE_TOOSHORT = 5,
///         XMDTYPE_NOITEM = 6
/// };
/// 
/// /*
///  * XDR defining the placement of bulk items in the message being
///  * sent.
///  */
/// union xmddp_loc switch(xmddp_type type) {
/// 
///         case XMDTYPE_EXRW:
///                 rpcrdma1_segment        xmdl_ex<>;
///         case XMDTYPE_TBSN:
///                 xmddp_itemlen           xmdl_offset;
///                 xmddp_tbsn              xmdl_bsnum<>;
///         case XMDTYPE_TOOSHORT:
///         case XMDTYPE_NOITEM:
///                 void;
/// };
/// 
/// 
/// 
/// struct xmddp_mitem {
///         xmddp_vsdisp    xmdmi_disp;
///         xmddp_itemlen   xmdmi_length;
///         xmddp_loc       xmdmi_where;
/// };
/// 
/// typedef xmddp_mitem     xmddp_grpinfo<>;
/// 
/// /*
///  * XDR defining the placement of bulk items in the response to the
///  * message being sent.
///  */
/// union xmddp_rsdloc switch(xmddp_type type) {
/// 
///         case XMDTYPE_EXRW:
///         case XMDTYPE_CHOOSE:
///                 rpcrdma1_segment        xmdrsdl_ex<>;
///         case XMDTYPE_BYSIZE:
///                 xmddp_itemlen           xmdrsdl_dsdov;
///                 rpcrdma1_segment        xmdrsdl_bsex<>;
///         case XMDTYPE_TBSN:
///                 void;
/// };
/// 
/// struct xmddp_rsdrange {
///         xmddp_vsdisp    xmdrsdr_begin;
///         xmddp_vsdisp    xmdrsdr_end;
/// };
/// 
/// struct xmddp_rsditem {
///         xmddp_itemlen   xmdrsdi_minlen;
///         xmddp_rsdloc    xmdrsdi_loc;
/// };
/// 
/// struct xmddp_rsdset {
///         xmddp_rsdrange  xmdrsds_range;
///         xmddp_rsditem   xmdrsds_items<>;
/// };
/// 
/// typedef xmddp_rsdset    xmddp_rsdgroup<>;
///
/// /*******************************************************************
///  *
///  *                     New Transport Properties           
///  *
///  ******************************************************************/
///
/// /* 
///  * New Transport Property codes 
///  */ 
/// const uint32           XPROP_RTRSUPP = 3;
/// const uint32           XPROP_RBSTRUCT = 4;
/// const uint32           XPROP_REQRXLIM = 5;
/// const uint32           XPROP_RESPSXLIM = 6;
/// 
/// /*
///  * XDR relating to RTR Support Property
///  */
/// typedef uint32         xpr_rtrs;
/// 
/// const uint32           RTRS_XREQ = 1;
/// const uint32           RTRS_XRESP = 2;
/// const uint32           RTRS_XCONT = 4;
/// 
/// /* 
///  * Items related to Receive Buffer Structure Property
///  */
/// struct xmrbs_seg {
///         uint32          xmrseg_length;
///         uint32          xmrseg_align;
///         uint32          xmrseg_flags;
/// };
/// 
/// const uint32    XMRSFLAG_DDP = 0x01; 
/// 
/// struct xmrbs_group {
///         uint32          xmrgrp_count;
///         xmrbs_seg       xmrgrp_info;
/// };
/// 
/// struct xmrbs_buf {
///         uint32          xmrbuf_length;
///         xmrbs_group     xmrbuf_groups<>;
/// };
/// typedef xmrbs_buf      xpr_rbs;
/// 
/// /*
///  * XDR relating to transmission limit properties
///  */
/// typedef uint32         xpr_rqrxl;
/// 
/// typedef uint32         xpr_rssxl;
///
/// /*******************************************************************
///  *
///  *                     New OPTIONAL Message Types           
///  *
///  ******************************************************************/
///
/// /* 
///  * New message type codes 
///  */ 
/// const uint32     ROPT_XMTREQ = 1;
/// const uint32     ROPT_XMTRESP = 2;
/// const uint32     ROPT_XMTCONT = 3;
/// const uint32     ROPT_REPTERR = 4;
/// 
/// 
/// /*
///  * New message type to do the initial transmission of a request.
///  */ 
/// struct optxmt_req {
///         xmddp_grpinfo   optxrq_ddp;
///         xmddp_rsdgroup  optxrq_rsd;
///         xms_grpxc       optxrq_count;
///         xms_grpxc       optxrq_rsbuf;
///         xmddp_pldisp    optxrq_pslen;
/// 
/// };        
///  
/// /*
///  * New message type to do the initial transmission of a response.
///  */ 
/// struct optxmt_resp {
///         xmddp_grpinfo   optxrs_ddp;
///         xms_grpxn       optxrs_count;
///         xmddp_pldisp    optxrs_pslen;
/// 
/// };        
/// 
/// /*
///  * New message type to transmit the continuation of a request or
///  * response.
///  */ 
/// struct optxmt_cont {
///         xms_grpxn       optxc_xnum;
///         uint32          optxc_itype;
///         xmddp_pldisp    optxc_pslen;
/// };        
/// 
/// /*
///  * XDR definitions to support error reporting.
///  */ 
/// enum optr_err {
///         OPTRERR_BADHMT = 1,
///         OPTRERR_BADOMT = 2,
///         OPTRERR_BADCONT = 3,
///         OPTRERR_BADSEQ = 4,
///         OPTRERR_BADXID = 5,
///         OPTRERR_BADOFF = 6,
///         OPTRERR_BADTBSN = 7,
///         OPTRERR_BADPL = 8
/// };
/// 
/// union optr_info switch(optr_err optre_which) {
/// 
///   case OPTRERR_BADHMT:
///   case OPTRERR_BADOMT:
///   case OPTRERR_BADSEQ:
///   case OPTRERR_BADXID:
///         uint32          optri_expect;
///         uint32          optri_current;
/// 
///   case OPTRERR_BADCONT:
///         void;
/// 
/// 
///   case OPTRERR_BADTBSN:
///   case OPTRERR_BADOFF:
///   case OPTRERR_BADPL:
///         uint32          optri_value;
///         uint32          optri_min;
///         uint32          optri_max;
/// 
/// };
/// 
/// struct xms_id {
///         uint32         xmsi_xid;
///         msg_type       xmsi_dir;
///         xms_grpxn      xmsi_seq;
/// };
///
/// /*
///  * New message type for error reporting.
///  */ 
/// struct optrept_err {
///         xms_id          optre_bad;
///         xms_id          *optre_lead;
///         optr_info       optre_info;
/// };        
///  
/// 
<CODE ENDS>
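
The RTRS_XREQ, RTRS_XRESP, and RTRS_XCONT constants in the XDR above are distinct powers of two, which suggests that an xpr_rtrs value is a mask of flag bits. The following Python sketch assumes, without this being stated in the XDR itself, that each bit indicates support for the corresponding operation. The function names supported_ops and continuation_usable are hypothetical.

```python
# Flag values copied from the XDR above.
RTRS_XREQ = 1
RTRS_XRESP = 2
RTRS_XCONT = 4

_FLAG_NAMES = [(RTRS_XREQ, "XREQ"), (RTRS_XRESP, "XRESP"),
               (RTRS_XCONT, "XCONT")]


def supported_ops(rtrs):
    """List the names of the flag bits set in an xpr_rtrs value."""
    return [name for bit, name in _FLAG_NAMES if rtrs & bit]


def continuation_usable(rtrs):
    """Assumed check: continuation requires the XCONT bit to be set."""
    return bool(rtrs & RTRS_XCONT)
```

Under that assumption, a value of 5 would indicate support for Transmit Request and Transmit Continue but not Transmit Response.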
          

11. Security Considerations

The extension described has the same security considerations as those described in [rfc5666bis] and [rpcrdmav2]. With regard to the transport properties introduced in this document, it is possible that a man-in-the-middle could interfere with the communication of transport properties, with possible negative effects. To prevent such interference, the steps described in [rpcrdmav2] should be attended to.

The use of the techniques described in this document to reduce use of explicit RDMA operations raises important issues which implementers should consider:

  • While the use of these techniques may be expedient in certain cases, their use is not likely to be universal, at least for a considerable time. As a result, implementers should remain aware of the issues discussed in Section 9.1 of [rfc5666bis], unless and until it is certain that none of a requester's memory can be registered for remote access.
  • Extra care needs to be taken in cases in which padding needs to be inserted in a transmission to ensure that a DDP-targetable data item will be received in an appropriately aligned buffer segment. In some implementations, sensitive data could be inadvertently sent within the padding. To prevent this, the padding can be zeroed or it can be sent from a pre-zeroed area using a gather list.
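
The zeroed-padding approach mentioned in the last item can be sketched in Python as follows. The names zero_padding and build_transmission are hypothetical, and the sketch assumes that alignment is expressed as a byte multiple measured from the start of the transmission. The padding is built from freshly allocated zero bytes rather than from whatever memory happens to follow the header, so no stale data can leak into it.

```python
def zero_padding(offset, alignment):
    """Return the zero bytes needed to advance offset to a multiple
    of alignment (an empty string if it is already aligned)."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return bytes((-offset) % alignment)   # bytes(n) is n zero bytes


def build_transmission(header, item, alignment):
    """Place item after header, separated by explicit zero padding."""
    return header + zero_padding(len(header), alignment) + item
```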

12. IANA Considerations

This document does not require any actions by IANA.

13. References

13.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC4506] Eisler, M., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2006.
[rfc5666bis] Lever, C., Simpson, W. and T. Talpey, "Remote Direct Memory Access Transport for Remote Procedure Call", November 2016. Work in progress.

13.2. Informative References

[RFC5662] Shepler, S., Eisler, M. and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 External Data Representation Standard (XDR) Description", RFC 5662, DOI 10.17487/RFC5662, January 2010.
[RFC5666] Talpey, T. and B. Callaghan, "Remote Direct Memory Access Transport for Remote Procedure Call", RFC 5666, DOI 10.17487/RFC5666, January 2010.
[RFC5667] Talpey, T. and B. Callaghan, "Network File System (NFS) Direct Data Placement", RFC 5667, DOI 10.17487/RFC5667, January 2010.
[rpcrdmav2] Lever, C. and D. Noveck, "RPC-over-RDMA Version Two", December 2016. Work in progress.

Appendix A. Acknowledgements

The author gratefully acknowledges the work of Brent Callaghan and Tom Talpey in producing the original RPC-over-RDMA Version One specification [RFC5666] and also Tom's work in helping to clarify that specification.

The author also wishes to thank Chuck Lever for his work resurrecting NFS support for RDMA in [rfc5666bis], for clarifying the relationship between RDMA and direct data placement, and for beginning the work on RPC-over-RDMA Version Two.

The extract.sh shell script and formatting conventions were first described by the authors of the NFSv4.1 XDR specification [RFC5662].

Author's Address

David Noveck
Hewlett Packard Enterprise
165 Dascomb Road
Andover, MA 01810
USA

Phone: +1 781-572-8038
EMail: davenoveck@gmail.com