Internet DRAFT - draft-callaghan-rpc-rdma
Internet-Draft Brent Callaghan
Expires: November 2003 Sun Microsystems, Inc.
Tom Talpey
Network Appliance, Inc.
Document: draft-callaghan-rpcrdma-00.txt May, 2003
RDMA Transport for ONC RPC
Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
Copyright Notice
Copyright (C) The Internet Society (2003). All Rights Reserved.
Expires: November 2003 Callaghan and Talpey [Page 1]
Internet-Draft RDMA Transport for ONC RPC May 2003
Abstract
A protocol is described providing RDMA as a new transport for ONC
RPC. The RDMA transport binding conveys the benefits of efficient,
bulk data transport over high speed networks, while providing for
minimal change to RPC applications and with no required revision of
the application RPC protocol, or the RPC protocol itself.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Abstract RDMA Model . . . . . . . . . . . . . . . . . . . . 3
3. Protocol Outline . . . . . . . . . . . . . . . . . . . . . . 5
3.1. Short Messages . . . . . . . . . . . . . . . . . . . . . . 5
3.2. Data Chunks . . . . . . . . . . . . . . . . . . . . . . . 6
3.3. Flow Control . . . . . . . . . . . . . . . . . . . . . . . 6
3.4. XDR Encoding with Chunks . . . . . . . . . . . . . . . . . 7
3.5. Padding . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.6. XDR Decoding with Read Chunks . . . . . . . . . . . . . 10
3.7. XDR Decoding with Write Chunks . . . . . . . . . . . . . 10
3.8. RPC Call and Reply . . . . . . . . . . . . . . . . . . . 11
4. RPC RDMA Message Layout . . . . . . . . . . . . . . . . . 13
4.1. RPC RDMA Transport Header . . . . . . . . . . . . . . . 13
4.2. XDR Language Description . . . . . . . . . . . . . . . . 15
5. Large Chunkless Messages . . . . . . . . . . . . . . . . . 16
5.1. Message as an RDMA Read Chunk . . . . . . . . . . . . . 17
5.2. RDMA Write of Long Replies . . . . . . . . . . . . . . . 18
6. Connection Configuration Protocol . . . . . . . . . . . . 19
6.1. Initial Connection State . . . . . . . . . . . . . . . . 20
6.2. Protocol Description . . . . . . . . . . . . . . . . . . 20
7. Memory Registration Overhead . . . . . . . . . . . . . . . 21
8. Errors and Error Recovery . . . . . . . . . . . . . . . . 21
9. Node Addressing . . . . . . . . . . . . . . . . . . . . . 21
10. RPC Binding . . . . . . . . . . . . . . . . . . . . . . . 22
11. Security . . . . . . . . . . . . . . . . . . . . . . . . 22
12. IANA Considerations . . . . . . . . . . . . . . . . . . . 23
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . 23
14. References . . . . . . . . . . . . . . . . . . . . . . . 23
15. Authors' Addresses . . . . . . . . . . . . . . . . . . . 24
16. Full Copyright Statement . . . . . . . . . . . . . . . . 25
1. Introduction
RDMA is a technique for efficient movement of data over high speed
transports. It facilitates data movement via direct memory access by
hardware, yielding faster transfers of data over a network while
reducing host CPU overhead.
ONC RPC [RFC1831] is a remote procedure call protocol that has been
run over a variety of transports. Most implementations today use UDP
or TCP. RPC messages are defined in terms of an eXternal Data
Representation (XDR) [RFC1832] which provides a canonical data
representation across a variety of host architectures. An XDR data
stream is conveyed differently on each type of transport. On UDP,
RPC messages are encapsulated inside datagrams, while on a TCP byte
stream, RPC messages are delineated by a record marking protocol. An
RDMA transport also conveys RPC messages in a unique fashion that
must be fully described if client and server implementations are to
interoperate.
RDMA transports present new semantics unlike the behaviors of either
UDP or TCP. They retain message delineations like UDP, while also
providing reliable, sequenced data transfer like TCP. In addition,
they provide the new efficient, bulk transfer service of RDMA. RDMA
transports are therefore naturally viewed as a new transport type by
ONC RPC.
RDMA as a transport will benefit the performance of RPC protocols
that move large "chunks" of data, since RDMA hardware excels at
moving data efficiently between host memory and a high speed network
with little or no host CPU involvement. In this context, the NFS
protocol, in all its versions, is an obvious beneficiary of RDMA.
Many other RPC-based protocols will also benefit.
Although the RDMA transport described here provides relatively
transparent support for any RPC application, the proposal goes
further in describing mechanisms that can optimize the use of RDMA
with more active participation by the RPC application.
2. Abstract RDMA Model
An RPC transport is responsible for conveying an RPC message from a
sender to a receiver. An RPC message is either an RPC call from a
client to a server, or an RPC reply from the server back to the
client. An RPC message contains an RPC call header followed by
arguments if the message is an RPC call, or an RPC reply header
followed by results if the message is an RPC reply. The call header
contains a transaction ID (XID) followed by the program and procedure
number as well as a security credential. An RPC reply header begins
with an XID that matches that of the RPC call message, followed by a
security verifier and results. All data in an RPC message is XDR
encoded. For a complete description of the RPC protocol and XDR
encoding, see [RFC1831] and [RFC1832].
This protocol assumes an abstract model for RDMA transports. The
following terms, common in the RDMA lexicon, are used in this
document. A more complete glossary of RDMA terms can be found in
[RDMA].
o Registered Memory
All data moved via RDMA must be resident in registered
memory at its source and destination. Each segment of
registered memory must be identified with a Steering Tag
(STag) of no more than 32 bits and memory addresses of up
to 64 bits in length.
o RDMA Send
The RDMA provider supports an RDMA Send operation with
completion signalled at the receiver when data is placed
in a pre-posted buffer. The amount of transferred data
is limited only by the size of the receiver's buffer.
Sends complete at the receiver in the order they were
issued at the sender.
o RDMA Write
The RDMA provider supports an RDMA Write operation to
directly place data in the receiver's buffer. An RDMA
Write is initiated by the sender and completion is
signalled at the sender. No completion is signalled at
the receiver. The sender uses a Steering Tag (STag),
memory address and length of the remote destination
buffer. A subsequent completion, provided by RDMA Send,
must be obtained at the receiver to guarantee that RDMA
Write data has been successfully placed in the receiver's
memory.
o RDMA Read
The RDMA provider supports an RDMA Read operation to
directly place peer source data in the requester's buffer.
An RDMA Read is initiated by the receiver and completion is
signalled at the receiver. The receiver provides
Steering Tags, memory addresses and a length for the
remote source and local destination buffers.
Since the peer at the data source receives no notification
of RDMA Read completion, there is an assumption that on
receiving the data the receiver will signal completion
with an RDMA Send message, so that the peer can free the
source buffers.
In its abstract form, this protocol is not an interoperable standard.
It becomes a useful, implementable standard only when mapped onto a
specific RDMA transport, like iWARP [RDDP] or InfiniBand [IB].
3. Protocol Outline
An RPC message can be conveyed in identical fashion, whether it is a
CALL or REPLY message. In each case, the transmission of the message
proper is preceded by transmission of a transport header for use by
RPC over RDMA transports. This header is analogous to the record
marking used for RPC over TCP, but is more extensive: RDMA transports
support several modes of data transfer, the client and server must be
able to use the most efficient mode for any given transfer, and
multiple pieces of a message may be transferred in different ways to
different destinations.
All transfers of a CALL or REPLY begin with an RDMA send which
transfers at least the transport header, usually with the CALL or
REPLY message appended, or at least some part thereof. Because the
size of what may be transmitted via RDMA send is limited by the size
of the receiver's pre-posted buffer, the RPC over RDMA transport
provides a number of methods to reduce the amount transferred by
means of the RDMA send, when necessary, by transferring various parts
of the message using RDMA read and RDMA write.
3.1. Short Messages
Many RPC messages are quite short. For example, the NFS version 3
GETATTR request is only 56 bytes: 20 bytes of RPC header plus a 32
byte filehandle argument and 4 bytes of length. The reply to this
common request is about 100 bytes.
There is no benefit in transferring such small messages with an RDMA
Read or Write operation. The overhead in transferring STags and
memory addresses is justified only by large transfers. The critical
message size that justifies RDMA transfer will vary depending on the
RDMA implementation and network, but is typically on the order of a
few kilobytes. It is appropriate to transfer a short message with an
RDMA Send to a pre-posted buffer. The transport header with the
short message (CALL or REPLY) immediately following is transferred
using a single RDMA send operation.
Short RPC messages over an RDMA transport will look like this:
Client Server
| RPC Call |
Send | ------------------------------> |
| |
| RPC Reply |
| <------------------------------ | Send
3.2. Data Chunks
Some protocols, like NFS, have RPC procedures that can transfer very
large "chunks" of data in the RPC call or reply and would cause the
maximum send size to be exceeded if one tried to transfer them as
part of the RDMA send. These large chunks typically range from a
kilobyte to a megabyte or more. An RDMA transport can transfer large
chunks of data more efficiently via the direct placement of an RDMA
Read or RDMA Write operation. Using direct placement instead of
in-line transfer not only avoids expensive data copies, but also
provides correct data alignment at the destination.
3.3. Flow Control
It is critical to provide flow control for an RDMA connection. RDMA
receive operations will fail if a pre-posted receive buffer is not
available to accept an incoming RDMA Send. Such errors are fatal to
the connection. This is a departure from conventional TCP/IP
networking where buffers are allocated dynamically on an as-needed
basis, and pre-posting is not required.
It is not practical to provide for fixed credit limits at the RPC
server. Fixed limits scale poorly, since posted buffers are
dedicated to the associated connection until consumed by receive
operations. Additionally, for protocol correctness the server must
always be able to reply, whether or not a new buffer can be posted
to accept future receives.
Flow control is implemented as a simple request/grant protocol in the
transport header associated with each RPC message. The transport
header for RPC CALL messages contains a requested credit value for
the server, which may be dynamically adjusted by the caller to match
its expected needs. The transport header for RPC REPLY messages
provides the granted value, which may be any value except that it
must not be zero when there are no in-progress operations at the
server, since a zero grant in that case would result in deadlock.
The value may be adjusted up or down at each opportunity to match
the server's needs or policies.
While RPC CALLs may complete in any order, the current flow control
limit at the RPC server is known to the RPC client from the Send
ordering properties. It is always the most recently granted server
credit value minus the number of requests in flight.
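As an illustration only (not part of the protocol), the client's
view of its current send limit can be computed as follows; the
function name is hypothetical.

```python
def send_limit(granted_credits, requests_in_flight):
    """Client-side flow control check, as described above.

    granted_credits: credit value from the most recent RPC REPLY.
    requests_in_flight: the client's own outstanding RPC CALLs.
    Returns how many additional CALLs may currently be sent.
    """
    return max(0, granted_credits - requests_in_flight)
```

For example, with 32 granted credits and 5 calls outstanding, up to
27 further calls may be issued before the client must wait for a
reply carrying a new grant.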
3.4. XDR Encoding with Chunks
The data comprising an RPC call or reply message is marshaled or
serialized into a contiguous stream by an XDR routine. XDR data
types such as integers, strings, arrays and linked lists are commonly
implemented over two very simple functions that encode either an XDR
data unit (32 bits) or an array of bytes.
Normally, the separate data items in an XDR call or reply are encoded
as a contiguous sequence of bytes for network transmission over UDP
or TCP. However, in the case of an RDMA transport, local routines
such as XDR encode can determine that an opaque byte array is large
enough to be more efficiently moved via an RDMA data transfer
operation like RDMA Read or RDMA Write.
When sending any message (request or reply) that contains a
candidate large data chunk, the XDR encoding routine avoids moving
the data into the XDR stream. Instead of encoding the chunk data, it
records the address and size of each chunk in a separate "read chunk
list" encoded within the RPC RDMA transport-specific header. Such
chunks will be transferred via RDMA Read operations initiated by the
receiver.
Since the chunks are to be moved via RDMA, the memory for each chunk
must be registered. This registration may take place within XDR
itself, providing for full transparency to upper layers, or it may be
performed by any other specific local implementation.
Additionally, when making an RPC call that can result in bulk data
transferred in the reply, it is desirable to provide chunks to accept
the data directly via RDMA Write. These chunks will therefore be
pre-filled by the server prior to responding, and XDR decode at the
client will not be required. These "write chunk lists" undergo a
similar registration and advertisement to chunks built as a part of
XDR encoding. Just as with an encoded read chunk list, the memory
referenced in an encoded write chunk list must be pre-registered. If
the client chooses not to make a write chunk list available, then the
server must return chunks in the reply via a read chunk list.
The following items are contained in a chunk list entry.
STag
Steering tag or handle obtained when the chunk
memory is registered for RDMA.
Length
The length of the chunk in bytes.
Offset
The offset or memory address of the chunk.
Position
For data which is to be encoded, the position in
the XDR stream where the chunk would normally
reside. It is possible that a contiguous sequence
of chunks might all have the same position. For
data which is to be decoded, no "position" is
used.
When XDR marshaling is complete, the chunk list is XDR encoded, then
sent to the receiver prepended to the RPC message. Any source data
for a read chunk, or the destination of a write chunk, remains
behind in the sender's registered memory.
+----------------+----------------+-------------
| | |
| RDMA header w/ | RPC Header | Non-chunk args/results
| chunks | |
+----------------+----------------+-------------
Read chunk lists are structured differently from write chunk lists.
This reflects their different usage: read chunks are decoded and
indexed by their position in the XDR data stream, and may be used
for both arguments and results. Write chunks, on the other hand, are
used only for results, and have no preassigned offset in the XDR
stream until the results are produced. The mapping of write chunks
onto designated NFS procedures and results is described in [NFSDDP].
Therefore, read chunks are encoded as a single array, with each entry
tagged by its position in the XDR stream. Write chunks are encoded
as a list of arrays of RDMA buffers, with each list element providing
buffers for a separate result.
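The structural difference can be illustrated with hypothetical
in-memory shapes (illustrative only; the normative field layout is
the XDR in section 4.2):

```python
# Read chunk list: a single flat array, each entry tagged with its
# position in the XDR stream.  A contiguous sequence of chunks may
# share the same position.
read_list = [
    {"position": 136, "handle": 7, "offset": 0x1000, "length": 8192},
    {"position": 136, "handle": 8, "offset": 0x9000, "length": 4096},
]

# Write chunk list: a list of arrays; each array provides the RDMA
# buffers (segments) for one separate result.
write_list = [
    [{"handle": 9,  "offset": 0x2000, "length": 32768}],   # result 1
    [{"handle": 10, "offset": 0x4000, "length": 16384},
     {"handle": 11, "offset": 0x8000, "length": 16384}],   # result 2
]
```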
3.5. Padding
Alignment of specific opaque data enables certain scatter/gather
optimizations. Padding leverages the useful property that RDMA
transfers preserve alignment of data, even when they are placed into
pre-posted receive buffers by Sends.
Many servers can make good use of such padding. Padding allows the
chaining of RDMA receive buffers such that any data transferred by
RDMA on behalf of RPC requests will be placed into appropriately
aligned buffers on the system that receives the transfer. This
obviates the need for servers to perform RDMA Read to satisfy all
but the largest client writes.
The effect of padding is demonstrated below showing prior bytes on an
XDR stream (XXX) followed by an opaque field consisting of four
length bytes (LLLL) followed by data bytes (DDDD). The receiver of
the RDMA Send has posted two chained receive buffers. Without
padding, the opaque data is split across the two buffers. With the
addition of padding bytes (ppp) prior to the first data byte, the
data can be forced to align correctly in the second buffer.
Buffer 1 Buffer 2
Unpadded -------------- --------------
XXXXXXXLLLLDDDDDDDDDDDDDD ---> XXXXXXXLLLLDDD DDDDDDDDDDD
Padded
XXXXXXXLLLLpppDDDDDDDDDDDDDD ---> XXXXXXXLLLLppp DDDDDDDDDDDDDD
Padding is implemented completely within the RDMA transport encoding,
flagged with a specific message type. Where padding is applied, two
values are passed to the peer: an "rdma_align" which is the padding
value used, and "rdma_thresh", which is the opaque data size at or
above which padding is applied. For instance, if the server is using
chained 4 KB receive buffers, then up to (4 KB - 1) padding bytes
could be used to achieve alignment of the data. If padding is to
apply only to chunks at least 1 KB in size, then the threshold should
be set to 1 KB. The XDR routine at the peer will consult these
values when decoding opaque values. Where the decoded length exceeds
the rdma_thresh, the XDR decode will skip over the appropriate
padding as indicated by rdma_align and the current XDR stream
position.
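One plausible reading of the padding computation can be sketched as
follows. This is an illustrative model, not normative: it assumes
the pad count is derived from the current XDR stream position modulo
the alignment value.

```python
def pad_count(stream_position, rdma_align):
    """Padding bytes to insert before the opaque data so that the
    first data byte lands on an rdma_align boundary in the chained
    receive buffers."""
    return (rdma_align - stream_position % rdma_align) % rdma_align

def padding_applies(opaque_length, rdma_thresh):
    """Padding is applied only to opaque data whose size is at or
    above the advertised threshold."""
    return opaque_length >= rdma_thresh
```

In the figure above, 7 prior bytes plus a 4-byte length give a
stream position of 11; with 14-byte chained buffers, three padding
bytes push the first data byte to the start of the second buffer.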
3.6. XDR Decoding with Read Chunks
The XDR decode process moves data from an XDR stream into a data
structure provided by the client or server application. Where
elements of the destination data structure are buffers or strings,
the RPC application can either pre-allocate storage to receive the
data, or leave the string or buffer fields null and allow the XDR
decode to automatically allocate storage of sufficient size.
When decoding a message from an RDMA transport, the receiver first
XDR decodes the chunk lists from the RDMA transport header, then
proceeds to decode the body of the RPC message (arguments or
results). Whenever the XDR offset in the decode stream matches that
of a chunk in the read chunk list, the XDR routine registers the
memory for the destination buffer, then initiates an RDMA Read to
bring over the chunk data. If an RPC client uses RDMA Read to fetch
chunks in the reply then it must issue an RDMA_DONE message
(described in Section 3.8) to notify the server that the source
buffers can be freed.
The read chunk list is constructed and used entirely within the
RPC/XDR layer. Other than specifying the minimum chunk size, the
management of the read chunk list is automatic and transparent to an
RPC application.
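The decode-side behavior can be sketched as follows. This is an
illustrative reassembly model, not an implementation of the
specification: `rdma_read` stands in for the transport's RDMA Read
operation, and a real decoder works incrementally, field by field,
registering destination memory as it goes.

```python
def reassemble_stream(inline_data, read_chunks, rdma_read):
    """Rebuild the full XDR stream from the in-line bytes received
    via RDMA Send plus the chunks fetched via RDMA Read.

    read_chunks: list of (position, handle, offset, length) tuples,
                 where position is the chunk's place in the final
                 XDR stream.
    rdma_read:   callable(handle, offset, length) -> bytes.
    """
    stream = bytearray()
    inline_pos = 0
    for position, handle, offset, length in sorted(read_chunks):
        # Copy in-line bytes preceding this chunk's XDR position.
        # (Chunks sharing a position are simply concatenated.)
        take = max(0, position - len(stream))
        stream += inline_data[inline_pos:inline_pos + take]
        inline_pos += take
        stream += rdma_read(handle, offset, length)
    stream += inline_data[inline_pos:]   # trailing in-line bytes
    return bytes(stream)
```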
3.7. XDR Decoding with Write Chunks
When a "write chunk list" is provided in the RPC CALL, the server
must provide any corresponding data via RDMA Write to the memory
referenced in the chunk list entries. The RPC REPLY conveys this by
returning the write chunk list to the client with the lengths
rewritten to match the actual transfer. The XDR "decode" of the
reply therefore performs no local data transfer but merely returns
the length obtained from the reply.
Each decoded result consumes one entry in the write chunk list, which
in turn consists of an array of RDMA segments. The length is
therefore the sum of all returned lengths in all segments comprising
the corresponding list entry. As each list entry is "decoded", the
entire entry is consumed.
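The decoded length is then just the sum of the rewritten segment
lengths; a minimal sketch, with the segment tuple layout assumed for
illustration:

```python
def decoded_result_length(write_chunk_entry):
    """Length of one decoded result: the sum of the lengths the
    server wrote back into the entry's RDMA segments.

    write_chunk_entry: list of (handle, offset, length) segments.
    """
    return sum(length for _handle, _offset, length in write_chunk_entry)
```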
The write chunk list is constructed and used by the RPC application.
The RPC/XDR layer simply conveys the list between client and server
and initiates the RDMA Writes back to the client. The mapping of
write chunk list entries to procedure arguments must be determined
for each protocol. An example of a mapping is described in [NFSDDP].
3.8. RPC Call and Reply
The RDMA transport for RPC provides three methods of moving data
between client and server:
In-line
Data are moved between client and server
within an RDMA Send.
RDMA Read
Data are moved between client and server
via an RDMA Read operation, using the STag, address
and offset obtained from a read chunk list.
RDMA Write
Result data are moved from server to client
via an RDMA Write operation, using the STag, address
and offset obtained from a write chunk list
or reply chunk in the client's RPC call message.
These methods of data movement may occur in combinations within a
single RPC. For instance, an RPC call may contain some in-line data
along with some large chunks transferred via RDMA Read by the server.
The reply to that call may have some result chunks that the server
RDMA Writes back to the client. The following protocol interactions
illustrate RPC calls that use these methods to move RPC message data:
An RPC with write chunks in the call message looks like this:
Client Server
| RPC Call + Write Chunk list |
Send | ------------------------------> |
| |
| Chunk 1 |
| <------------------------------ | Write
| : |
| Chunk n |
| <------------------------------ | Write
| |
| RPC Reply |
| <------------------------------ | Send
An RPC with read chunks in the call message looks like this:
Client Server
| RPC Call + Read Chunk list |
Send | ------------------------------> |
| |
| Chunk 1 |
| +------------------------------ | Read
| v-----------------------------> |
| : |
| Chunk n |
| +------------------------------ | Read
| v-----------------------------> |
| |
| RPC Reply |
| <------------------------------ | Send
And an RPC with read chunks in the reply message looks like this:
Client Server
| RPC Call |
Send | ------------------------------> |
| |
| RPC Reply + Read Chunk list |
| <------------------------------ | Send
| |
| Chunk 1 |
Read | ------------------------------+ |
| <-----------------------------v |
| : |
| Chunk n |
Read | ------------------------------+ |
| <-----------------------------v |
| |
| RPC Done |
Send | ------------------------------> |
The final RPC Done message allows the client to signal the server
that it has received the chunks, so the server can de-register and
free the memory holding the chunks. An RPC Done completion is not
necessary for an RPC call, since the RPC reply Send is itself a
receive completion notification.
The RPC Done message has no effect on protocol latency since the
client has no expectation of a reply from the server. Nor does it
adversely affect bandwidth since it is only 16 bytes in length. In
the event that the client fails to return the Done message, the
server can de-register and free the chunk buffers after a time-out.
Finally, it is possible to conceive of RPC exchanges that involve any
or all combinations of write chunks in the RPC CALL, read chunks in
the RPC CALL, and read chunks in the RPC REPLY. Support for such
exchanges is straightforward from a protocol perspective, but in
practice such exchanges would be quite rare, limited to upper layer
protocol exchanges which transferred bulk data in both the call and
corresponding reply.
4. RPC RDMA Message Layout
RPC call and reply messages are conveyed across an RDMA transport
with a prepended RDMA transport header. The transport header
includes data for RDMA flow control credits, padding parameters and
lists of addresses that provide direct data placement via RDMA Read
and Write operations. The layout of the RPC message itself is
unchanged from that described in [RFC1831] except for the possible
exclusion of large data chunks that will be moved by RDMA Read or
Write operations. If the RPC message (along with the transport
header) is too long for the posted receive buffer (even after any
large chunks are removed), then the entire RPC message can be moved
separately as a chunk, leaving just the transport header in the RDMA
Send.
4.1. RPC RDMA Transport Header
The RPC RDMA transport header begins with four 32-bit fields that are
always present and which control the RDMA interaction including
RDMA-specific flow control. These are then followed by a number of
items such as chunk lists and padding which may or may not be present
depending on the type of transmission. The four fields which are
always present are:
1. Transaction ID (XID).
The XID generated for the RPC call and reply. Having
the XID at the beginning of the message makes it easy to
establish the message context. This XID mirrors the XID
in the RPC call header, and takes precedence.
2. Version number.
This version of the RPC RDMA message protocol is 1.
The version number must be increased by one whenever the
format of the RPC RDMA messages is changed.
3. Flow control credit value.
When sent in an RPC CALL message, the requested value is
provided. When sent in an RPC REPLY message, the
granted value is returned. RPC CALLs must not be sent
in excess of the currently granted limit.
4. Message type.
RDMA_MSG = 0 indicates that chunk lists and an RPC message
follow. RDMA_NOMSG = 1 indicates that no RPC message follows
the chunk lists; in this case, the chunk lists provide the
information needed to transfer the message proper using RDMA
Read or Write, so the message is not appended to the RPC RDMA
transport header. RDMA_MSGP = 2 indicates that chunk lists
and an RPC message with some padding follow. RDMA_DONE = 3
indicates that the message signals the completion of a chunk
transfer via RDMA Read.
For a message of type RDMA_MSG or RDMA_NOMSG, the Read and Write
chunk lists follow. If the Read chunk list is null (a 32 bit word of
zeros), then there are no chunks to be transferred separately and the
RPC message follows in its entirety. If non-null, it marks the
beginning of an XDR encoded sequence of Read chunk list entries. If
the Write chunk list is non-null, then an XDR encoded sequence of
Write chunk entries follows.
If the message type is RDMA_MSGP, then two additional fields that
specify the padding alignment and threshold are inserted prior to the
Read and Write chunk lists.
A transport header of message type RDMA_MSG or RDMA_MSGP will be
followed by the RPC call or reply message, beginning with the XID.
This XID should match the one at the beginning of the RPC message
header.
+--------+---------+---------+-----------+-------------+----------
| | | | Message | NULLs | RPC Call
| XID | Version | Credits | Type | or | or
| | | | | Chunk Lists | Reply Msg
+--------+---------+---------+-----------+-------------+----------
Note that in the case of RDMA_DONE, no chunk list or RPC message
follows. As an implementation hint: a gather operation on the Send
of the RDMA RPC message can be used to marshal the initial header,
the chunk list, and the RPC message itself.
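Because XDR represents each unsigned integer as a big-endian 32-bit
word, the fixed 16-byte portion of the transport header can be
marshaled as in this illustrative sketch (the message type values
come from the rdma_proc enum in section 4.2):

```python
import struct

# Message type values from the rdma_proc enum.
RDMA_MSG, RDMA_NOMSG, RDMA_MSGP, RDMA_DONE = 0, 1, 2, 3

def pack_fixed_header(xid, vers, credit, msg_type):
    """Marshal the four always-present header fields (XID, version,
    credit value, message type) as big-endian 32-bit XDR words."""
    return struct.pack(">IIII", xid, vers, credit, msg_type)
```

In an implementation, the gather on the Send would then append the
chunk lists and the RPC message after these 16 bytes.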
4.2. XDR Language Description
Here is the message layout in XDR language.
struct xdr_rdma_segment {
uint32 handle; /* Registered memory handle */
uint32 length; /* Length of the chunk in bytes */
uint64 offset; /* Chunk virtual address or offset */
};
struct xdr_read_chunk {
uint32 position; /* Position in XDR stream */
struct xdr_rdma_segment target;
};
struct xdr_read_list {
struct xdr_read_chunk entry;
struct xdr_read_list *next;
};
struct xdr_write_chunk {
struct xdr_rdma_segment target<>;
};
struct xdr_write_list {
struct xdr_write_chunk entry;
struct xdr_write_list *next;
};
struct rdma_msg {
uint32 rdma_xid; /* Mirrors the RPC header xid */
uint32 rdma_vers; /* Version of this protocol */
uint32 rdma_credit; /* Buffers requested/granted */
rdma_body rdma_body;
};
enum rdma_proc {
RDMA_MSG=0, /* An RPC call or reply msg */
RDMA_NOMSG=1, /* An RPC call or reply msg - separate body */
RDMA_MSGP=2, /* An RPC call or reply msg with padding */
RDMA_DONE=3 /* Client signals reply completion */
};
union rdma_body switch (rdma_proc proc) {
case RDMA_MSG:
rpc_rdma_header rdma_msg;
case RDMA_NOMSG:
rpc_rdma_header_nomsg rdma_nomsg;
case RDMA_MSGP:
rpc_rdma_header_padded rdma_msgp;
case RDMA_DONE:
void;
};
struct rpc_rdma_header {
struct xdr_read_list *rdma_reads;
struct xdr_write_list *rdma_writes;
struct xdr_write_chunk *rdma_reply;
/* rpc body follows */
};
struct rpc_rdma_header_nomsg {
struct xdr_read_list *rdma_reads;
struct xdr_write_list *rdma_writes;
struct xdr_write_chunk *rdma_reply;
};
struct rpc_rdma_header_padded {
uint32 rdma_align; /* Padding alignment */
uint32 rdma_thresh; /* Padding threshold */
struct xdr_read_list *rdma_reads;
struct xdr_write_list *rdma_writes;
struct xdr_write_chunk *rdma_reply;
/* rpc body follows */
};
5. Large Chunkless Messages
The receiver of RDMA Send messages is required to have previously
posted one or more correctly sized buffers. The client can inform
the server of the maximum size of its RDMA Send messages via the
Connection Configuration Protocol described later in this document.
Since RPC messages are frequently small, memory savings can be
achieved by posting small buffers. Even large messages like NFS READ
or WRITE will be quite small once the chunks are removed from the
message. However, there may be large, chunkless messages that would
demand a very large buffer be posted. A good example is an NFS
READDIR reply which may contain a large number of small filename
strings. Also, the NFS version 4 protocol [RFC3530] features
COMPOUND request and reply messages of unbounded length.
Ideally, each upper layer will negotiate these limits. However, it
is frequently necessary to provide a transparent solution.
5.1. Message as an RDMA Read Chunk
One relatively simple method is to have the client identify any RPC
message that exceeds the server's posted buffer size and move it
separately as a chunk, i.e. reference it as the first entry in the
read chunk list with an XDR position of zero.
Normal Message
+--------+---------+---------+------------+-------------+----------
| | | | | | RPC Call
| XID | Version | Credits | RDMA_MSG | Chunk Lists | or
| | | | | | Reply Msg
+--------+---------+---------+------------+-------------+----------
Long Message
+--------+---------+---------+------------+-------------+
| | | | | |
| XID | Version | Credits | RDMA_NOMSG | Chunk Lists |
| | | | | |
+--------+---------+---------+------------+-------------+
|
| +----------
| | Long RPC Call
+->| or
| Reply Message
+----------
If the receiver gets a transport header with a message type of
RDMA_NOMSG and finds an initial read chunk list entry with a zero XDR
position, it allocates a registered buffer and issues an RDMA Read of
the long RPC message into it. The receiver then proceeds to XDR
decode the RPC message as if it had received it in-line with the Send
data. Further decoding may issue additional RDMA Reads to bring over
additional chunks.
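The receiver's check described above can be sketched as below. This is
a hypothetical illustration; fetch_rpc_message and the dict-based
chunk representation are not part of the protocol, and rdma_read
stands in for the transport's RDMA Read verb.

```python
# Sketch of the long-message test: an RDMA_NOMSG header whose first
# read chunk has XDR position zero carries the whole RPC message.
RDMA_MSG, RDMA_NOMSG = 0, 1   # assumed enum values

def fetch_rpc_message(proc, inline_body, read_chunks, rdma_read):
    """Return the RPC message bytes, issuing an RDMA Read if needed."""
    if (proc == RDMA_NOMSG and read_chunks
            and read_chunks[0]["position"] == 0):
        # Long message: pull the entire RPC message as a chunk,
        # then XDR-decode it as if it had arrived in-line.
        return rdma_read(read_chunks[0])
    return inline_body

# Usage with a toy "remote memory" standing in for the sender's buffer
remote = {0x10: b"LONG RPC CALL"}
fake_read = lambda chunk: remote[chunk["handle"]]

assert fetch_rpc_message(RDMA_MSG, b"short call", [],
                         fake_read) == b"short call"
assert fetch_rpc_message(RDMA_NOMSG, b"",
                         [{"position": 0, "handle": 0x10}],
                         fake_read) == b"LONG RPC CALL"
```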
Although the handling of long messages requires one extra network
turnaround, in practice these messages should be rare if the posted
receive buffers are correctly sized, and of course they will be non-
existent for RDMA-aware upper layers.
An RPC with a long reply returned via RDMA Read looks like this:
Client Server
| RPC Call |
Send | ------------------------------> |
| |
| RPC Transport Header |
| <------------------------------ | Send
| |
| Long RPC Reply Msg |
Read | ------------------------------+ |
| <-----------------------------v |
| |
| RPC Done |
Send | ------------------------------> |
5.2. RDMA Write of Long Replies
An alternative method of handling long, chunkless RPC replies is to
have the client post a large buffer into which the server can write a
large RPC reply. This has the advantage that an RDMA Write may be
slightly faster in network latency than an RDMA Read. Additionally,
it removes the need for the RDMA_DONE message that is required when a
large reply is returned as a Read chunk.
This protocol supports direct return of a large reply via the
inclusion of an optional rdma_reply write chunk after the read chunk
list and the write chunk list. The client allocates a buffer sized
to receive a large reply and enters its STag, address and length in
the rdma_reply write chunk. If the reply message is too long to
return in-line with an RDMA Send (exceeds the size of the client's
posted receive buffer), even with read chunks removed, then the
server RDMA writes the RPC reply message into the buffer indicated by
the rdma_reply chunk. If the client doesn't provide an rdma_reply
chunk, or if it's too small, then the message must be returned as a
Read chunk.
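The server's three-way choice described above can be sketched as
follows. The function reply_path and the dict representation of the
rdma_reply chunk are illustrative assumptions, not part of the
protocol.

```python
# Hypothetical sketch of the server's reply-path decision: send the
# reply in-line, RDMA Write it into the client's rdma_reply buffer,
# or fall back to returning it as a Read chunk.
def reply_path(reply_size, client_recv_size, rdma_reply_chunk):
    """rdma_reply_chunk is None or a dict with a 'length' field."""
    if reply_size <= client_recv_size:
        return "inline"        # fits the client's posted receive buffer
    if rdma_reply_chunk and reply_size <= rdma_reply_chunk["length"]:
        return "rdma_write"    # server writes into the client's buffer
    return "read_chunk"        # chunk absent or too small: Read chunk

assert reply_path(512, 1024, None) == "inline"
assert reply_path(8192, 1024, {"length": 16384}) == "rdma_write"
assert reply_path(8192, 1024, {"length": 4096}) == "read_chunk"
```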
An RPC with a long reply returned via RDMA Write looks like this:
Client Server
| RPC Call with rdma_reply |
Send | ------------------------------> |
| |
| Long RPC Reply Msg |
| <------------------------------ | Write
| |
| RPC Transport Header |
| <------------------------------ | Send
The use of RDMA Write to return long replies requires that the client
application anticipate a long reply and have some knowledge of its
size so that a correctly sized buffer can be allocated. This is
certainly true of NFS READDIR replies, where the client already
provides an upper bound on the size of the encoded directory fragment
to be returned by the server.
6. Connection Configuration Protocol
RDMA Send operations require the receiver to post one or more buffers
at the RDMA connection endpoint, each large enough to receive the
largest Send message. Buffers are consumed as Send messages are
received. If a buffer is too small, or if there are no buffers
posted, the RDMA transport will return an error and break the RDMA
connection. The receiver must post sufficient, correctly sized
buffers to avoid buffer overrun or capacity errors.
The protocol described above includes only a mechanism for managing
the number of such receive buffers; it provides no explicit features
to allow the client and server to provision or control buffer sizing,
nor any other session parameters.
In the past, this type of connection management has not been
necessary for RPC: RPC over UDP or TCP does not negotiate the link,
and the server can get a rough idea of the maximum message size from
the upper-layer protocol code. However, a protocol to negotiate
transport features on a more dynamic basis is desirable.
The Connection Configuration Protocol allows the client to pass its
connection requirements to the server, and allows the server to
inform the client of its connection limits.
6.1. Initial Connection State
This protocol will be used for connection setup prior to the use of
another RPC protocol that uses the RDMA transport. It operates in-
band, i.e. it uses the connection itself to negotiate the connection
parameters. To provide a basis for connection negotiation, the
connection is assumed to provide a basic level of interoperability:
the ability to exchange at least one RPC message at a time that is at
least 1 KB in size. The server may exceed this basic level of
configuration, but the client must not assume it.
6.2. Protocol Description
Version 1 of the protocol consists of a single procedure that allows
the client to inform the server of its connection requirements and
the server to return connection information to the client.
The maxcallsize argument is the maximum size of an RPC call message
that the client will send in-line in an RDMA Send message to the
server. The server may return a maxcallsize value that is smaller or
larger than the client's request. The client must not send an in-
line call message larger than what the server will accept. The
maxcallsize limits only the size of in-line RPC calls. It does not
limit the size of long RPC messages transferred as an initial chunk
in the Read chunk list.
The maxreplysize is the maximum size of an in-line RPC message that
the client will accept from the server.
The align value is the value recommended by the server for opaque
data values such as strings and counted byte arrays. The client can
use this value to compute the number of prepended pad bytes when XDR
encoding opaque values in the RPC call message.
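The pad computation described above can be applied as sketched here.
The helper prepend_pad is a hypothetical illustration, not a protocol
element.

```python
# Number of pad bytes to prepend so that opaque data at the given XDR
# offset lands on the server's recommended alignment boundary.
def prepend_pad(xdr_offset, align):
    """Bytes of padding needed to bring xdr_offset up to 'align'."""
    if align == 0:
        return 0
    return (align - xdr_offset % align) % align

assert prepend_pad(10, 8) == 6   # next 8-byte boundary after offset 10
assert prepend_pad(16, 8) == 0   # already aligned
```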
typedef unsigned int uint32;

struct config_rdma_req {
   uint32 maxcallsize;    /* max size of in-line RPC call */
   uint32 maxreplysize;   /* max size of in-line RPC reply */
};

struct config_rdma_reply {
   uint32 maxcallsize;    /* max call size accepted by server */
   uint32 align;          /* server's receive buffer alignment */
};
program CONFIG_RDMA_PROG {
   version VERS1 {
      /*
       * Config call/reply
       */
      config_rdma_reply CONF_RDMA(config_rdma_req) = 1;
   } = 1;
} = nnnnnn; <-- Need program number assigned
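A minimal sketch of the CONF_RDMA exchange follows, using dicts for
the structures above. The clamping policy shown (the server limits
the client's request to what it will post, never below the 1 KB floor
of section 6.1) is one plausible policy, not mandated by the
protocol; the draft allows the server to return a larger value as
well.

```python
# Hypothetical server-side handler for the CONF_RDMA procedure.
BASE_INLINE = 1024   # minimum interoperable in-line size (1 KB)

def server_conf_rdma(req, server_max_recv, server_align):
    """Build a config_rdma_reply for a config_rdma_req.

    The client must not send an in-line call larger than the
    maxcallsize the server returns here."""
    maxcall = max(BASE_INLINE, min(req["maxcallsize"], server_max_recv))
    return {"maxcallsize": maxcall, "align": server_align}

reply = server_conf_rdma({"maxcallsize": 32768, "maxreplysize": 8192},
                         server_max_recv=16384, server_align=8)
assert reply == {"maxcallsize": 16384, "align": 8}
```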
7. Memory Registration Overhead
RDMA requires that all data be transferred between registered memory
regions at the source and destination. All protocol headers as well
as separately transferred data chunks must use registered memory.
Since the cost of registering and de-registering memory can be a
large proportion of the RDMA transaction cost, it is important to
minimize registration activity. This is easily achieved within RPC
controlled memory by allocating chunk list data and RPC headers in a
reusable way from pre-registered pools.
The data chunks transferred via RDMA may occupy memory that persists
outside the bounds of the RPC transaction. Hence, the default
behavior of an RDMA transport is to register and de-register these
chunks on every transaction. However, this is not a limitation of
the protocol - only of the existing local RPC API. The API is easily
extended through such functions as rpc_control(3) to change the
default behavior so that the application can assume responsibility
for controlling memory registration through an RPC-provided
registered memory allocator.
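The pre-registered pool idea can be sketched as below. The class and
the register() callback are hypothetical; register() stands in for
the RDMA provider's memory registration verb.

```python
# Sketch: amortize registration cost by registering buffers once and
# reusing them across RPC transactions.
class RegisteredPool:
    def __init__(self, register, bufsize, count):
        # Register once, up front, instead of per-transaction.
        self.free = [register(bytearray(bufsize)) for _ in range(count)]

    def get(self):
        return self.free.pop()    # already registered: no per-RPC cost

    def put(self, buf):
        self.free.append(buf)     # return to pool; stays registered

# Usage with a stand-in register() that just counts calls
calls = []
pool = RegisteredPool(lambda b: (calls.append(1), b)[1],
                      bufsize=1024, count=4)
buf = pool.get()
pool.put(buf)
assert len(calls) == 4   # registration happened once per buffer, total
```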
8. Errors and Error Recovery
Error reporting and recovery is outside the scope of this protocol.
It is assumed that the link itself will provide some degree of error
detection and retransmission. Additionally, the RPC layer itself can
accept errors from the link level and recover via retransmission.
RPC recovery can handle complete loss and re-establishment of the
link.
9. Node Addressing
In setting up a new RDMA connection, the first action by an RPC
client will be to obtain a transport address for the server. The
mechanism used to obtain this address, and to open an RDMA connection
is dependent on the type of RDMA transport, and outside the scope of
this protocol.
10. RPC Binding
RPC services normally register with a portmap or rpcbind service,
which associates an RPC program number with a service address. In
the case of UDP or TCP, the service address for NFS is normally port
2049. This policy should be no different with RDMA interconnects.
One possibility is to have the server's portmapper register itself on
the RDMA interconnect at a "well known" service address. On UDP or
TCP, this corresponds to port 111. A client could connect to this
service address and use the portmap protocol to obtain a service
address in response to a program number, e.g. a VI discriminator or
an InfiniBand GID.
11. Security
ONC RPC provides its own security via the RPCSEC_GSS framework
[RFC2203]. RPCSEC_GSS can provide message authentication, integrity
checking, and privacy. This security mechanism will be unaffected by
the RDMA transport. The data integrity and privacy features alter
the body of the message, presenting it as a single chunk. For large
messages the chunk may be large enough to qualify for RDMA Read
transfer. However, there is much data movement associated with
computation and verification of integrity, or encryption/decryption,
so any performance advantage will be lost.
There should be no new issues here with exposed addresses. The only
exposed addresses here are in the chunk list and in the transport
packets generated by an RDMA. The data contained in these addresses
is adequately protected by RPCSEC_GSS integrity and privacy.
RPCSEC_GSS security mechanisms are typically implemented by the host
CPU. This additional data movement and CPU use may cancel out much
of the RDMA direct placement and offload benefit.
A more appropriate security mechanism for RDMA links may be link-
level protection, such as IPsec, which may be co-located in the RDMA
link hardware. The use of link-level protection may be negotiated
through the use of a new RPCSEC_GSS mechanism like the Credential
Cache GSS Mechanism (CCM) [CCM].
12. IANA Considerations
As a new RPC transport, this protocol should have no effect on RPC
program numbers or registered port numbers. The new RPC transport
should be assigned a new RPC "netid".
13. Acknowledgements
The authors wish to thank Rob Thurlow, John Howard, Chet Juszczak,
Alex Chiu, Peter Staubach, Dave Noveck, Brian Pawlowski, Steve
Kleiman, Mike Eisler, Mark Wittle and Shantanu Mehendale for their
contributions to this document.
14. References
[RDMA] R. Recio et al, "An RDMA Protocol Specification",
Internet Draft, February 2003,
http://www.ietf.org/internet-drafts/
draft-ietf-rddp-rdmap-00.txt
[CCM] M. Eisler, "CCM: The Credential Cache GSS Mechanism",
Internet Draft, February 2003,
http://www.ietf.org/internet-drafts/
draft-eisler-nfsv4-ccm-00.txt
[NFSRDMA]
T. Talpey, S. Shepler, "NFSv4 RDMA and Session Extensions"
http://www.ietf.org/internet-drafts/
draft-talpey-nfsv4-rdma-sess-00.txt
[NFSDDP]
B. Callaghan, T. Talpey, "NFS Direct Data Placement"
Internet Draft, May 2003,
http://www.ietf.org/internet-drafts/
draft-callaghan-nfsdirect-00.txt
[RFC1831]
R. Srinivasan, "RPC: Remote Procedure Call Protocol Specification
Version 2",
Standards Track RFC,
http://www.ietf.org/rfc/rfc1831.txt
[RFC1832]
R. Srinivasan, "XDR: External Data Representation Standard",
Standards Track RFC,
http://www.ietf.org/rfc/rfc1832.txt
[RFC1813]
B. Callaghan, B. Pawlowski, P. Staubach, "NFS Version 3 Protocol
Specification",
Informational RFC,
http://www.ietf.org/rfc/rfc1813.txt
[RFC3530]
S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
Eisler, D. Noveck, "NFS version 4 Protocol",
Standards Track RFC,
http://www.ietf.org/rfc/rfc3530.txt
[RFC2203]
M. Eisler, A. Chiu, L. Ling, "RPCSEC_GSS Protocol Specification",
Standards Track RFC,
http://www.ietf.org/rfc/rfc2203.txt
[RDDP]
Remote Direct Data Placement Working Group Charter,
http://www.ietf.org/html.charters/rddp-charter.html
[RDDPPS]
Remote Direct Data Placement Working Group Problem Statement,
A. Romanow, J. Mogul, T. Talpey, S. Bailey,
http://www.ietf.org/internet-drafts/
draft-ietf-rddp-problem-statement-00.txt
[IB]
Infiniband Architecture Specification,
http://www.infinibandta.org
15. Authors' Addresses
Brent Callaghan
Sun Microsystems, Inc.
17 Network Circle
Menlo Park, California 94025 USA
Phone: +1 650 786 5067
EMail: brent.callaghan@sun.com
Tom Talpey
Network Appliance, Inc.
375 Totten Pond Road
Waltham, MA 02451 USA
Phone: +1 781 768 5329
EMail: thomas.talpey@netapp.com
16. Full Copyright Statement
Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.