Network File System Version 4 C. Lever
Internet-Draft Oracle
Intended status: Informational November 19, 2018
Expires: May 23, 2019
Using Remote Invalidation With RPC-Over-RDMA Transports
draft-cel-nfsv4-reminv-design-09
Abstract
Remote Invalidation relieves RDMA requesters of some of the burden of
preparing memory to be accessed remotely, thus reducing the latency
of RDMA Read and Write operations. This document considers how to
introduce generic support for Remote Invalidation to RPC-over-RDMA
transport protocols.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 23, 2019.
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Lever Expires May 23, 2019 [Page 1]
Internet-Draft RPC-over-RDMA Remote Invalidation November 2018
Table of Contents
1. Introduction
2. Requirements Language
3. General Requirements
3.1. Memory Management Extensions
3.2. Registration Types
3.3. Selecting STags to Invalidate Remotely
3.4. Future Enhancements
4. Remote Invalidation in Operation
4.1. Determining Remote Invalidation Support Status
4.2. Selection of Which STag to Invalidate Remotely
4.3. Reverse-Direction Operation
5. Protocol Elements
5.1. Per Protocol Version Remote Invalidation
5.2. Per Connection Remote Invalidation
5.3. Fixed Protocol Remote Invalidation
5.4. Per RPC Remote Invalidation (Single STag)
5.5. Per RPC Remote Invalidation (Multiple STags)
5.6. Inter-RPC Remote Invalidation
6. Recommendations
6.1. General Considerations
6.2. Choosing a Protocol Extension
6.3. Example Remote Invalidation Protocol
6.4. Corner Cases
7. Security Considerations
8. IANA Considerations
9. References
9.1. Normative References
9.2. Informative References
Acknowledgments
Author's Address
1. Introduction
Similar to RDMA-enabled storage protocols such as iSER [RFC7145], an
RPC-over-RDMA version 1 requester exposes regions of its memory to an
RPC-over-RDMA responder. The responder then uses RDMA Read and Write
operations to transfer bulk data payloads [RFC8166].
In preparation for a bulk data transfer, a requester asks its RNIC to
assign a steering tag, or STag, to a region of memory containing the
data to be moved. At this time, access rights are granted that allow
the RNIC to access or update that memory on behalf of a remote peer.
This act is referred to as "memory registration." The RNIC uses this
STag to steer data to and from the registered memory region.
When data transfer is complete, each STag is dissociated from its
memory region. This act is referred to as "memory invalidation." It
prevents further responder access to that memory region by revoking
its remote access rights. Invalidation should be done before RPC
applications on the requester are allowed access to memory that was
involved in an explicit RDMA operation.
Before an RPC transaction is terminated, the requester is responsible
for fencing memory from the responder. This is a hard requirement of
the transport protocol [RFC8166]. Fencing serializes the completion
of RPC transactions with the invalidation of RPC-over-RDMA chunks.
Therefore the latency of invalidation adds to the total execution
time of each RPC transaction.
Remote Invalidation is a mechanism by which an RDMA peer can request
that the remote peer's RNIC invalidate an STag associated with memory
on that remote peer [RFC5042]. An RDMA consumer requests Remote
Invalidation by posting an RDMA Send With Invalidate Work Request in
place of an RDMA Send Work Request. RDMA Send With Invalidate is
similar to RDMA Send, but takes one additional argument: a single
STag to be invalidated by the RNIC that receives the sent message.
The resulting RDMA Send operation is transmitted with additional
header information that conveys the STag that is to be invalidated
[RFC5040].
The benefit of Remote Invalidation is that the requester is not
required to post an additional Work Request, incur a context switch,
and handle an interrupt to perform memory invalidation as part of
completing an RPC transaction. Memory invalidation is essentially
offloaded to the RNIC. The upshot is faster completion of RPC
transactions that involve registered memory.
This mechanism has the most impact when explicit RDMA operations are
needed to move moderate amounts of data. Invalidation latency is
quite small compared to the time it takes to convey a large payload
with an explicit RDMA operation. Small RPCs are already conveyed
entirely via RDMA Send, thus Remote Invalidation is unnecessary for
them. When the time it takes to invalidate a memory region is on the
same order as the time it takes to move the contents of that region,
Remote Invalidation has its greatest impact.
Remote Invalidation confers benefits similar to the benefits of
increasing the size of Send and Receive buffers. However, Remote
Invalidation does not incur the cost of maintaining a pool of large
Receive buffers on either the requester or responder. Moderate-sized
RPC payloads can be transferred without much of the cost of memory
registration. Requesters can rely on RDMA Write to structure their
Receive buffers without introducing additional latency.
There are some downsides, however. Remote Invalidation is not
available on all RNIC devices. In addition, Remote Invalidation does
not address the extra round trip latency incurred when using RDMA Read.
This extra latency can be eliminated using a large inline threshold
for transmitting RPC Calls.
The purpose of this document is to explore at a high level how Remote
Invalidation can be introduced into the RPC-over-RDMA transport
protocol. The primary design considerations for the transport
protocol are to provide a mechanism to indicate when Remote
Invalidation is safe to use, and to provide selection criteria for
choosing which STag (when there is more than one) to invalidate
remotely. Elements of the XDR definition of the RPC-over-RDMA
protocol will need to be altered to some degree, depending on the
desired flexibility of operation, the invasiveness of XDR changes, and
the breadth of hardware support.
2. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
3. General Requirements
3.1. Memory Management Extensions
Remote Invalidation was not available in the original RDMA Verbs API.
New verbs API objects were specified that include operations that
enable Remote Invalidation, now described in [IBARCH]. The Verbs API
provides a capabilities flag, MEM_MGT_EXTENSIONS, that indicates that
an RNIC and the local verbs implementation can support these new APIs
and objects.
An STag that is registered using the Fast Registration Work Request
(FRWR) mechanism (in a privileged execution context) or is registered
via a Memory Window (in a non-privileged context) may be invalidated
remotely [RFC5040]. These mechanisms are available when an RNIC
supports MEM_MGT_EXTENSIONS.
RDMA Send With Invalidate is available only with MEM_MGT_EXTENSIONS.
3.2. Registration Types
For the purposes of this discussion, there are two classes of STags.
Dynamically-registered STags are used in a single RPC, then
invalidated. Persistently-registered STags are used in multiple RPC
transactions. They may persist for the life of an RPC-over-RDMA
connection, or longer.
In RPC-over-RDMA version 1, a requester may provide more than one
STag in the chunk lists of an RPC. Therefore a requester may provide
any combination of the following registration types in one RPC, any
combination of these in a series of RPCs on the same connection, or
it may use some other registration model.
Examples of persistently-registered STags include:
o The device's reserved DMA R_key
o An STag registered for a connection that doesn't change from RPC
to RPC (for a utility buffer, say)
o An STag registered for a fixed memory region that is updated after
each time it is advertised
o An STag covering a large single region that is utilized in small
segments by many RPCs
Examples of dynamically-registered STags include:
o An STag registered for a single RPC transaction using a legacy
registration mechanism, then invalidated when the RPC is retired
o An STag registered for a single RPC transaction using either
Memory Windows or FRWR, then invalidated when the RPC is retired
Among these examples, only dynamically-registered STags using Memory
Windows or FRWR may be invalidated remotely.
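The classification above can be expressed as a small predicate. The
following is a minimal C sketch, not part of any specification; the
enum, struct, and function names are hypothetical. The second helper
anticipates the rule in Section 3.3: an RPC whose STags are a mix of
eligible and ineligible registrations cannot tolerate Responder's
Choice.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the registration types listed above. */
enum reg_kind {
    REG_DMA_RKEY,       /* the device's reserved DMA R_key (persistent) */
    REG_PERSISTENT,     /* utility buffer, fixed region, or shared region */
    REG_DYNAMIC_LEGACY, /* per-RPC registration via a legacy mechanism */
    REG_DYNAMIC_FRWR,   /* per-RPC registration via FRWR */
    REG_DYNAMIC_MW      /* per-RPC registration via a Memory Window */
};

struct stag_info {
    uint32_t      stag;
    enum reg_kind kind;
};

/* Only dynamically-registered STags created with FRWR or a Memory
 * Window may be invalidated remotely. */
static bool may_invalidate_remotely(const struct stag_info *s)
{
    return s->kind == REG_DYNAMIC_FRWR || s->kind == REG_DYNAMIC_MW;
}

/* True when some, but not all, of an RPC's STags are eligible. Such an
 * RPC needs the requester, not the responder, to pick the STag. */
static bool rpc_mixes_kinds(const struct stag_info *stags, size_t n)
{
    size_t eligible = 0;

    for (size_t i = 0; i < n; i++)
        if (may_invalidate_remotely(&stags[i]))
            eligible++;
    return eligible != 0 && eligible != n;
}
```

Because a responder cannot observe reg_kind, a check like this can only
run on the requester.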
3.3. Selecting STags to Invalidate Remotely
Remote Invalidation protocol mechanisms come in different styles:
Fixed Protocol
The rules by which a responder selects which STag to invalidate
remotely are fixed in the protocol specification.
Responder's Choice
The responder chooses an STag to invalidate remotely from among
all the STags in incoming requests.
Requester's Choice
The requester chooses one or more STags that may be invalidated
remotely, indicating its choices in each request. The responder
chooses an STag to invalidate remotely from among the requester's
picks.
There is no RDMA layer mechanism by which a responder can determine
how a requester-provided STag was registered. Thus a requester that
mixes persistently- and dynamically-registered STags in one RPC, or
mixes them across RPCs on the same connection, cannot tolerate
Responder's Choice.
3.4. Future Enhancements
There are two related enhancements that further reduce the effort
needed to invalidate STags associated with complex RPCs:
o The ability for one registered STag to represent a list of memory
regions that are not contiguous
o The ability to specify more than one remote STag in a single Send
Work Request to be remotely invalidated
At this time, the first mechanism has been implemented in at least
one RNIC on the market. The second is speculative (i.e., has not yet
been implemented anywhere).
Given support for registering non-contiguous memory regions with one
STag, when an RPC-over-RDMA requester constructs an RPC that has both
a Read list and a Write list, the requester has a choice:
o The requester can register a separate STag for each access mode
(one STag for memory regions needing read access, and one STag for
those needing write access) to provide good data security
o The requester can register a single STag with read and write
access enabled for the whole set of memory regions, to allow RDMA
Send With Invalidate to work optimally
Having the ability to remotely invalidate multiple STags at once
enables the combination of optimal performance and optimal security.
4. Remote Invalidation in Operation
When requester memory is registered for remote access, an RPC-over-
RDMA implementation can use Remote Invalidation by following these
steps:
1. The requester DMA-maps a memory region that will participate in
an RPC transaction, then registers an STag for that region.
2. The requester transmits the RPC Call to the responder. This
request conveys the STag to the responder.
3. The responder processes the RPC transaction. The peer RNICs use
the STag to move RPC arguments and/or results.
4. The responder transmits the RPC Reply using an RDMA Send With
Invalidate Work Request, setting the Work Request's inv_handle
field to the value of the STag.
5. A Receive Work Request completes on the requester, carrying this
RPC Reply. The completion reports the invalidated STag.
6. The requester skips invalidation of the STag, then DMA-unmaps the
memory region associated with the STag.
The requester no longer needs to invalidate the STag involved with
this RPC. However, there are additional details that must be
resolved before the use of Remote Invalidation can commence.
4.1. Determining Remote Invalidation Support Status
A requester that does not support Remote Invalidation might not
tolerate the use of RDMA Send With Invalidate by a responder. Such a
requester performs Local Invalidation on STags that already happen to
be invalid. In some cases this results in protection errors or other
issues.
Thus, to avoid spurious connection termination, a responder must not
post an RDMA Send With Invalidate Work Request unless it is sure the
following three conditions are met:
o The requester's RNIC is prepared to receive the additional header
information associated with Remote Invalidation
o The requester has used an appropriate registration mechanism to
register STags it wants invalidated remotely
o The requester is prepared to recognize remotely invalidated STags
during Receive processing to avoid invalidating them a second time
When all three of these conditions are met, a requester can report
positive Remote Invalidation support status to responders using an
Upper Layer Protocol mechanism. When a responder does not know the
requester's Remote Invalidation support status, it cannot use Remote
Invalidation without endangering the connection.
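The division of responsibility described above can be sketched as
follows. This is an illustrative C sketch only; the struct, enum, and
function names are hypothetical, and the Upper Layer Protocol mechanism
that carries the report is left abstract. The key point is that the
responder treats an unknown status exactly like an unsupported one.

```c
#include <stdbool.h>

/* Hypothetical summary of the three conditions, as known to the
 * requester itself. */
struct requester_caps {
    bool rnic_accepts_inv_header;  /* RNIC can receive the extra header */
    bool uses_eligible_reg;        /* FRWR or Memory Window registration */
    bool skips_second_invalidate;  /* recognizes remotely invalidated STags */
};

/* Requester: report positive support only when all three hold. */
static bool requester_reports_support(const struct requester_caps *c)
{
    return c->rnic_accepts_inv_header && c->uses_eligible_reg &&
           c->skips_second_invalidate;
}

/* Responder's view of the requester, learned via an Upper Layer
 * Protocol exchange. */
enum ri_status { RI_UNKNOWN, RI_UNSUPPORTED, RI_SUPPORTED };

enum reply_opcode { OP_SEND, OP_SEND_WITH_INVALIDATE };

/* Responder: use Send With Invalidate only after an explicit positive
 * report, and only when there is an STag to name. */
static enum reply_opcode choose_reply_opcode(enum ri_status st,
                                             bool have_stag)
{
    if (st == RI_SUPPORTED && have_stag)
        return OP_SEND_WITH_INVALIDATE;
    return OP_SEND;
}
```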
4.2. Selection of Which STag to Invalidate Remotely
The RDMA Send With Invalidate Work Request invalidates only one STag.
RPC-over-RDMA requesters may register more than one STag to handle
the movement of payloads for a single RPC. Either the requester will
have to specify which STag may be remotely invalidated, the protocol
will have to specify a fixed way to select which STag to invalidate,
or the responder will have to choose arbitrarily which STag to
remotely invalidate.
In some circumstances, requesters may wish to utilize STags during
transactions that are registered using a mechanism that does not
tolerate Remote Invalidation. For example, an STag that is the
requester's local DMA R_key should never be invalidated remotely. If
a responder attempts to invalidate such an STag, the result is
undefined, but the connection may be terminated or other failures can
occur.
Even with Remote Invalidation enabled, requesters remain responsible
for ensuring all STags are invalid before RPC transactions complete.
To avoid leaving STags registered, a requester must be prepared for
the possibility that neither the responder nor the requester's own
RNIC has invalidated any of an RPC's STags. When there are multiple
STags associated with a single RPC, a requester must be prepared for
any of the STags to have been remotely invalidated, or for all of the
RPC's STags to remain registered.
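The requester-side obligation above can be illustrated with a minimal
C sketch. The names struct rpc_mr, post_local_invalidate(), and
fence_rpc_memory() are hypothetical, and the Local Invalidate Work
Request is stubbed out; a real implementation would post it through
its RDMA provider and wait for its completion. The sketch assumes the
Receive completion reports at most one remotely invalidated STag,
since RDMA Send With Invalidate carries only one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-RPC bookkeeping on the requester. */
struct rpc_mr {
    uint32_t stag;
    bool     valid;   /* still registered, from the requester's view */
};

/* Hypothetical stand-in for posting a Local Invalidate Work Request.
 * The counter exists only to make the sketch observable. */
static size_t local_invalidations;

static void post_local_invalidate(struct rpc_mr *mr)
{
    local_invalidations++;
    mr->valid = false;
}

/* Called when the Receive carrying the RPC Reply completes. If
 * remote_inv is true, inv_stag holds the STag the RNIC reported as
 * invalidated. Every other still-valid STag is invalidated locally,
 * so no STag is invalidated twice and none is left registered. */
static void fence_rpc_memory(struct rpc_mr *mrs, size_t n,
                             bool remote_inv, uint32_t inv_stag)
{
    for (size_t i = 0; i < n; i++) {
        if (!mrs[i].valid)
            continue;
        if (remote_inv && mrs[i].stag == inv_stag) {
            mrs[i].valid = false;   /* RNIC already did it: skip */
            continue;
        }
        post_local_invalidate(&mrs[i]);
    }
    /* Only now may the memory be DMA-unmapped and the RPC completed. */
}
```

The same loop covers the case where nothing was invalidated remotely
(remote_inv is false) and the case where the responder chose any one
of the RPC's STags.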
4.3. Reverse-Direction Operation
As of this writing, no current RPC-over-RDMA implementation supports
direct data placement in the reverse direction. However, existing
protocol specifications do not forbid it [RFC8166] [RFC8167]
[I-D.cel-nfsv4-rpcrdma-version-two].
When chunks are present in a reverse-direction RPC request, Remote
Invalidation allows the responder to trigger invalidation of a
requester's STags as part of sending a reply, the same as in the
forward direction.
However, in the reverse direction, the server acts as the requester,
and the client is the responder. The server's RNIC, therefore, must
support receiving an Invalidate Extended Transport Header (IETH), and
the server must have registered the STags with an appropriate
registration mechanism. Thus the server
must indicate its Remote Invalidation support status to the client
(the opposite of forward direction Remote Invalidation).
5. Protocol Elements
In this section, a number of abstract protocol variations are
considered. These vary in functionality and the invasiveness of
changes to the transport protocol's XDR definition. Some of these
variations might be appropriate to use in combination.
5.1. Per Protocol Version Remote Invalidation
5.1.1. Description
When a higher protocol version number is negotiated, Remote
Invalidation is always enabled. Both peers assume that Remote
Invalidation may be used in either direction.
5.1.2. Similar Existing Implementations
SMB Direct [MS-SMBD]
5.1.3. Advantages
No XDR changes or protocol extensions are required.
Reverse direction use of Remote Invalidation is automatically
supported.
5.1.4. Disadvantages
The requester is not in control of which STags in an RPC may be
invalidated. Thus, a requester must not advertise STags which must
never be invalidated, or the protocol must specify a fixed choice of
which STag(s) in each request are allowed to be invalidated remotely.
This new protocol version would then be usable only with RNICs that
support Remote Invalidation. Other features and benefits of the new
protocol version would not be available when an implementation
employs an RNIC that does not support Remote Invalidation. In
particular, RNICs that do not support MEM_MGT_EXTENSIONS could not
use the new protocol version.
An extension or an additional protocol version bump is required to
indicate support for transport-level mechanisms that can invalidate
multiple STags at once.
5.2. Per Connection Remote Invalidation
5.2.1. Description
At connection initiation time, messages are exchanged that indicate
each peer's Remote Invalidation support status. Without these
messages, peers assume Remote Invalidation is not supported.
5.2.2. Similar Existing Implementations
iSER [RFC7145]. Information is exchanged in RDMA-CM connection
requests to report an implementation's Remote Invalidation support
status.
5.2.3. Advantages
No changes to the base protocol XDR are required.
5.2.4. Disadvantages
Out-of-band messages are required to establish Remote Invalidation
support status.
The requester is not in control of which STags in an RPC may be
invalidated. Thus, a requester must not advertise STags which must
never be invalidated.
To support reverse-direction operation, the server must separately
indicate that it supports Remote Invalidation.
To enable support for multiple STag invalidation, this negotiation
protocol would have to be extended again to indicate when mechanisms
other than RDMA Send With Invalidate are supported by the requester's
RNIC.
5.3. Fixed Protocol Remote Invalidation
5.3.1. Description
The protocol specification determines how the responder chooses which
STag is to be invalidated remotely. Some other means is used to
determine whether Remote Invalidation can be used or not.
5.3.2. Similar Existing Implementations
iSER [RFC7145]. Two STag fields appear in each request: one
advertises Read data and one advertises Write data. When only one
STag is used in the request, it may be invalidated remotely. When
both STags are used, only the Read STag may be invalidated remotely.
SMB Direct [MS-SMBD]. The responder always chooses the first STag in
each request to be invalidated remotely.
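The two fixed rules described above can be written as pure selection
functions. This is a hypothetical C sketch of the rules as this
document states them, not code taken from either specification; it
assumes, for illustration only, that zero marks an absent STag.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_STAG 0u   /* hypothetical marker: "no STag present" */

/* iSER-style rule: a lone STag may be invalidated; when both are
 * present, only the Read STag may be invalidated. */
static uint32_t iser_style_pick(uint32_t read_stag, uint32_t write_stag)
{
    if (read_stag != NO_STAG)
        return read_stag;
    return write_stag;   /* may itself be NO_STAG */
}

/* SMB Direct-style rule: the first STag in the request is invalidated. */
static uint32_t smbd_style_pick(const uint32_t *stags, size_t n)
{
    return n > 0 ? stags[0] : NO_STAG;
}
```

Neither function lets the requester veto a choice, which is why both
rules require that ineligible STags never be advertised.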
5.3.3. Advantages
No changes to the base protocol XDR are required.
5.3.4. Disadvantages
Out-of-band messages are required to establish support status.
The requester is not in control of which STags in an RPC may be
invalidated. Thus, a requester must not advertise STags which must
never be invalidated.
This mechanism may not work well for transport protocols that allow
multiple read and write STags.
5.4. Per RPC Remote Invalidation (Single STag)
5.4.1. Description
A field is added to the transport header that contains an STag which
may be invalidated by the responder. A special value can be chosen
to mean "no STag may be invalidated" for use by requesters that have
no support for Remote Invalidation.
5.4.2. Similar Existing Implementations
None.
5.4.3. Advantages
A requester may advertise STags that cannot be invalidated remotely,
as long as they are never marked as "may invalidate."
No out-of-band support status negotiation is needed.
Reverse-direction RPCs can each indicate whether a reverse-direction
requester desires or does not support Remote Invalidation.
The responder needs no special logic or assumptions to choose the
STag to invalidate remotely.
5.4.4. Disadvantages
Either the base RPC-over-RDMA header XDR definition is altered, or a
protocol extension is required.
Requesters transmit a little extra data per RPC, making RPC-over-RDMA
messages slightly more costly to send and parse.
This mechanism cannot support the remote invalidation of multiple
STags at once.
5.5. Per RPC Remote Invalidation (Multiple STags)
5.5.1. Description
A new data structure is added to the transport header that indicates
which STags may be invalidated by the responder.
This information might appear as a new field in the RDMA segment data
structure, as each segment has its own STag field. The field
indicates whether or not that STag may be invalidated by the
responder. Perhaps that field is a boolean, though in XDR, a boolean
is a full 32 bits.
Or, this information could appear in the header as an array of STags,
to reduce the amount of extra data contained in the RPC-over-RDMA
header. Zero array elements means the requester does not support
Remote Invalidation.
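The per-segment variant can be sketched in C as follows. The handle,
length, and offset fields mirror the RPC-over-RDMA version 1 segment
[RFC8166]; the may_invalidate flag, struct segment_ri, and
invalidatable_handles() are hypothetical. The responder gathers every
handle the requester marked; with RDMA Send With Invalidate it can use
only one of them, but a future multi-STag mechanism could use them all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Version 1 segment fields plus a hypothetical per-segment flag. */
struct segment_ri {
    uint32_t handle;
    uint32_t length;
    uint64_t offset;
    bool     may_invalidate;   /* hypothetical: requester permits it */
};

/* Copy up to out_max marked handles into out and return how many were
 * written. A result of zero means the requester has not asked for
 * Remote Invalidation, so the responder uses a plain Send. */
static size_t invalidatable_handles(const struct segment_ri *segs,
                                    size_t n, uint32_t *out,
                                    size_t out_max)
{
    size_t k = 0;

    for (size_t i = 0; i < n && k < out_max; i++)
        if (segs[i].may_invalidate)
            out[k++] = segs[i].handle;
    return k;
}
```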
5.5.2. Similar Existing Implementations
NVMe/Fabrics [NVME]. Each STag in a request has an associated bit
flag that indicates whether the responder is allowed to invalidate it
remotely.
5.5.3. Advantages
A requester may advertise STags that cannot be invalidated remotely,
as long as they are never marked as "may invalidate."
The mechanism allows a requester either to request invalidation of
multiple STags at once or to choose one STag to invalidate remotely.
No out-of-band support status negotiation is needed.
Each reverse-direction RPC can indicate whether a reverse-direction
requester desires or does not support Remote Invalidation.
The responder needs no special logic or assumptions to choose the
STag to invalidate remotely.
5.5.4. Disadvantages
The RPC-over-RDMA header XDR definition is possibly extensively
altered.
Requesters transmit extra data per RPC. However, it is limited to
only one or two 32-bit words in most cases.
5.6. Inter-RPC Remote Invalidation
5.6.1. Description
As a subfeature of support for Remote Invalidation, it is possible
that a responder can remotely invalidate an STag (using RDMA Send
With Invalidate) that refers to registered memory being used in the
Read chunk of a different RPC. Such Remote Invalidation would be
requested only after the responder has already completed its RDMA
Read.
This can be useful when a responder is replying to an RPC via an
inline message, but notices there are other RPC replies pending that
have multiple STags, some of which are Read chunks.
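The extra responder bookkeeping this feature implies can be sketched
as a small queue. This is a hypothetical C illustration (struct
pending_inv, pending_push(), and pending_pop() are not from any
implementation). It holds requester STags whose RDMA Read has already
completed but whose own RPC has not yet been answered. When the
responder sends an inline Reply with no STag of its own, it takes one
queued STag to name in RDMA Send With Invalidate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_MAX 16

/* Hypothetical responder-side FIFO of Read-chunk STags that are safe
 * to invalidate because the RDMA Read on them has completed. */
struct pending_inv {
    uint32_t stag[PENDING_MAX];
    size_t   head;
    size_t   count;
};

/* Record an STag once the responder's RDMA Read on it completes. */
static bool pending_push(struct pending_inv *q, uint32_t stag)
{
    if (q->count == PENDING_MAX)
        return false;
    q->stag[(q->head + q->count) % PENDING_MAX] = stag;
    q->count++;
    return true;
}

/* Take the oldest queued STag for an otherwise STag-less Reply.
 * Returns false if nothing is queued. */
static bool pending_pop(struct pending_inv *q, uint32_t *stag)
{
    if (q->count == 0)
        return false;
    *stag = q->stag[q->head];
    q->head = (q->head + 1) % PENDING_MAX;
    q->count--;
    return true;
}
```

A complete design would also need to drop an STag from this queue when
its own RPC's Reply invalidates it, and the requester would need to
match invalidation reports against RPCs other than the one being
completed. That is the added complexity noted below.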
5.6.2. Similar Existing Implementations
None
5.6.3. Advantages
This is one way to enable remote invalidation of multiple STags per
RPC, using only RDMA Send With Invalidate.
5.6.4. Disadvantages
Additional requester and responder complexity would be required to
keep track of STags.
6. Recommendations
6.1. General Considerations
When constructing a protocol to support Remote Invalidation, one of
the above designs, or some combination of them, may be chosen.
In no particular order, the author feels that the design priorities
are:
o Do not prevent the efficient operation of RNICs that do not handle
RDMA Send With Invalidate
o Introduce as little impact on header XDR and header length as
possible, to keep collateral performance and complexity impacts
low
o Enable support for Remote Invalidation when explicit RDMA is used
in reverse-direction RPCs.
An important question is whether the base RPC-over-RDMA protocol
should support Remote Invalidation, whether Remote Invalidation
support should be carried entirely on the shoulders of protocol
extensions, or whether some combination of the two is best.
Upper Layer Protocols will likely always be responsible for some
degree of signaling Remote Invalidation capabilities, as long as
innovation continues at the transport layer (e.g., new RDMA
operations that enable multi-STag Remote Invalidation). Predicting
future hardware capabilities is challenging, limiting the ability to
design long-lived protocol support for them.
Lastly, it is difficult to estimate how long the industry must
continue to support less capable devices.
6.2. Choosing a Protocol Extension
All things being equal, making no changes to the base XDR definition
has great appeal. If the mechanism in Section 5.2 can be broadly
effective at enabling Remote Invalidation in the current set of RPC-
over-RDMA implementations, it would be the proper choice.
Unfortunately, among current RPC-over-RDMA client implementations,
there is one client that can immediately use a per-connection style
protocol, and one that can use only a per-RPC style protocol such as
Section 5.4. A third known client resides in user space and uses FMR
registration, thus it is incapable of immediately employing Remote
Invalidation.
Because there is a wide latitude of implementation choice already
allowed by the RPC-over-RDMA transport protocol, the author's
preference is to implement Section 5.4. The target STag can be added
to the RPC-over-RDMA transport as a single field in a new version of
the RPC-over-RDMA transport protocol. No further changes or
extensions are needed.
In the longer term, the requester appears to be in the better
position to determine which STag may be invalidated remotely. With
this mechanism, the requester can choose based on which STags may be
invalidated remotely, or may use criteria based on the strengths of
its RNIC. For instance, choosing the largest registered memory
region might be beneficial in some cases.
Allowing the responder to select from among several choices does not
seem to bring additional value, and burdens the responder with
additional header parsing costs for each chunk-bearing RPC
transaction.
Furthermore, the ability to request Remote Invalidation of multiple
STags in a single Work Request appears to be somewhat distant. It
would require additional Upper Layer Protocol mechanisms to
distinguish the new mechanism from using RDMA Send With Invalidate,
which we are not in a position to design today. Thus it does not
seem worth the extra implementation and protocol complexity of having
the requester provide a list of STags for the responder to choose
from.
As an alternative to modifying the XDR definition for the RDMA_MSG
and RDMA_NOMSG message types, a new RDMA message type could be
introduced in a new version of RPC-over-RDMA that provides similar
functionality to RDMA_MSG and RDMA_NOMSG but adds one or more new
fields. This has the advantage of leaving the version 1-compatible
parts of the new XDR definition unchanged. It is an open
question whether this introduces more complexity to existing
implementations than adding new fields to RDMA_MSG and RDMA_NOMSG.
There is precedent for this approach: it is similar to the
introduction of READ_PLUS in the specification of NFSv4.2 [RFC7862].
Allowing the feature described in Section 5.6 is likely to increase
the complexity of responder and especially requester implementations,
as they would have to remember invalidated STags independently of RPC
completions. Because it does not require any XDR changes, it could
easily be enabled in a future protocol extension. The author's
preference is to forbid this behavior in the initial specification,
but allow for a future extension to introduce it.
6.3. Example Remote Invalidation Protocol
As an example of how to proceed, the simplest approach would replace
struct rpcrdma2_chunk_lists (as defined in
[I-D.cel-nfsv4-rpcrdma-version-two]) with the following:
<CODE BEGINS>
struct rpcrdma2_chunk_lists {
        enum msg_type                   rdma_direction;
        /* New: an rdma_handle that may be invalidated
         * remotely, or zero if none may be */
        uint32                          rdma_inv_handle;
        struct rpcrdma2_read_list       *rdma_reads;
        struct rpcrdma2_write_list      *rdma_writes;
        struct rpcrdma2_write_chunk     *rdma_reply;
};
<CODE ENDS>
The following language describes how to utilize the new field:
The requester sets the value of the rdma_inv_handle field to the
value of any one of the rdma_handle fields in the RPC-over-RDMA
header of the RPC Call that may be invalidated remotely. If the
RPC-over-RDMA header of the RPC Call contains no rdma_handles that
may be invalidated remotely, the requester MUST set the value of
the rdma_inv_handle field to zero.
If the rdma_inv_handle field in the RPC-over-RDMA header of an RPC
Call contains zero, the responder MUST NOT use RDMA Send With
Invalidate to transmit the matching RPC Reply. Otherwise, the
responder SHOULD use RDMA Send With Invalidate to transmit the RPC
Reply, specifying the value in the RPC-over-RDMA header's
rdma_inv_handle field as the Send With Invalidate Work Request's
inv_rkey.
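Both halves of this language can be sketched in C. The struct and
function names are hypothetical. The rules permit any eligible handle;
this sketch happens to pick the largest eligible segment, one of the
criteria mentioned in Section 6.2. Like the protocol text, it assumes
that zero is never a valid eligible STag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical requester view of one advertised segment. */
struct adv_segment {
    uint32_t rdma_handle;
    uint32_t rdma_length;
    bool     remote_inv_ok;   /* registered via FRWR or Memory Window */
};

/* Requester: value for rdma_inv_handle, or zero when no advertised
 * handle may be invalidated remotely. */
static uint32_t choose_inv_handle(const struct adv_segment *segs,
                                  size_t n)
{
    uint32_t handle = 0;
    uint32_t best_len = 0;
    bool found = false;

    for (size_t i = 0; i < n; i++) {
        if (!segs[i].remote_inv_ok)
            continue;
        if (!found || segs[i].rdma_length > best_len) {
            handle = segs[i].rdma_handle;
            best_len = segs[i].rdma_length;
            found = true;
        }
    }
    return handle;
}

enum reply_op { REPLY_SEND, REPLY_SEND_WITH_INV };

/* Responder: zero means MUST NOT use Send With Invalidate; otherwise
 * use it (SHOULD), copying the field into the Work Request's inv_rkey. */
static enum reply_op choose_reply(uint32_t rdma_inv_handle,
                                  uint32_t *inv_rkey)
{
    if (rdma_inv_handle == 0)
        return REPLY_SEND;
    *inv_rkey = rdma_inv_handle;
    return REPLY_SEND_WITH_INV;
}
```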
6.4. Corner Cases
A remote invalidation-enabled client remains responsible for
protecting its registered memory even when there is no Reply.
Consider these important corner cases:
o The responder never sends a Reply to a Call-only procedure, so
there is no opportunity for Remote Invalidation. Moreover, if the
transport protocol has no RDMA_DONE message, requesters cannot
know when they may safely invalidate registered memory used for
Call arguments. Therefore, memory registration should not be used
for RPC procedures that do not expect a Reply.
o The RPC Reply is lost but the responder is still functional. In
some cases, the Upper Layer Protocol requires that the responder
close the connection to signal the loss of an RPC transaction.
Closing the connection renders existing STags invalid.
o An application on a client is interrupted before the RPC Reply
completes on the requester, or the RPC transaction times out
waiting for a Reply. This exposes a race condition:
* The MR has already been invalidated by the requester when the
RPC Reply arrives at the RNIC. Typically this results in a
Memory Management Operation error, and the QP is placed in the
Error state.
* The MR has already been invalidated by the RNIC when the
requester invalidates locally. This also typically results in
a Memory Management Operation error, and the QP is placed in
the Error state.
A protocol mechanism that enables the requester to indicate to the
responder that an RPC transaction has been canceled can be used to
avoid this race. Otherwise, the requester and responder
implementations must tolerate connection loss and re-
establishment.
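The requester-side bookkeeping implied by these corner cases can be
sketched in C as follows. This is an illustrative model under stated
assumptions, not an implementation: the types and function names are
hypothetical, local_invalidate() merely stands in for posting a Local
Invalidate Work Request, and inv_rkey is assumed to be the STag that
the receive completion reports as remotely invalidated, or zero when
the Reply arrived via an ordinary Send. The sketch covers only the
normal Reply path; it does not close the race with an interrupted
RPC, which still requires a cancellation mechanism or tolerance of
connection loss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mr_state { MR_REGISTERED, MR_INVALID };

/* Hypothetical requester record for one registered memory region. */
struct req_mr {
    uint32_t      rkey;   /* STag advertised in the RPC Call */
    enum mr_state state;
};

/* Call-only procedures never receive a Reply: there is no Remote
 * Invalidation and, absent an RDMA_DONE message, no signal that local
 * invalidation is safe.  Do not register memory for them. */
static bool
may_register_for_call(bool expects_reply)
{
    return expects_reply;
}

/* Stand-in for posting a Local Invalidate Work Request.  Invalidating
 * an MR that is already invalid typically causes a Memory Management
 * Operation error, so this model treats it as a bug. */
static void
local_invalidate(struct req_mr *mr)
{
    assert(mr->state == MR_REGISTERED);
    mr->state = MR_INVALID;
}

/* On Reply receipt: record the STag the responder invalidated (if
 * any), then locally invalidate only the MRs still registered.
 * Returns the number of local invalidations performed. */
static size_t
reply_received(struct req_mr *mrs, size_t nmrs, uint32_t inv_rkey)
{
    size_t local = 0;

    for (size_t i = 0; i < nmrs; i++) {
        if (inv_rkey != 0 && mrs[i].state == MR_REGISTERED &&
            mrs[i].rkey == inv_rkey)
            mrs[i].state = MR_INVALID;  /* already done by the RNIC */
    }
    for (size_t i = 0; i < nmrs; i++) {
        if (mrs[i].state == MR_REGISTERED) {
            local_invalidate(&mrs[i]);
            local++;
        }
    }
    return local;
}
```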
7. Security Considerations
Remote Invalidation metadata is conveyed in the clear in RPC-over-
RDMA headers. This does not expose any new information to attackers.
A man-in-the-middle can alter Remote Invalidation metadata while it
is in transit. Requesters are prepared to handle the case where
responders have not invalidated any STags associated with an RPC.
However, an attacker can cause other in-flight STags to be
invalidated before the responder has finished with the associated
memory. Alternatively, an attacker can replace the to-be-invalidated
STag with an STag in the same RPC that should not be invalidated
remotely. Either of these might cause loss of connection or other
failures, resulting in denial of service.
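A requester cannot prevent such tampering, but it can detect some of
it. The C sketch below is one possible defensive check; this
document does not specify it. It compares the STag that the receive
completion reports as remotely invalidated against the
rdma_inv_handle value the requester placed in the RPC Call. The names
are hypothetical, and the sketch again assumes that zero never names
a remotely invalidatable STag. A requester might handle an unexpected
invalidation like other fatal transport errors, for example by
closing the connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum inv_check {
    INV_NONE,       /* Reply arrived via an ordinary Send */
    INV_EXPECTED,   /* responder invalidated the STag the requester chose */
    INV_UNEXPECTED  /* some other STag was invalidated */
};

/* Hypothetical check run when an RPC Reply is received.
 * sent_inv_handle: rdma_inv_handle value in the matching RPC Call.
 * reply_had_inv:   true if the Reply arrived via Send With Invalidate.
 * reported_rkey:   STag the receive completion reports as invalidated. */
static enum inv_check
check_remote_inv(uint32_t sent_inv_handle, bool reply_had_inv,
                 uint32_t reported_rkey)
{
    if (!reply_had_inv)
        return INV_NONE;        /* tolerated: invalidate locally instead */
    if (sent_inv_handle != 0 && reported_rkey == sent_inv_handle)
        return INV_EXPECTED;
    return INV_UNEXPECTED;      /* tampering or a faulty responder */
}
```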
A connection must exist between a requester and a responder. The
requester's RNIC associates a Protection Domain with that connection,
and the STag to be invalidated on the requester is associated with
that same Protection Domain. This prevents network nodes that are not
part of the connection from invalidating arbitrary STags.
Further discussion appears in [RFC5042].
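The Protection Domain check described above can be modeled in a few
lines of C. This is a deliberately simplified, hypothetical model of
what an RNIC does when a Send With Invalidate arrives, not a
description of any particular RNIC or verbs API; [RFC5040] and
[RFC5042] define the actual rules. Real RNICs perform this check in
hardware, and a failed check typically results in an error on the
connection rather than a return value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical RNIC-side record of one STag. */
struct stag_entry {
    uint32_t stag;
    uint32_t pd;     /* Protection Domain the STag belongs to */
    bool     valid;
};

/* Model of an inbound Send With Invalidate naming entry e, arriving on
 * a connection whose QP belongs to Protection Domain qp_pd.  The STag
 * is invalidated only if it is currently valid and in the same PD. */
static bool
remote_invalidate(struct stag_entry *e, uint32_t qp_pd)
{
    if (!e->valid || e->pd != qp_pd)
        return false;   /* rejected; modeled here as a failure return */
    e->valid = false;
    return true;
}
```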
8. IANA Considerations
This document does not require actions by IANA.
9. References
9.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC5040] Recio, R., Metzler, B., Culley, P., Hilland, J., and D.
Garcia, "A Remote Direct Memory Access Protocol
Specification", RFC 5040, DOI 10.17487/RFC5040, October
2007, <https://www.rfc-editor.org/info/rfc5040>.
[RFC5042] Pinkerton, J. and E. Deleganes, "Direct Data Placement
Protocol (DDP) / Remote Direct Memory Access Protocol
(RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October
2007, <https://www.rfc-editor.org/info/rfc5042>.
[RFC8166] Lever, C., Ed., Simpson, W., and T. Talpey, "Remote Direct
Memory Access Transport for Remote Procedure Call Version
1", RFC 8166, DOI 10.17487/RFC8166, June 2017,
<https://www.rfc-editor.org/info/rfc8166>.
[RFC8167] Lever, C., "Bidirectional Remote Procedure Call on RPC-
over-RDMA Transports", RFC 8167, DOI 10.17487/RFC8167,
June 2017, <https://www.rfc-editor.org/info/rfc8167>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
9.2. Informative References
[I-D.cel-nfsv4-rpcrdma-version-two]
Lever, C. and D. Noveck, "RPC-over-RDMA Version 2
Protocol", draft-cel-nfsv4-rpcrdma-version-two-08 (work in
progress), November 2018.
[IBARCH] InfiniBand Trade Association, "InfiniBand Architecture
Specification Volume 1", Release 1.3, March 2015,
<http://www.infinibandta.org/content/
pages.php?pg=technology_download>.
[MS-SMBD] Microsoft Corporation, "SMB Remote Direct Memory Access
(RDMA) Transport Protocol Specification", July 2016.
[NVME] NVM Express, Inc., "NVM Express Revision 1.2.1", July
2016.
[RFC7145] Ko, M. and A. Nezhinsky, "Internet Small Computer System
Interface (iSCSI) Extensions for the Remote Direct Memory
Access (RDMA) Specification", RFC 7145,
DOI 10.17487/RFC7145, April 2014,
<https://www.rfc-editor.org/info/rfc7145>.
[RFC7862] Haynes, T., "Network File System (NFS) Version 4 Minor
Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862,
November 2016, <https://www.rfc-editor.org/info/rfc7862>.
Acknowledgments
The author wishes to thank Sagi Grimberg, Christoph Hellwig, Karen
Deitke, Dave Noveck, and Tom Talpey. The author also wishes to thank
Bill Baker and Greg Marsden for their support of this work.
Special thanks go to Transport Area Director Spencer Dawkins, NFSV4
Working Group Chairs Spencer Shepler and Brian Pawlowski, and NFSV4
Working Group Secretary Thomas Haynes for their support.
Author's Address
Charles Lever
Oracle Corporation
1015 Granger Avenue
Ann Arbor, MI 48104
United States of America
Email: chuck.lever@oracle.com