NFSv4 (provisionally)                                          T. Talpey
Internet-Draft                                                 Microsoft
Updates: 5040 7306 (if approved)                               T. Hurson
Intended status: Standards Track                                   Intel
Expires: September 10, 2020                                   G. Agarwal
                                                                 Marvell
                                                                  T. Reu
                                                                 Chelsio
                                                           March 9, 2020
RDMA Extensions for Enhanced Memory Placement
draft-talpey-rdma-commit-01
This document specifies extensions to RDMA (Remote Direct Memory Access) protocols to provide capabilities in support of enhanced remotely-directed data placement on persistent memory-addressable devices. The extensions include new operations supporting remote commitment to persistence of remotely-managed buffers, which can provide enhanced guarantees and improve performance for low-latency storage applications. In addition to, and in support of, these operations, extensions to local behaviors are described, which may be used to guide implementation and to ease adoption. This document updates RFC5040 (Remote Direct Memory Access Protocol (RDMAP)) and RFC7306 (RDMA Protocol Extensions).
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 10, 2020.
Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
The RDMA Protocol (RDMAP) and RDMA Protocol Extensions (RDMAPEXT) provide capabilities for secure, zero copy data communications that preserve memory protection semantics, enabling more efficient network protocol implementations. The RDMA Protocol is part of the iWARP family of specifications which also include the Direct Data Placement Protocol (DDP), and others as described in the relevant documents. For additional background on RDMA Protocol applicability, see "Applicability of Remote Direct Memory Access Protocol (RDMA) and Direct Data Placement Protocol (DDP)" RFC5045.
RDMA protocols are enjoying good success in improving the performance of remote storage access, and have been well-suited to semantics and latencies of existing storage solutions. However, new storage solutions are emerging with much lower latencies, driving new workloads and new performance requirements. Also, storage programming paradigms SNIANVMP are driving new requirements of the remote storage layers, in addition to driving down latency tolerances. Overcoming these latencies, and providing the means to achieve persistence and/or visibility without invoking upper layers and remote CPUs for each such request, are the motivators for the extensions in this document.
This document specifies the following extensions to the RDMA Protocol (RDMAP) and its local memory ecosystem:
The extensions defined in this document do not require the RDMAP version to change.
This document is an extension of RFC 5040 and RFC7306, and key words are additionally defined in the glossaries of the referenced documents.
The following additional terms are used in this document as defined.
RDMA is widely deployed in support of storage and shared memory over increasingly low-latency and high-bandwidth networks. The state of the art today yields end-to-end network latencies on the order of one to two microseconds for message transfer, and bandwidths exceeding 100 gigabit/s. These bandwidths are expected to increase over time, with latencies decreasing as a direct result.
In storage, another trend is emerging - greatly reduced latency of persistently storing data blocks. While best-of-class Hard Disk Drives (HDDs) have delivered average latencies of several milliseconds for many years, Solid State Disks (SSDs) have improved this by one to two orders of magnitude. Technologies such as NVM Express NVMe yield even higher-performing results by eliminating the traditional storage interconnect. The latest technologies providing memory-based persistence, such as Nonvolatile Memory DIMM NVDIMM, place storage-like semantics directly on the memory bus, reducing latency to less than a microsecond and increasing bandwidth to potentially many tens of gigabyte/s. [supporting data to be added]
RDMA protocols, in turn, are used for many storage protocols, including NFS/RDMA RFC5661 RFC8166 RFC8267, SMB Direct MS-SMB2 MS-SMBD and iSER RFC7145, to name just a few. These protocols allow storage and computing peers to take full advantage of these highly performant networks and storage technologies to achieve remarkable throughput, while minimizing the CPU overhead needed to drive their workloads. This leaves more computing resources available for the applications, which in turn can scale to even greater levels. Within the context of Cloud-based environments, and through scale-out approaches, this can directly reduce the number of servers that need to be deployed, making such attributes highly compelling.
However, limiting factors come into play when deploying ultra-low latency storage in such environments:
The first issue is fundamental, and due to the nature of serial, shared communication channels, presents challenges that are not easily bypassed. Communication cannot exceed the speed of light, for example, and serialization/deserialization plus packet processing adds further delay. Therefore, an RDMA solution which offloads and reduces the overhead of exchanges which encounter such latencies is highly desirable.
The second issue requires that outbound transfers be made as efficient as possible, so that replication of data can be done with minimal overhead and delay (latency). A reliable "push" RDMA transfer method is highly suited to this.
The third issue requires that the transfer be performed without an upper-layer exchange required. Within security constraints, RDMA transfers, arbitrated only by lower layers into well-defined and pre-advertised buffers, present an ideal solution.
Addressing the fourth issue requires significant CPU activity, consuming power and valuable resources, and the required behavior may not be guaranteed by the RDMA protocols, which make no requirement on the order in which certain received data is placed or becomes visible; such guarantees are made only after signaling a completion to upper layers.
The RDMAP and DDP protocols, together, provide data transfer semantics with certain consistency guarantees to both the sender and receiver. Delivery of data transferred by these protocols is said to have been Placed in destination buffers upon Completion of specific operations. In general, these guarantees are limited to the visibility of the transferred data within the hardware domain of the receiver (data sink). Significantly, the guarantees do not necessarily extend to the actual storage of the data in memory cells, nor do they convey any guarantee that the data integrity is intact, nor that it remains present after a catastrophic failure. These guarantees may be provided by upper layers, such as the ones mentioned, after processing the Completions, and performing the necessary operations.
The NFSv4.1, SMB3 and iSER protocols are, respectively, file and block oriented, and have been used extensively for providing access to hard disk and solid state flash drive media. Such devices incur certain latencies in their operation, from the millisecond-order rotational and seek delays of rotating disk hardware, or the 100-microsecond-order erase/write and translation layers of solid state flash. These file and block protocols have benefited from the increased bandwidth, lower latency, and markedly lower CPU overhead of RDMA to provide excellent performance for such media, approximately 30-50 microseconds for 4KB writes in leading implementations.
These protocols employ a "pull" model for write: the client, or initiator, sends an upper layer write request which contains an RDMA reference to the data to be written. The upper layer protocols encode this as one or more memory regions. The server, or target, then prepares the request for local write execution, and "pulls" the data with an RDMA Read. After processing the write, a response is returned. There are therefore two or more roundtrips on the RDMA network in processing the request. This is desirable for several reasons, as described in the relevant specifications, but it incurs latency. However, since as mentioned the network latency has been so much less than the storage processing, this has been a sound approach.
Today, a new class of Storage Class Memory is emerging, in the form of Non-Volatile DIMM and NVM Express devices, among others. These devices are characterized by further reduced latencies, in the 10-microsecond-order range for NVMe, and sub-microsecond for NVDIMM. The 30-50 microsecond write latencies of the above file and block protocols are therefore one to two orders of magnitude larger than the storage media! The client/server processing model of traditional storage protocols is no longer amortizable at an acceptable level into the overall latency of storage access, due to its requiring request/response communication, CPU processing by both the server and client (or target and initiator), and the interrupts to signal such requests.
Another important property of certain such devices is the requirement for explicitly requesting that the data written to them be made persistent. Because persistence requires that data be committed to memory cells, it is a relatively expensive operation in time (and power), and in order to maintain the highest device throughput and most efficient operation, the device "commit" operation is explicit. When the data is written by an application on the local platform, this responsibility naturally falls to that application (and the CPU on which it runs). However, when data is written by current RDMA protocols, no such semantic is provided. As a result, upper layer stacks, and the target CPU, must be invoked to perform it, adding overhead and latency that is now highly undesirable.
When such devices are deployed as the remote server, or target, storage, and when such a persistence can be requested and guaranteed remotely, a new transfer model can be considered. Instead of relying on the server, or target, to perform requested processing and to reply after the data is persistently stored, it becomes desirable for the client, or initiator, to perform these operations itself. By altering the transfer models to support a "push mode", that is, by allowing the requestor to push data with RDMA Write and subsequently make it persistent, a full round trip can be eliminated from the operation. Additionally, the signaling, and processing overheads at the remote peer (server or target) can be eliminated. This becomes an extremely compelling latency advantage.
In DDP (RFC5041), data is considered "placed" when it is submitted by the RNIC to the system. This operation is commonly an i/o bus write, e.g. via PCI. The submission is ordered, but there is no confirmation or necessary guarantee that the data has yet reached its destination, nor become visible to other devices in the system. The data will eventually do so, but possibly at a later time. The act of "delivery", on the other hand, offers a stronger semantic, guaranteeing not only that prior operations have been executed, but also that any data is in a consistent and visible state. Generally, however, such "delivery" requires raising a completion event, necessarily involving the host CPU. This is a relatively expensive, and latency-bound, operation. Some systems perform "DMA snooping" to provide a somewhat higher guarantee of visibility after delivery and without CPU intervention, but others do not. The RDMA requirements remain the same in either case; therefore, upper layers can make no broad assumption. Such platform behaviors, in any case, do not address persistence.
The extensions in this document primarily address a new "flush to persistence" RDMA operation. This operation, when invoked by a connected remote RDMA peer, can be used to request that previously-written data be moved into the persistent storage domain. This may be a simple flush to a memory cell, or it may require movement across one or more busses within the target platform, followed by an explicit persistence operation. Such matters are beyond the scope of this specification, which provides only the mechanism to request the operation, and to signal its successful completion.
In a similar vein, many applications desire to achieve visibility of remotely-provided data, and to do so with minimum latency. One example of such applications is "network shared memory", where publish-subscribe access to network-accessible buffers is shared by multiple peers, possibly from applications on the platform hosting the buffers, and others via network connection. There may therefore be multiple local devices accessing the buffer - for example, CPUs, and other RNICs. The topology of the hosting platform may be complex, with multiple i/o, memory, and interconnect busses, requiring multiple intervening steps to process arriving data.
To address this, the extension additionally provides a "flush to global visibility", which requires the RNIC to perform platform-dependent processing in order to guarantee that the contents of a specific range are visible for all devices that access them. On certain highly-consistent platforms, this may be provided natively. On others, it may require platform-specific processing, to flush data from volatile caches, invalidate stale cached data from others, and to empty queued pending operations. Ideally, but not universally, this processing will take place without CPU intervention. With a global visibility guarantee, network shared memory and similar applications will be assured of broader compatibility and lower latency across all hardware platforms.
Subsequently, many applications will seek to obtain a guarantee that the integrity of the data has been preserved after it has been flushed to a persistent or globally visible state. This guarantee may be enforced at any time. Unlike traditional block-based storage, the data provided by RDMA is neither structured nor segmented, and is therefore not self-describing with respect to integrity. Only the originator of the data, or an upper layer, is in possession of that information. Applications requiring such guarantees may include filesystem or database logwriters, replication agents, etc.
To provide an additional integrity guarantee, a new operation is provided by the extension, which will calculate, and optionally compare an integrity value for an arbitrary region. The operation is ordered with respect to preceding and subsequent operations, allowing for a request pipeline without "bubbles" - roundtrip delays to ascertain success or failure.
Finally, once data has been transmitted and directly placed by RDMA, flushed to its final state, and its integrity verified, applications will seek to commit the result with a transaction semantic. The previous application examples apply here; logwriters and replication are key, and both are highly latency- and integrity-sensitive. They desire a pipelined transaction marker which is placed atomically to indicate the validity of the preceding operations. They may require that the data be in a persistent and/or globally visible state before placing this marker.
Together the above discussion argues for a new "one sided" transfer model supporting extended remote placement guarantees, provided by the RDMA transport, and used directly by upper layers on a data source, to control persistent storage of data on a remote data sink without requiring its remote interaction. Existing, or new, upper layers can use such a model in several ways, and evolutionary steps to support persistence guarantees without required protocol changes are explored in the remainder of this document.
Note that it is intended that the requirements and concepts of these extensions can be applied to any similar RDMA protocol, and that a compatible model can be applied broadly.
The fundamental new requirement for extending RDMA protocols is to define the property of persistence. This new property is to be expressed by new operations to extend Placement as defined in existing RDMA protocols. The RFC5040 protocols specify that Placement means that the data is visible consistently within a platform-defined domain on which the buffer resides, and to remote peers across the network via RDMA to an adapter within the domain. In modern hardware designs, this buffer can reside in memory, or also in cache, if that cache is part of the hardware consistency domain. Many designs use such caches extensively to improve performance of local access.
Persistence, by contrast, requires that the buffer contents be preserved across catastrophic failures. While it is possible for caches to be persistent, they are typically not, or they provide the persistence guarantee for a limited period of time, for example, while backup power is applied. Efficient designs, in fact, lead most implementations to simply make them volatile. In these designs, an explicit flush operation (writing dirty data from caches), often followed by an explicit commit (ensuring the data has reached its destination and is in a persistent state), is required to provide this guarantee. In some platforms, these operations may be combined.
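For illustration only, the following minimal C sketch shows the local CPU form of this flush-and-commit sequence, assuming an x86 platform with the CLFLUSHOPT instruction; other platforms provide analogous primitives, and the function name is purely illustrative.

   #include <immintrin.h>
   #include <stdint.h>
   #include <stddef.h>

   #define CACHELINE 64

   /* Write back any dirty cache lines covering [buf, buf+len) toward
    * the persistence domain ("flush"), then fence so that all of the
    * flushes are known to have completed ("commit"). */
   void flush_to_persistence(const void *buf, size_t len)
   {
       uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHELINE - 1);
       uintptr_t end = (uintptr_t)buf + len;

       for (; p < end; p += CACHELINE)
           _mm_clflushopt((void *)p);   /* flush one cache line  */
       _mm_sfence();                    /* commit: order flushes */
   }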
For the RDMA protocol to remotely provide such guarantees, an extension is required. Note that this does not imply support for persistence or global visibility by the RDMA hardware implementation itself; it is entirely acceptable for the RDMA implementation to request these from another subsystem, for example, by requesting that the CPU perform the flush and commit, or that the destination memory device do so. But, in an ideal implementation, the RDMA implementation will be able to act as a master and provide these services without further work requests local to the data sink. Note, it is possible that different buffers will require different processing, for example one buffer may reside in persistent memory, while another may place its blocks in a storage device. Many such memory-addressable designs are entering the market, from NVDIMM to NVMe and even to SSDs and hard drives.
Therefore, any local memory registration primitive will additionally be enhanced to specify new optional placement attributes, along with any local information required to achieve them. These attributes do not explicitly traverse the network - as with existing local memory registration, the region is fully described by a { STag, Tagged offset, length } descriptor, and aspects such as the local physical address, memory type, protection (remote read, remote write, protection key), etc. are not instantiated in the protocol. Indeed, each local RDMA implementation maintains these and strictly performs processing based on them, but they are not exposed to the peer. Such considerations are discussed in the security model [RDMAP Security].
Note, additionally, that because such attributes are expressed only as optional properties of each region, it is possible to create multiple regions referring to the same physical segment, each with its own combination of attributes, in order to enable efficient processing. Processing of writes to regions marked as persistent, globally visible, or neither ("ordinary" memory) may be optimized appropriately. For example, such memory can be registered multiple times, yielding multiple different Steering Tags which nonetheless merge data in the underlying memory. This can be used by upper layers to enable bulk-type processing with low overhead, by assigning specific attributes through the choice of Steering Tag.
When the underlying region is marked as persistent, the placement of data into persistence is guaranteed only after a successful RDMA Flush directed to the Steering Tag which holds the persistent attribute (i.e. any volatile buffering between the network and the underlying storage has been flushed, and the appropriate platform- and device-specific steps have been performed).
To enable the maximum generality, the RDMA Flush operation is specified to act on a set of bytes in a region, specified by a standard RDMA { STag, Tagged offset, length } descriptor. It is required that each byte of the specified segment be in the requested state before the response to the Flush is generated. However, depending on the implementation, other bytes in the region, or in other regions, may be acted upon as part of processing any RDMA Flush. In fact, any data in any buffer destined for persistent storage may become persistent at any time, even if not requested explicitly. For example, the host system may flush cache entries due to cache pressure, or as part of platform housekeeping activities. Or, a simple and stateless approach to flushing a specific range might be for all data to be flushed and made persistent, system-wide. A possibly more efficient implementation might track previously written bytes, or blocks with "dirty" bytes, and flush only those to persistence. Either result provides the required guarantee.
The RDMA Flush operation provides a response but does not return a status; instead, a failure results in an RDMA Terminate event. A region permission check is performed first, and may fail prior to any attempt to process data. The RDMA Flush operation may fail to make the data persistent, perhaps due to a hardware failure, or a change in device capability (device read-only, device wear, etc). The device itself may support an integrity check, similar to modern error checking and correction (ECC) memory or media error detection on hard drive surfaces, which may signal failure. Or, the request may exceed device limits in size, or encounter a transient condition such as temporary media failure. The behavior of the device itself is beyond the scope of this specification.
Because the RDMA Flush involves processing on the local platform and the actual storage device, in addition to being ordered with certain other RDMA operations, it is expected to take a certain time to be performed. For this reason, the operation is required to be defined as a "queued" operation on the RDMA device, and therefore also the protocol. The RDMA protocol supports RDMA Read (RFC5040) and Atomic (RFC7306) in such a fashion. The iWARP family defines a "queue number" with queue-specific processing that is naturally suited for this. Queuing provides a convenient means for supporting ordering among other operations, and for flow control. Flow control for RDMA Reads and Atomics on any given Queue Pair share incoming and outgoing crediting depths ("IRD/ORD"); operations in this specification share these values and do not define their own separate values.
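Informally, the shared crediting can be pictured with the following requester-side C sketch; the structure and field names are illustrative assumptions, not part of any defined interface.

   /* Illustrative requester-side accounting: RDMA Read, Atomic, Flush,
    * Verify and Atomic Write Requests all draw on the same ORD credit
    * pool, and a request may be posted only when a credit is free. */
   struct ord_credits {
       unsigned outstanding;   /* QN=1 requests awaiting a response */
       unsigned ord_limit;     /* negotiated outbound (ORD) depth   */
   };

   static int can_post_queued_request(const struct ord_credits *c)
   {
       return c->outstanding < c->ord_limit;
   }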
The extension does not include a "RDMA Write to persistence", that is, a modifier on the existing RDMA Write operation. While it might seem a logical approach, several issues become apparent:
For these reasons, it is deemed a non-requirement to extend the existing RDMA Write operation.
Similarly, the extension does not consider the use of RDMA Read to implement Flush. Historically, an RDMA Read has been used by applications to ensure that previously written data has been processed by the responding RNIC and has been submitted for ordered Placement. However, this is inadequate for implementing the required RDMA Flush:
Again, for these reasons, it is deemed a non-requirement to reuse or extend the existing RDMA Read operation.
Therefore, no changes to existing specified RDMA operations are proposed, and the protocol is unchanged if the extensions are not invoked.
The persistence of data is a key property by which applications implement transactional behavior. Transactional applications, such as databases and log-based filesystems, among many others, implement a "two phase commit" wherein a write is made durable, and only upon success, a validity indicator for the written data is set. Such semantics are challenging to provide over an RDMA fabric, as it exists today. The RDMA Write operation does not generate an acknowledgement at the RDMA layers. And, even when an RDMA Write is delivered, if the destination region is persistent, its data can be made persistent at any time, even before a Flush is requested. Out-of-order DDP processing, packet fragmentation, and other matters of scheduling transfers can introduce partial delivery and ordering differences. If a region is made persistent, or even globally visible, before such sequences are complete, significant application-layer inconsistencies can result. Therefore, applications may require fine-grained control over the placement of bytes. In current RDMA storage solutions, these semantics are implemented in upper layers, potentially with additional upper layer message signaling, and corresponding roundtrips and blocking behaviors.
In addition to controlling placement of bytes, the ordering of such placement can be important. By providing an ordered relationship among write and flush operations, a basic transaction scenario can be constructed, in a way which can function with equal semantics both locally and remotely. In a "log-based" scenario, for example, a relatively large segment (log "record") is placed, and made durable. Once persistence of the segment is assured, a second small segment (log "pointer") is written, and optionally also made persistent. The visibility of the second segment is used to imply the validity, and persistence, of the first. Any sequence of such log-operation pairs can thereby always have a single valid state. In case of failure, the resulting string (log) of transactions can therefore be recovered up to and including the final state.
Such semantics are typically a challenge to implement on general purpose hardware platforms, and a variety of application approaches have become common. Generally, they employ a small, well-aligned atom of storage for the second segment (the one used for validity). For example, an integer or pointer, aligned to natural memory address boundaries and CPU and other cache attributes, is stored using instructions which provide for atomic placement. Existing RDMA protocols, however, provide no such capability.
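As an informative sketch, the local form of this pattern might look as follows in C on such a platform, reusing the hypothetical flush_to_persistence() helper shown earlier; the structure layout and sizes are illustrative assumptions only.

   #include <stdatomic.h>
   #include <stdint.h>
   #include <string.h>

   void flush_to_persistence(const void *buf, size_t len); /* see above */

   /* Illustrative log layout: a large record plus a small, naturally
    * aligned pointer whose visibility implies validity of the record. */
   struct log_area {
       char record[4096];
       _Alignas(8) _Atomic uint64_t tail;   /* the "log pointer" */
   };

   static void log_append(struct log_area *log, const void *rec,
                          size_t len, uint64_t new_tail)
   {
       /* len is assumed to be <= sizeof log->record */
       memcpy(log->record, rec, len);           /* 1. place the record */
       flush_to_persistence(log->record, len);  /* 2. make it durable  */

       /* 3. Only then publish the pointer, atomically and durably. */
       atomic_store(&log->tail, new_tail);
       flush_to_persistence((const void *)&log->tail, sizeof log->tail);
   }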
This document specifies an Atomic Write extension, which, appropriately constrained, can serve to provide similar semantics. A small (64 bit) payload, sent in a request which is ordered with respect to prior RDMA Flush operations on the same stream and targeted at a segment which is aligned such that it can be placed in a single hardware operation, can be used to satisfy the previously described scenario. Note that the visibility of this payload can also serve as an indication that all prior operations have succeeded, enabling a highly efficient application-visible memory semaphore.
An additional matter remains with persistence - the integrity of the persistent data. Typically, storage stacks such as filesystems, media approaches such as SCSI T10 DIF, or filesystem integrity checks such as ZFS provide for block- or file-level protection of data at rest on storage devices. With RDMA protocols and physical memory, no such stacks are present. And, to add such support would introduce CPU processing and its inherent latency, counter to the goals of the remote storage approach. Requiring the peer to verify by remotely reading the data is prohibitive in both bandwidth and latency, and without additional mechanisms to ensure the actual stored data is read (and not a copy in some volatile cache), cannot provide the necessary result.
To address this, an integrity operation is required. The integrity check is initiated by the upper layer or application, which optionally computes the expected hash of a given segment of arbitrary size and sends the hash via an RDMA Verify operation targeting the RDMA segment on the responder; the responder calculates, and optionally verifies, the hash on the indicated data, bypassing any volatile copies remaining in caches. The responder responds with its computed hash value, or optionally terminates the connection with an appropriate error status upon mismatch. Specifying this optional termination behavior enables a transaction to be sent as WRITE-FLUSH-VERIFY-ATOMICWRITE, without any pipeline bubble. The result (carried by the subsequently ordered Atomic Write) will not be committed as valid if any prior operation is terminated, and in this case, recovery can be initiated by the requestor immediately from the point of failure. On the other hand, an errorless "scrub" can be implemented without the optional termination behavior, by providing no value for the expected hash. The responder will return the computed hash of the contents.
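The resulting pipeline might be posted as in the following requester-side C sketch; the post_* helpers, flag names, and handle types are hypothetical placeholders for whatever local interface ("verbs") eventually exposes these operations, and are not defined by this document.

   #include <stdint.h>
   #include <stddef.h>

   struct qp;                           /* opaque connection handle */
   struct remote_seg { uint32_t stag; uint64_t offset; uint32_t length; };

   /* Hypothetical local-interface placeholders. */
   enum { FLUSH_PERSISTENT = 0x1, FLUSH_GLOBAL_VISIBILITY = 0x2 };
   void post_rdma_write(struct qp *, struct remote_seg,
                        const void *, size_t);
   void post_rdma_flush(struct qp *, struct remote_seg, unsigned flags);
   void post_rdma_verify(struct qp *, struct remote_seg,
                         const void *hash, size_t hash_len);
   void post_atomic_write(struct qp *, struct remote_seg, uint64_t);
   int  wait_for_completions(struct qp *, unsigned count);

   /* The four requests are posted back-to-back; responder-side
    * ordering ensures the Atomic Write is Placed only if the Write,
    * Flush and Verify all succeeded.  A failure terminates the
    * connection and the validity marker is never placed. */
   int commit_record(struct qp *qp, const void *rec, size_t len,
                     uint64_t hash, struct remote_seg log_rec,
                     struct remote_seg log_ptr, uint64_t new_tail)
   {
       post_rdma_write(qp, log_rec, rec, len);            /* data       */
       post_rdma_flush(qp, log_rec, FLUSH_PERSISTENT);    /* durability */
       post_rdma_verify(qp, log_rec, &hash, sizeof hash); /* integrity  */
       post_atomic_write(qp, log_ptr, new_tail);          /* validity   */
       return wait_for_completions(qp, 4);    /* one wait, no bubbles */
   }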
The hash algorithm is not specified by the RDMA protocol, instead it is left to the upper layer to select an appropriate choice based upon the strength, security, length, support by the RNIC, and other criteria. The size of the resulting hash is therefore also not specified by the RDMA protocol, but is dictated by the hash algorithm. The RDMA protocol becomes simply a transport for exchanging the values.
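For example, an upper layer might select a simple 64-bit FNV-1a hash, shown below purely to illustrate a ULP-chosen algorithm whose 8-byte result is carried opaquely by the protocol; a production ULP would likely choose a stronger, possibly keyed, hash.

   #include <stdint.h>
   #include <stddef.h>

   /* FNV-1a, 64-bit: one possible ULP-selected integrity hash. */
   static uint64_t ulp_hash(const void *buf, size_t len)
   {
       const unsigned char *p = buf;
       uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */

       while (len--) {
           h ^= *p++;
           h *= 0x100000001b3ULL;             /* FNV prime */
       }
       return h;
   }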
It should be noted that the design of the operation, passing of the hash value from requestor to responder (instead of, for example, computing it at the responder and simply returning it), allows both peers to determine immediately whether the segment is considered valid, permitting local processing by both peers if that is not the case. For example, a known-bad segment can be immediately marked as such ("poisoned") by the responder platform, requiring recovery before permitting access. [cf ACPI, JEDEC, SNIA NVMP specifications]
The new operations imply new access methods ("verbs") to local persistent memory which backs registrations. Registrations of memory which support persistence will follow all existing practices to ensure permission-based remote access. The RDMA protocols do not expose these permissions on the wire, instead they are contained in local memory registration semantics. Existing attributes are Remote Read and Remote Write, which are granted individually through local registration on the machine. If an RDMA Read or RDMA Write operation arrives which targets a segment without the appropriate attribute, the connection is terminated.
In support of the new operations, new memory attributes are needed. For RDMA Flush, two "Flushable" attributes provide permission to invoke the operation on memory in the region for persistence and/or global visibility. When registering, along with the attribute, additional local information can be provided to the RDMA layer such as the type of memory, the necessary processing to make its contents persistent, etc. If the attribute is requested for memory which cannot be persisted, this also allows the local provider to return an error to the upper layer, so that the upper layer does not provide the region to the remote peer.
For RDMA Verify, the "Verifiable" attribute provides permission to compute the hash of memory in the region. Again, along with the attribute, additional information such as the hash algorithm for the region is provided to the local operation. If the attribute is requested for non-persistent memory, or if the hash algorithm is not available, the local provider can return an error to the upper layer. In the case of success, the upper layer can exchange the necessary information with the remote peer. Note that, as a result, the algorithm is not identified by the on-the-wire operation. Establishing the choice of hash for each region is done by the local consumer, and each hash result is merely transported by the RDMA protocol. Memory can be registered under multiple regions if differing hashes are required; for example, unique keys may be provisioned to implement secure hashing. Also note that, for certain "reversible" hash algorithms, this may allow peers to effectively read the memory; therefore, the local platform may require additional read permissions to be associated with the Verifiable permission when such algorithms are selected.
The Atomic Write operation requires no new attributes; however, it does require the "Remote Write" attribute on the target region, as is true for any remotely requested write. If the Atomic Write additionally targets a Flushable region, the RDMA Flush is performed separately. It is generally not possible to achieve persistence atomically with placement, even locally.
The extensions in this document fall into two categories:
These categories are described, and may be implemented, separately.
The wire-related aspects of the extensions are discussed in this section. This document defines the following new RDMA operations.
For reference, Figure 1 depicts the format of the DDP Control and RDMAP Control Fields, in the style and convention of RFC5040 and RFC7306:
The DDP Control Field consists of the T (Tagged), L (Last), Resrv, and DV (DDP protocol Version) fields, as defined in RFC5041. The RDMAP Control Field consists of the RV (RDMA Version), Rsv, and Opcode fields, as defined in RFC5040. No change or extension is made to these fields by this specification.
This specification adds values for the RDMA Opcode field to those specified in RFC5040. Table 1 defines the new values of the RDMA Opcode field that are used for the RDMA Messages defined in this specification.
As shown in Table 1, STag (Steering Tag) and Tagged Offset are valid only for certain RDMA Messages defined in this specification. Table 1 also shows the appropriate Queue Number for each Opcode.
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |T|L| Resrv | DV| RV|R| Opcode  |
 | | |       |   |   |s|         |
 | | |       |   |   |v|         |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                       Invalidate STag                         |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
DDP Control and RDMAP Control Fields
All RDMA Messages defined in this specification MUST carry the following values:
Note: N/A in the table below means Not Applicable
 -------+------------+-------+------+-------+-----------+--------------
 RDMA   | Message    | Tagged| STag | Queue | Invalidate| Message
 Opcode | Type       | Flag  | and  | Number| STag      | Length
        |            |       | TO   |       |           | Communicated
        |            |       |      |       |           | between DDP
        |            |       |      |       |           | and RDMAP
 -------+------------+-------+------+-------+-----------+--------------
 01100b | RDMA Flush |   0   | N/A  |   1   |    opt    | Yes
        | Request    |       |      |       |           |
 -------+------------+-------+------+-------+-----------+--------------
 01101b | RDMA Flush |   0   | N/A  |   3   |    N/A    | No
        | Response   |       |      |       |           |
 -------+------------+-------+------+-------+-----------+--------------
 01110b | RDMA Verify|   0   | N/A  |   1   |    opt    | Yes
        | Request    |       |      |       |           |
 -------+------------+-------+------+-------+-----------+--------------
 01111b | RDMA Verify|   0   | N/A  |   3   |    N/A    | Yes
        | Response   |       |      |       |           |
 -------+------------+-------+------+-------+-----------+--------------
 10000b | Atomic     |   0   | N/A  |   1   |    opt    | Yes
        | Write      |       |      |       |           |
        | Request    |       |      |       |           |
 -------+------------+-------+------+-------+-----------+--------------
 10001b | Atomic     |   0   | N/A  |   3   |    N/A    | No
        | Write      |       |      |       |           |
        | Response   |       |      |       |           |
 -------+------------+-------+------+-------+-----------+--------------
Additional RDMA Usage of DDP Fields
This extension adds RDMAP use of Queue Number 1 for Untagged Buffers for issuing RDMA Flush, RDMA Verify and Atomic Write Requests, and use of Queue Number 3 for Untagged Buffers for tracking the respective Responses.
All other DDP and RDMAP Control Fields are set as described in RFC5040 and RFC7306.
Table 3 defines which RDMA Headers are used on each new RDMA Message and which new RDMA Messages are allowed to carry ULP payload.
 -------+------------+-------------------+-------------------------
 RDMA   | Message    | RDMA Header Used  | ULP Message allowed in
 Message| Type       |                   | the RDMA Message
 OpCode |            |                   |
 -------+------------+-------------------+-------------------------
 01100b | RDMA Flush | None              | No
        | Request    |                   |
 -------+------------+-------------------+-------------------------
 01101b | RDMA Flush | None              | No
        | Response   |                   |
 -------+------------+-------------------+-------------------------
 01110b | RDMA Verify| None              | No
        | Request    |                   |
 -------+------------+-------------------+-------------------------
 01111b | RDMA Verify| None              | No
        | Response   |                   |
 -------+------------+-------------------+-------------------------
 10000b | Atomic     | None              | No
        | Write      |                   |
        | Request    |                   |
 -------+------------+-------------------+-------------------------
 10001b | Atomic     | None              | No
        | Write      |                   |
        | Response   |                   |
 -------+------------+-------------------+-------------------------
RDMA Message Definitions
The RDMA Flush operation requests that all bytes in a specified region are to be made persistent and/or globally visible, under control of specified flags. As specified in section 4 its operation is ordered after the successful completion of any previous requested RDMA Write or certain other operations. The response is generated after the region has reached its specified state. The implementation MUST fail the operation and send a terminate message if the RDMA Flush cannot be performed, or has encountered an error.
The RDMA Flush operation MUST NOT be completed by the data sink until all data has attained the requested state. Achieving persistence may require programming and/or flushing of device buffers, while achieving global visibility may require flushing of cached buffers across the entire platform interconnect. In no event are persistence and global visibility achieved atomically; one may precede the other, and either may complete at any time. The Atomic Write operation may be used by an upper layer consumer to indicate that either or both dispositions are available after completion of the RDMA Flush, in addition to other approaches.
The RDMA Flush Request Message makes use of the DDP Untagged Buffer Model. RDMA Flush Request messages MUST use the same Queue Number as RDMA Read Requests and RDMA Extensions Atomic Operation Requests (QN=1). Reusing the same queue number for RDMA Flush Requests allows the operations to reuse the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow control) as that defined for RDMA Read Requests.
The RDMA Flush Request Message carries a payload that describes the ULP Buffer address in the Responder's memory. The following figure depicts the Flush Request that is used for all RDMA Flush Request Messages:
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Data Sink STag                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Data Sink Length                       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Data Sink Tagged Offset                    |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                  Flush Disposition Flags                  +G+P|
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Flush Request
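An informative C rendering of this payload follows; the structure and function names are illustrative, and the flag bit positions assume, as drawn above, that P occupies the least significant bit of the final word and G the next.

   #include <stdint.h>
   #include <string.h>
   #include <arpa/inet.h>

   /* Flush Request payload: 20 bytes on the wire, big-endian. */
   struct flush_request {
       uint32_t sink_stag;
       uint32_t sink_length;
       uint64_t sink_tagged_offset;
       uint32_t disposition_flags;              /* ... +G+P */
   };

   #define FLUSH_FLAG_PERSISTENCE       0x1u    /* P */
   #define FLUSH_FLAG_GLOBAL_VISIBILITY 0x2u    /* G */

   static size_t flush_request_pack(const struct flush_request *req,
                                    unsigned char out[20])
   {
       uint32_t w;
       uint64_t to = req->sink_tagged_offset;

       w = htonl(req->sink_stag);          memcpy(out + 0,  &w, 4);
       w = htonl(req->sink_length);        memcpy(out + 4,  &w, 4);
       w = htonl((uint32_t)(to >> 32));    memcpy(out + 8,  &w, 4);
       w = htonl((uint32_t)to);            memcpy(out + 12, &w, 4);
       w = htonl(req->disposition_flags);  memcpy(out + 16, &w, 4);
       return 20;
   }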
The RDMA Flush Response Message makes use of the DDP Untagged Buffer Model. RDMA Flush Response messages MUST use the same Queue Number as RDMA Extensions Atomic Operation Responses (QN=3). No payload is passed to the DDP layer on Queue Number 3.
Upon successful completion of RDMA Flush processing, an RDMA Flush Response MUST be generated.
If during RDMA Flush processing on the Responder, an error is detected which would result in the requested region to not achieve the requested disposition, the Responder MUST generate a Terminate message. The contents of the Terminate message are defined in Section 5.2.
Ordering and completion rules for RDMA Flush Request are similar to those for an Atomic operation as described in section 5 of RFC7306. The queue number field of the RDMA Flush Request for the DDP layer MUST be 1, and the RDMA Flush Response for the DDP layer MUST be 3.
There are no ordering requirements for the placement of the data, nor are there any requirements for the order in which the data is made globally visible and/or persistent. Data received by prior operations (e.g. RDMA Write) MAY be submitted for placement at any time, and persistence or global visibility MAY occur before the flush is requested. After placement, data MAY become persistent or globally visible at any time, in the course of operation of the persistency management of the storage device, or by other actions resulting in persistence or global visibility.
Any region segment specified by the RDMA Flush operation MUST be made persistent and/or globally visible before successful return of the operation. If RDMA Flush processing is successful on the Responder, meaning the requested bytes of the region are, or have been made persistent and/or globally visible, as requested, the RDMA Flush Response MUST be generated.
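A responder might satisfy these rules along the lines of the following informative C sketch; the region lookup and platform flush helpers are hypothetical placeholders (a real implementation may instead delegate the flush to the CPU or to the memory device), and the flag bits follow the same assumption as the encoding sketch above.

   #include <stdint.h>

   struct region;                     /* hypothetical responder state */
   int   region_check_access(struct region *, uint64_t off,
                             uint32_t len, unsigned flags);
   void *region_local_address(struct region *, uint64_t off);
   void  platform_flush_visible(void *, uint32_t);     /* all devices */
   void  platform_flush_persistent(void *, uint32_t);  /* durable     */
   int   terminate(int reason);
   int   send_flush_response(void);
   enum { REMOTE_PROTECTION_ERROR = 1 };

   /* Every byte of { STag, offset, length } must reach the requested
    * disposition before the Response is generated; flushing a larger
    * range than requested is permitted. */
   int handle_flush(struct region *r, uint64_t offset, uint32_t length,
                    unsigned flags)
   {
       void *addr;

       if (!region_check_access(r, offset, length, flags))
           return terminate(REMOTE_PROTECTION_ERROR);

       addr = region_local_address(r, offset);
       if (flags & 0x2)                              /* G */
           platform_flush_visible(addr, length);
       if (flags & 0x1)                              /* P */
           platform_flush_persistent(addr, length);

       return send_flush_response();
   }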
There are no atomicity guarantees provided on the Responder's node by the RDMA Flush Operation with respect to any other operations. While the Completion of the RDMA Flush Operation ensures that the requested data was placed into, and flushed from the target Tagged Buffer, other operations might have also placed or fetched overlapping data. The upper layer is responsible for arbitrating any shared access.
(Sidebar) It would be useful to make a statement about other RDMA Flush to the target buffer and RDMA Read from the target buffer on the same connection. Use of QN 1 for these operations provides ordering possibilities which imply that they will "work" (see #7 below). NOTE: this does not, however, extend to RDMA Write, which is not queued nor sequenced and therefore does not employ a DDP QN.
The RDMA Verify operation requests that all bytes in a specified region are to be read from the underlying storage and that an integrity hash be calculated. As specified in section 4, its operation is ordered after the successful completion of any previously requested RDMA Write and RDMA Flush, or certain other operations. The implementation MUST fail the operation and send a terminate message if the RDMA Verify cannot be performed, has encountered an error, or if a hash value was provided in the request and the calculated hash does not match. If no condition for a Terminate message is encountered, the response is generated, containing the calculated hash value.
The RDMA Verify Request Message makes use of the DDP Untagged Buffer Model. RDMA Verify Request messages MUST use the same Queue Number as RDMA Read Requests and RDMA Extensions Atomic Operation Requests (QN=1). Reusing the same queue number for RDMA Read and RDMA Flush Requests allows the operations to reuse the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow control) as that defined for those requests.
The RDMA Verify Request Message carries a payload that describes the ULP Buffer address in the Responder's memory. The following figure depicts the Verify Request that is used for all RDMA Verify Request Messages:
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Data Sink STag                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Data Sink Length                       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Data Sink Tagged Offset                    |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                Hash Value (optional, variable)                |
 |                              ...                              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Verify Request
The Verify Response Message makes use of the DDP Untagged Buffer Model. Verify Response messages MUST use the same Queue Number as RDMA Flush Responses (QN=3). The RDMAP layer passes the following payload to the DDP layer on Queue Number 3. The RDMA Verify Response is not sent when a Terminate message is generated, i.e. when the Compare Flag is specified as 1 and a mismatch occurs.
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     Hash Value (variable)                     |
 |                              ...                              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Verify Response
Ordering and completion rules for RDMA Verify Request are similar to those for an Atomic operation as described in section 5 of RFC7306. The queue number field of the RDMA Verify Request for the DDP layer MUST be 1, and the RDMA Verify Response for the DDP layer MUST be 3.
As specified in section 4, RDMA Verify and RDMA Flush are executed by the Data Sink in strict order. Because an RDMA Flush MUST ensure that all bytes are in the specified state before responding, an RDMA Verify that follows it can be assured that it is operating on flushed data. If unflushed data has been sent to the region segment between the operations, and since data may be made persistent and/or globally visible by the Data Sink at any time, the result of any such RDMA Verify is undefined.
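Responder-side processing of a Verify Request might therefore take the following shape (an informative C sketch; the region_hash() helper, which reads from the underlying media rather than any volatile copy, and the other names are hypothetical placeholders).

   #include <stdint.h>
   #include <stddef.h>
   #include <string.h>

   struct region;                      /* hypothetical responder state */
   size_t region_hash(struct region *, uint64_t off, uint32_t len,
                      unsigned char *out);  /* hash of the stored data */
   int    terminate(int reason);
   int    send_verify_response(const unsigned char *hash, size_t len);
   enum { VERIFY_MISMATCH = 2 };

   /* If the Request carried a hash and it differs from the computed
    * value, the connection is terminated and no Response is sent;
    * otherwise the computed hash is returned. */
   int handle_verify(struct region *r, uint64_t off, uint32_t len,
                     const unsigned char *req_hash, size_t req_hash_len)
   {
       unsigned char h[64];           /* sized for the ULP-chosen hash */
       size_t hlen = region_hash(r, off, len, h);

       if (req_hash_len != 0 &&
           (req_hash_len != hlen || memcmp(req_hash, h, hlen) != 0))
           return terminate(VERIFY_MISMATCH);

       return send_verify_response(h, hlen);
   }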
The Atomic Write operation provides a block of data which is placed to a specified region atomically, and as specified in section 4 its placement is ordered after the successful completion of any previous requested RDMA Flush or RDMA Verify. This specified region is constrained in size and alignment to 64-bits at 64-bit alignment, and the implementation MUST fail the operation and send a terminate message if the placement cannot be performed atomically.
The Atomic Write Operation requires the Responder to write a 64-bit value at a ULP Buffer address that is 64-bit aligned in the Responder's memory, in a manner which is Placed in the responder's memory atomically.
The Atomic Write Request Message makes use of the DDP Untagged Buffer Model. Atomic Write Request messages MUST use the same Queue Number as RDMA Read Requests and RDMA Extensions Atomic Operation Requests (QN=1). Reusing the same queue number for RDMA Flush and RDMA Verify Requests allows the operations to reuse the same RDMA infrastructure (e.g. Outbound and Inbound RDMA Read Queue Depth (ORD/IRD) flow control) as that defined for those Requests.
The Atomic Write Request Message carries an Atomic Write Request payload that describes the ULP Buffer address in the Responder's memory, as well as the data to be written. The following figure depicts the Atomic Write Request that is used for all Atomic Write Request Messages:
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Data Sink STag                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Data Sink Length                       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Data Sink Tagged Offset                    |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                              Data                             |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Atomic Write Request
Atomic Write Operations MUST target ULP Buffer addresses that are 64-bit aligned and conform to any other platform restrictions on the Responder system. The write MUST NOT be Placed until all prior RDMA Flush operations, and therefore all other prior operations, have completed successfully.
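The responder-side placement can be pictured with the following informative C sketch; the alignment check and atomic 64-bit store illustrate the requirement, and the function name is purely illustrative.

   #include <stdatomic.h>
   #include <stdint.h>

   /* The target must be 8-byte aligned so that the 64-bit value can be
    * Placed in a single atomic operation; otherwise the caller sends a
    * Terminate message rather than placing any data. */
   int place_atomic_write(void *target, uint64_t value)
   {
       if ((uintptr_t)target & 0x7)
           return -1;                          /* not 64-bit aligned */

       atomic_store_explicit((_Atomic uint64_t *)target, value,
                             memory_order_release);
       return 0;
   }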
If an Atomic Write Operation is attempted on a target ULP Buffer address that is not 64-bit aligned, or due to alignment, size, or other platform restrictions cannot be performed atomically:
The Atomic Write Response Message makes use of the DDP Untagged Buffer Model. Atomic Write Response messages MUST use the same Queue Number as RDMA Flush Responses (QN=3). The RDMAP layer passes no payload to the DDP layer on Queue Number 3.
As for RFC7306, explicit negotiation by the RDMAP peers of the extensions covered by this document is not required. Instead, it is RECOMMENDED that RDMA applications and/or ULPs negotiate any use of these extensions at the application or ULP level. The definition of such application-specific mechanisms is outside the scope of this specification. For backward compatibility, existing applications and/or ULPs should not assume that these extensions are supported.
In the absence of application-specific negotiation of the features defined within this specification, the new operations can be attempted, and reported errors can be used to determine a remote peer's capabilities. In the case of RDMA Flush and Atomic Write, an operation to a previously Advertised buffer with remote write permission can be used to determine the peer's support. A Remote Operation Error or Unexpected OpCode error will be reported by the remote peer if the Operation is not supported by the remote peer. For RDMA Verify, such an operation may target a buffer with remote read permission.
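Such probing might be structured as in the following sketch, reusing the hypothetical post_rdma_flush() and wait_for_completions() placeholders from the earlier pipeline sketch; the completion status codes are likewise illustrative assumptions.

   /* Hypothetical completion status codes for the probe below. */
   enum { ERR_UNEXPECTED_OPCODE = -2, ERR_REMOTE_OPERATION = -3 };

   /* Attempt an RDMA Flush against a buffer previously Advertised with
    * remote write permission; an error completion indicates that the
    * responder does not implement the extension. */
   int peer_supports_flush(struct qp *qp, struct remote_seg probe_buf)
   {
       post_rdma_flush(qp, probe_buf, FLUSH_PERSISTENT);

       switch (wait_for_completions(qp, 1)) {
       case 0:
           return 1;                 /* extension supported             */
       case ERR_UNEXPECTED_OPCODE:
       case ERR_REMOTE_OPERATION:
           return 0;                 /* unsupported: use existing paths */
       default:
           return -1;                /* some other failure              */
       }
   }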
This section discusses memory registration, new memory and protection attributes, and applicability to both remote operations and "local" ones (receives). Because this section does not specify any wire-visible semantic, it is entirely informative.
New platform-specific attributes to RDMA registration allow them to be processed at the server *only*, without client knowledge or protocol exposure. Because no client knowledge is required, this is a robust design which ensures future interoperability.
New local PMEM memory registration example:
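The following C sketch illustrates what such a registration might look like; the interface, attribute names, and parameters are hypothetical and are not defined by this document, which deliberately leaves the local interface to the implementation.

   #include <stddef.h>

   struct pd;                         /* protection domain (opaque)      */
   struct stag;                       /* resulting Steering Tag (opaque) */

   enum mem_type   { MEM_DRAM, MEM_PMEM_NVDIMM, MEM_PMEM_BLOCK };
   enum mem_perm   { PERM_REMOTE_READ          = 0x01,
                     PERM_REMOTE_WRITE         = 0x02,
                     PERM_FLUSHABLE_PERSISTENT = 0x04,
                     PERM_FLUSHABLE_GLOBAL     = 0x08,
                     PERM_VERIFIABLE           = 0x10 };
   enum flush_mode { MODE_CPU_FLUSH, MODE_DEVICE_FLUSH };

   /* Hypothetical registration call taking Perm, Type and Mode. */
   struct stag *reg_mr_ex(struct pd *pd, void *addr, size_t length,
                          unsigned perm, enum mem_type type,
                          enum flush_mode mode, unsigned hash_alg);

   /* Register a persistent-memory log so that the remote peer may
    * Write it, Flush it to persistence, and Verify it. */
   struct stag *register_pmem_log(struct pd *pd, void *base, size_t len)
   {
       return reg_mr_ex(pd, base, len,
                        PERM_REMOTE_WRITE | PERM_FLUSHABLE_PERSISTENT |
                        PERM_VERIFIABLE,
                        MEM_PMEM_NVDIMM, MODE_CPU_FLUSH,
                        /* hash_alg: ULP's choice */ 1);
   }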
The STag is processed in the receiving RNIC during each RDMA operation to the specified region, under control of the original Perm, Type and Mode attributes.
To be discussed: the interactions with the new operations when the upper layer provides Completions to the responder (e.g. messages via receive, or immediate data via RDMA Write). This is a natural conclusion of the ordering rules, but should be made explicit.
Ordering of operations is critical: such RDMA Writes cannot be allowed to "pass" persistence or global visibility, and an RDMA Flush may not begin until prior RDMA Writes to the flushed region are accounted for. Therefore, ULP protocol implications may also exist.
Writethrough behavior on persistent regions, and reasons for same. Consider recommending a local writethrough behavior on any persistent region, to support a nonblocking "hurry-up" that avoids future stalls on a subsequent cache flush, prior to a flush. It would also enhance storage integrity. Selection of this behavior would be driven from memory registration, so that the RNIC may "look up" the desired behavior in its TPT.
A PCI extension to support Flush would allow the RNIC to provide persistence and/or global visibility directly and efficiently: to memory, CPU, PCI root, PM device, PCIe device, and so on. This avoids CPU interaction and supports a strong data consistency model, performing the equivalent of CLFLUSHOPT over a region list, or some other flow tag. Alternatively, if the RNIC participates in the platform consistency domain, on the memory bus or within the CPU complex, other possibilities exist.
Also consider additional "integrity check" behavior (hash algorithm) specified per-region. If so, providing it as a registration parameter enables fine-grained control, and enables storing it in per-region RNIC state, making its processing optional and straightforward.
A similar approach is applicable to providing a security key for encrypting/decrypting access on a per-region basis, without protocol exposure. [SDC2017 presentation]
Any other per-region processing to be explored.
The table in this section specifies the ordering relationships for the operations in this specification and in those it extends, from the standpoint of the Requester. Note that in the table, Send Operation includes Send, Send with Invalidate, Send with Solicited Event, and Send with Solicited Event and Invalidate. Also note that Immediate Operation includes Immediate Data and Immediate Data with Solicited Event.
Note: N/A in the table below means Not Applicable
 ----------+------------+-------------+-------------+----------------
 First     | Second     | Placement   | Placement   | Ordering
 Operation | Operation  | Guarantee at| Guarantee at| Guarantee at
           |            | Remote Peer | Local Peer  | Remote Peer
 ----------+------------+-------------+-------------+----------------
 RDMA Flush| TODO       | No Placement| N/A         | Completed in
           |            | Guarantee   |             | Order
           |            | between Foo |             |
           |            | and Bar     |             |
 ----------+------------+-------------+-------------+----------------
 TODO      | RDMA Flush | Placement   | N/A         | TODO
           |            | Guarantee   |             |
           |            | between Foo |             |
           |            | and Bar     |             |
 ----------+------------+-------------+-------------+----------------
 TODO      | TODO       | Etc         | Etc         | Etc
 ----------+------------+-------------+-------------+----------------
Ordering of Operations
In addition to error processing described in section 7 of RFC5040 and section 8 of RFC7306, the following rules apply for the new RDMA Messages defined in this specification.
The Local Peer MUST send a Terminate Message for each of the following cases:
On incoming RDMA Flush and RDMA Verify Requests, the following MUST be validated:
The following additional validation MUST be performed:
This document requests that IANA assign the following new operation codes in the "RDMAP Message Operation Codes" registry defined in section 3.4 of [RFC6580].
Note to RFC Editor: this section may be edited and updated prior to publication as an RFC.
This document specifies extensions to the RDMA Protocol specification in RFC5040 and RDMA Protocol Extensions in RFC7306, and as such the Security Considerations discussed in Section 8 of RFC5040 and Section 9 of RFC7306 apply. In particular, all operations use ULP Buffer addresses for the Remote Peer Buffer addressing used in RFC5040 as required by the security model described in [RDMAP Security].
If the "push mode" transfer model discussed in section 2 is implemented by upper layers, new security considerations will be potentially introduced in those protocols, particularly on the server, or target, if the new memory regions are not carefully protected. Therefore, for them to take full advantage of the extension defined in this document, additional security design is required in the implementation of those upper layers. The facilities of RFC5042 can provide the basis for any such design.
In addition to protection, in "push mode" the server or target will expose memory resources to the peer for potentially extended periods, and will allow the peer to perform remote requests which will necessarily consume shared resources, e.g. memory bandwidth, power, and memory itself. It is recommended that the upper layers provide a means to gracefully adjust such resources, for example using upper layer callbacks, without resorting to revoking RDMA permissions, which would summarily close connections. With the initiator applications relying on the protocol extension itself for managing their required persistence and/or global visibility, the lack of such an approach would lead to frequent recovery in low-resource situations, potentially opening a new threat to such applications.
This section will be deleted in a future document revision.
Complete the discussion in section 3.2 and its subsections, Local Extension semantics.
Complete the Ordering table in section 4. Carefully include discussion of the order of "start of execution" as well as completion, which are somewhat more involved than prior RDMA operation ordering.
RDMA Flush "selectivity", to provide default flush semantics with broader scope than region-based. If specified, a flag to request that all prior write operations on the issuing Queue Pair be flushed with the requested disposition(s). This flag may simplify upper layer processing, and would allow regions larger than 4GB-1 byte to be flushed in a single operation. The STag, Offset and Length will be ignored in this case. It is to-be-determined how to extend the RDMA security model to protect other regions associated with this Queue Pair from unintentional or unauthorized flush.
The authors wish to thank Jim Pinkerton, who contributed to an earlier version of the specification, and Brian Hausauer and Kobby Carmona, who have provided significant review and valuable comments.
[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC5040]  Recio, R., Metzler, B., Culley, P., Hilland, J. and D. Garcia, "A Remote Direct Memory Access Protocol Specification", RFC 5040, DOI 10.17487/RFC5040, October 2007.
[RFC5041]  Shah, H., Pinkerton, J., Recio, R. and P. Culley, "Direct Data Placement over Reliable Transports", RFC 5041, DOI 10.17487/RFC5041, October 2007.
[RFC5042]  Pinkerton, J. and E. Deleganes, "Direct Data Placement Protocol (DDP) / Remote Direct Memory Access Protocol (RDMAP) Security", RFC 5042, DOI 10.17487/RFC5042, October 2007.
[RFC6580]  Ko, M. and D. Black, "IANA Registries for the Remote Direct Data Placement (RDDP) Protocols", RFC 6580, DOI 10.17487/RFC6580, April 2012.
[RFC7306]  Shah, H., Marti, F., Noureddine, W., Eiriksson, A. and R. Sharp, "Remote Direct Memory Access (RDMA) Protocol Extensions", RFC 7306, DOI 10.17487/RFC7306, June 2014.
This appendix is for information only and is NOT part of the standard. It simply depicts the DDP Segment format for each of the RDMA Messages defined in this specification.
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  DDP Control  | RDMA Control  |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                      Reserved (Not Used)                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |             DDP (Flush Request) Queue Number (1)              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |          DDP (Flush Request) Message Sequence Number          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Data Sink STag                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Data Sink Length                       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Data Sink Tagged Offset                    |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     Disposition Flags                     +G+P|
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RDMA Flush Request, DDP Segment
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  DDP Control  | RDMA Control  |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                      Reserved (Not Used)                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |             DDP (Flush Response) Queue Number (3)             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |         DDP (Flush Response) Message Sequence Number          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RDMA Flush Response, DDP Segment
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  DDP Control  | RDMA Control  |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                      Reserved (Not Used)                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |             DDP (Verify Request) Queue Number (1)             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |         DDP (Verify Request) Message Sequence Number          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Data Sink STag                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                        Data Sink Length                       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Data Sink Tagged Offset                    |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                Hash Value (optional, variable)                |
 |                              ...                              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RDMA Verify Request, DDP Segment
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  DDP Control  | RDMA Control  |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                      Reserved (Not Used)                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |            DDP (Verify Response) Queue Number (3)             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |         DDP (Verify Response) Message Sequence Number         |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     Hash Value (variable)                     |
 |                              ...                              |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RDMA Verify Response, DDP Segment
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  DDP Control  | RDMA Control  |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                      Reserved (Not Used)                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |          DDP (Atomic Write Request) Queue Number (1)          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |      DDP (Atomic Write Request) Message Sequence Number       |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Data Sink STag                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                   Data Sink Length (value=8)                  |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    Data Sink Tagged Offset                    |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Data (64 bits)                        |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Atomic Write Request, DDP Segment
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |  DDP Control  | RDMA Control  |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                      Reserved (Not Used)                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |         DDP (Atomic Write Response) Queue Number (3)          |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |      DDP (Atomic Write Response) Message Sequence Number      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Atomic Write Response, DDP Segment