Network Working Group                                         J. Iyengar
Internet-Draft                             Franklin and Marshall College
Intended status: Standards Track                             S. Cheshire
Expires: January 16, 2014                                   J. Graessley
                                                                   Apple
                                                           July 15, 2013
Minion - Wire Protocol
draft-iyengar-minion-protocol-01
Minion uses TCP-format packets on-the-wire, for compatibility with existing NATs, Firewalls, and similar middleboxes, but provides a richer set of facilities to the application, as described in the Minion Service Model document. This document specifies the details of the on-the-wire protocol used to provide those services.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 16, 2014.
Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in "Key words for use in RFCs to Indicate Requirement Levels" [RFC2119].
This document uses terminology like "kernel" and "user-level", as those terms pertain to many of today's Unix-like operating systems. Equivalent concepts apply to software that is built using a different architectural model that may not include such an obvious kernel/user split.
Minion uses TCP-format packets on-the-wire, to provide full compatibility with existing NATs, Firewalls, and similar middleboxes, but provides a richer set of facilities to the application, described in the Minion Service Model and Conceptual API document [minserv]. This document specifies the details of the on-the-wire protocol used to provide those services. Before reading this protocol specification document, familiarity with the Minion Service Model [minserv] is strongly recommended. That information is not repeated here.
Minion runs over a standard TCP connection. Therefore, IP addresses and TCP ports are used just as they are with TCP [RFC0793].
Minion is also designed to be able to use a modified TCP connection which supports out-of-order delivery, giving better low-latency performance on lossy networks, for use by the kinds of application that today would use UDP [RFC0768] to achieve low-latency delivery. The goal of providing low-latency delivery -- and consequently the need to be able to handle a data stream that may have gaps -- is reflected in various aspects of the Minion protocol design, such as the use of DTLS instead of TLS, and the use of Consistent Overhead Byte Stuffing [COBS] for reliably extracting messages from an incomplete data stream. Minion is able to take advantage of out-of-order delivery where the network stack offers that, but Minion does not require it. Minion still works correctly when the performance benefits of out-of-order delivery are not available.
Minion supports messages of arbitrary size. Large messages are broken into chunks a little under 16 kilobytes each (the DTLS maximum record size, minus a few bytes for the Minion header). At the receiving end, the Minion chunks are reassembled into Minion messages and delivered to the client application. Small messages are sent in a single Minion chunk.
Normally messages are sent by the client as a single atomic unit, and delivered to the receiving client as a single atomic unit. For messages too large to fit conveniently in memory, the message may be built incrementally by the sender, and delivered to the receiving client incrementally, a chunk at a time.
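As an illustration only, the chunking step can be sketched in Python. The constants are assumptions: the sketch takes the DTLS maximum record size to be 2^14 bytes and subtracts the eight-byte Minion chunk header to get the largest chunk payload. The function names are hypothetical, and the chunk header itself, DTLS encryption, and framing are left out.

```python
# Assumed sizes: DTLS record payload limit and the Minion chunk header length.
DTLS_MAX_RECORD = 2 ** 14
MINION_HEADER_LEN = 8
MAX_CHUNK_DATA = DTLS_MAX_RECORD - MINION_HEADER_LEN  # 16376 bytes


def split_message(message: bytes, max_data: int = MAX_CHUNK_DATA):
    """Yield (complete, data) pairs; complete is True only for the final chunk."""
    if not message:
        # An empty message still needs one chunk to carry the Complete bit.
        yield True, b""
        return
    for offset in range(0, len(message), max_data):
        data = message[offset:offset + max_data]
        yield offset + max_data >= len(message), data


def reassemble(chunks) -> bytes:
    """Concatenate in-order chunk data until a chunk marked complete is seen."""
    parts = []
    for complete, data in chunks:
        parts.append(data)
        if complete:
            return b"".join(parts)
    raise ValueError("message is incomplete")
```

Because split_message is a generator, a sender can emit each chunk as soon as it is produced instead of holding every chunk of a large message in memory at once.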
When a Minion message is complete, or has accumulated at least one maximum-size Minion chunk of data, and it is eligible to be sent according to the message ordering facilities offered by the Minion Service Model [minserv] (Sender Ordering, Receiver Ordering, and Chaining), a Minion chunk is generated.
Each Minion chunk contains a Minion chunk header followed by the client's message data, as described in Section 3 "Minion Chunk Format".
Each Minion chunk is encrypted using DTLS [RFC6347].
Each encrypted DTLS payload is then framed using RECOBS, as described in Section 4 "Recursively Embeddable COBS", so that it begins with a 00 byte and ends with an FF byte.
The framed, encrypted chunk is then enqueued for transmission.
If the kernel networking code supports multiple priorities, then the framed, encrypted chunk is placed in the transmission queue for the stated priority level. Any time the TCP congestion window and/or receive window rules allow more data to be sent, data is drawn from the highest-priority non-empty transmit buffer, assigned the next block of unused TCP sequence numbers, formed into a TCP segment, and transmitted on the wire. This just-in-time TCP sequencing mechanism has the effect of causing higher-priority data to be inserted right at the front of the conceptual combined transmit buffer, at the earliest possible byte boundary, unconstrained by message or chunk boundaries in the lower-priority messages. This is possible because the RECOBS framing is robust to pre-emption at any arbitrary byte boundary.
Note that, when priorities are supported, chunks above the lowest priority MUST be delivered to the kernel in such a way that they are sent completely before the kernel resumes sending the lower-priority traffic. The RECOBS framing supports interrupting a lower priority stream with a higher-priority chunk, but not alternating back and forth between two priority levels. Once a higher-priority chunk interrupts lower-priority traffic, the higher-priority chunk must be completed before the lower-priority traffic resumes. Typically this is easily achieved by delivering the chunk to the kernel atomically in a single write call.
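The just-in-time sequencing described above can be modelled with a small Python sketch. It is a simplified user-space model, not kernel code: the class and method names are hypothetical, max_bytes stands in for whatever the congestion and receive windows currently allow, and each segment is drawn from a single buffer. Each framed chunk is enqueued in one call, mirroring the single atomic write suggested above.

```python
class PriorityTransmitQueues:
    """One transmit buffer per priority level; index 0 is the highest priority."""

    def __init__(self, levels: int = 4):
        self.buffers = [bytearray() for _ in range(levels)]

    def enqueue(self, priority: int, framed_chunk: bytes) -> None:
        # The whole framed chunk is appended at once, so a higher-priority
        # chunk is always fully queued before any of it is sent.
        self.buffers[priority] += framed_chunk

    def next_segment(self, max_bytes: int) -> bytes:
        """Draw up to max_bytes from the highest-priority non-empty buffer."""
        for buf in self.buffers:
            if buf:
                segment = bytes(buf[:max_bytes])
                del buf[:max_bytes]
                return segment
        return b""
```

In this model a lower-priority buffer is only read when every higher-priority buffer is empty, so a higher-priority chunk that interrupts lower-priority data is always sent to completion before the lower-priority data resumes.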
When connecting to a server with a globally routable address, TCP is generally preferable to UDP. TCP includes the SYN and FIN bits which tell a NAT gateway when a connection starts and ends. In particular, the FIN bit tells the NAT gateway when it can discard state related to that mapping. UDP has no defined connection start/end indicators, so unused UDP mappings are much more likely to accumulate. As a result, NAT gateways tend to be more aggressive about timing out UDP mappings [Study], and clients using UDP need to be more aggressive about sending keepalive traffic, which is bad both for network efficiency and for battery life. Port Control Protocol (PCP) [RFC6887] offers some future hope of alleviating this problem by allowing clients to explicitly negotiate for longer mapping lifetimes, but PCP is not yet widely deployed. In the meantime, if use of UDP increases, NAT gateways are likely to accumulate mappings even more rapidly, with no way to differentiate which are still required and which may be safely discarded, with the result that UDP mappings may have to be discarded even more aggressively. While a discarded UDP mapping can be recreated by another outgoing UDP packet, in the time between when the UDP mapping is discarded and when it is recreated, the client is cut off and unable to receive inbound communication from the server or peer at the other end. Therefore, we believe that it is preferable to use TCP where possible.
However, when connecting to a peer which is itself also behind a NAT gateway, in the absence of PCP support [RFC6887], techniques like Interactive Connectivity Establishment (ICE) [RFC5245] are used, and research has shown that there are cases where ICE works for UDP but not for TCP [RFC5128].
To accommodate both usage scenarios, Minion is generally used with standard TCP format packets, but for peer-to-peer scenarios where TCP ICE is found not to work, Minion can be used encapsulated inside UDP [TCPoUDP] instead.
A Minion Chunk begins with an eight-byte header, followed by the client's message data:
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |C|    Code     |Pri|           This Minion Chunk ID            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   Reserved    |RCP|        Referenced Minion Chunk ID         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                                                               :
   :                       Minion Chunk Data                       :
   :                                                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 1: Minion Chunk Format
If the Complete ('C') bit is zero, this message is incomplete; the receiver should expect to receive additional continuation chunks for this message. If the Complete bit is one, this message is complete; there will be no subsequent continuation chunks for this message.
The seven-bit chunk code identifies what type of chunk this is, as described below.
The two-bit priority field indicates the priority level for this message, with 0 being the highest priority and 3 being the default (lowest-level) priority.
Every Minion chunk has a Chunk ID. This is a 22-bit value assigned from a monotonically increasing 22-bit cyclic counter. This means that Chunk IDs are reused every 2^22 chunks. At any given moment in time though, only a small portion of the 22-bit ID space is actively in use, so Chunk IDs are not ambiguous. Each of the four priority levels has its own 22-bit Chunk ID space, i.e., Priority 1 Chunk 7 and Priority 2 Chunk 7 are different chunks. Also, the Chunk ID spaces in opposite directions on a connection are separate. Each sender is responsible for selecting the Chunk IDs for the chunks it sends.
In some cases it is useful to refer to messages by ID, and the terms "Message ID" and "Chunk ID" are sometimes used interchangeably. For a message that is sent using a single chunk, the Message ID is the same as the Chunk ID. For a message that is sent using multiple chunks, the Message ID is the Chunk ID of the *final* chunk of the message. One implication of this is that a message's ID is undefined until the message is complete.
Because Chunk IDs are eventually reused, issues of ID lifetime must be carefully considered in the Minion protocol design. For example, since a remote peer could, in principle, wait an arbitrarily long time before replying to a message, the Message ID of a request that is awaiting a response MUST NOT be reused until the response has been received, and the client has disposed of the request message. Otherwise, a reply could be ambiguous, if there were two outstanding request messages both using the same Message ID at the same time. Likewise, the last Chunk ID of an incomplete message MUST NOT be reused until some subsequent chunk has been added to that message, referencing the previous Chunk ID.
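One possible way for a sender to honour these reuse rules is sketched below in Python. This is an assumption-laden illustration rather than specified behaviour: the class, its methods, and the choice to start each counter at zero are all hypothetical. Each priority level has its own 22-bit cyclic counter, and IDs that are still held (an outstanding request, or the last chunk of an incomplete message) are skipped when the counter wraps.

```python
CHUNK_ID_BITS = 22
CHUNK_ID_MODULUS = 1 << CHUNK_ID_BITS
PRIORITY_LEVELS = 4


class ChunkIdAllocator:
    """Per-priority cyclic Chunk ID counters that skip IDs still in use."""

    def __init__(self):
        # Starting value of zero is an assumption for this sketch.
        self._next = [0] * PRIORITY_LEVELS
        self._held = [set() for _ in range(PRIORITY_LEVELS)]

    def allocate(self, priority: int) -> int:
        held = self._held[priority]
        if len(held) >= CHUNK_ID_MODULUS:
            raise RuntimeError("no free Chunk IDs at this priority")
        candidate = self._next[priority]
        while candidate in held:
            candidate = (candidate + 1) % CHUNK_ID_MODULUS
        self._next[priority] = (candidate + 1) % CHUNK_ID_MODULUS
        return candidate

    def hold(self, priority: int, chunk_id: int) -> None:
        """Mark an ID as not reusable, e.g. a request awaiting its reply."""
        self._held[priority].add(chunk_id)

    def release(self, priority: int, chunk_id: int) -> None:
        """Allow an ID to be reused once nothing can refer to it any more."""
        self._held[priority].discard(chunk_id)
```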
The Reserved field MUST be set to zero on transmission, and MUST be ignored on reception.
For chunk types that need to refer to some other chunk, the Referenced Minion Chunk Priority (RCP) and Referenced Minion Chunk ID fields identify the referenced chunk. Note that some chunk types refer to chunks going in the same direction (e.g., a continuation chunk) and some chunk types refer to chunks going in the reverse direction (e.g., a reply chunk). For chunk types that do not refer to any other chunk, these two fields MUST be set to zero on transmission, and MUST be ignored on reception.
The Minion Chunk payload data follows the Minion Chunk Header.
There is no explicit length field in the Minion Chunk Header, because the chunk length is determined implicitly in the RECOBS decoding step.
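The header layout in Figure 1 can be illustrated with a short Python sketch that packs and unpacks the eight-byte header. The sketch assumes the usual IETF conventions (bit 0 is the most significant bit, and the two 32-bit words are in network byte order); the ChunkHeader type and function names are hypothetical, and the chunk code value used in any example is arbitrary.

```python
import struct
from typing import NamedTuple


class ChunkHeader(NamedTuple):
    complete: bool          # 'C' bit
    code: int               # 7-bit chunk code
    priority: int           # 2-bit priority, 0 is highest
    chunk_id: int           # 22-bit Chunk ID
    ref_priority: int = 0   # 2-bit RCP field
    ref_chunk_id: int = 0   # 22-bit Referenced Minion Chunk ID


def pack_header(h: ChunkHeader) -> bytes:
    if not (0 <= h.code < 128 and 0 <= h.priority < 4 and 0 <= h.ref_priority < 4):
        raise ValueError("code or priority out of range")
    if not (0 <= h.chunk_id < 1 << 22 and 0 <= h.ref_chunk_id < 1 << 22):
        raise ValueError("Chunk ID does not fit in 22 bits")
    word0 = (int(h.complete) << 31) | (h.code << 24) | (h.priority << 22) | h.chunk_id
    # The Reserved byte occupies the top 8 bits of word1 and is sent as zero.
    word1 = (h.ref_priority << 22) | h.ref_chunk_id
    return struct.pack("!II", word0, word1)


def unpack_header(data: bytes) -> ChunkHeader:
    word0, word1 = struct.unpack("!II", data[:8])
    return ChunkHeader(
        complete=bool(word0 >> 31),
        code=(word0 >> 24) & 0x7F,
        priority=(word0 >> 22) & 0x3,
        chunk_id=word0 & 0x3FFFFF,
        ref_priority=(word1 >> 22) & 0x3,   # Reserved bits are ignored
        ref_chunk_id=word1 & 0x3FFFFF,
    )
```

The chunk payload is simply whatever follows the first eight bytes of the decoded frame, since the header carries no length field.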
The seven-bit chunk code identifies what kind of chunk this is. There are 128 chunk codes available. The following eight chunk codes are currently defined:
Consistent Overhead Byte Stuffing [COBS] allows complete messages to be reliably located within an incomplete data stream that may contain gaps.
COBS works by transforming the payload data to eliminate all occurrences of zero bytes. This is like PPP byte stuffing, but more efficient; COBS has a worst-case data size overhead below 0.5%. Having created a zero-free payload, the payloads can then be concatenated into a single byte stream, separated by single zero bytes, and the zero bytes unambiguously mark the boundaries between payloads, because we know the payloads themselves no longer contain any zero bytes. At the receiving end the transformation is reversed to recreate the original payload data.
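For readers unfamiliar with the technique, the following Python sketch shows one common formulation of the classic COBS transformation. It illustrates the original COBS scheme only, not the RECOBS variant used by Minion, and the function names are hypothetical. Each code byte gives the distance to the next code byte; a code below 0xFF also implies that a zero byte followed the block in the original data.

```python
def cobs_encode(data: bytes) -> bytes:
    """Return a zero-free encoding of data (classic COBS)."""
    out = bytearray([0])        # placeholder for the first code byte
    code_index = 0
    code = 1
    for byte in data:
        if byte == 0:
            out[code_index] = code      # close the current block
            code_index = len(out)
            out.append(0)
            code = 1
        else:
            out.append(byte)
            code += 1
            if code == 0xFF:            # block is full: 254 non-zero bytes
                out[code_index] = code
                code_index = len(out)
                out.append(0)
                code = 1
    out[code_index] = code
    return bytes(out)


def cobs_decode(encoded: bytes) -> bytes:
    """Reverse cobs_encode; raise ValueError on malformed input."""
    out = bytearray()
    index = 0
    while index < len(encoded):
        code = encoded[index]
        block = encoded[index + 1:index + code]
        if code == 0 or len(block) != code - 1 or 0 in block:
            raise ValueError("malformed COBS data")
        out += block
        index += code
        if code != 0xFF and index < len(encoded):
            out.append(0)               # implicit zero between blocks
    return bytes(out)
```

For example, the four bytes 11 22 00 33 encode to 03 11 22 02 33, after which a single zero byte can safely be used as the delimiter between payloads.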
The transformation process [COBS] is, in effect, a simple run length encoding. An extremely simplified summary of the original 1997 COBS encoding is as follows:
Recursively Embeddable COBS (RECOBS) is a derivative of the original 1997 COBS encoding. RECOBS code bytes have the following meanings:
This has the effect that, after encoding, every payload has unambiguous bookends; every payload begins with a single 00, and ends with a single FF. Using this encoding, recursive embedding becomes possible. At *any* point in the encoded byte stream it is now possible to interrupt the byte stream, insert a new RECOBS-encoded payload, and then resume the previous byte stream.
At the receiving end, the decoder is part-way through decoding a payload when the interruption occurs. The decoder sees a 00, which is not legal in RECOBS-encoded data, so the decoder knows a new payload is beginning. Because the decoder has not yet seen the FF end-marker for the previous payload, it knows that payload is incomplete, so it saves its decoding state for later resumption. The decoder then proceeds to decode the embedded payload. When the decoder sees the FF end-marker for the embedded payload, it delivers that fully decoded payload to the waiting client, and then resumes its decoding of the previously interrupted payload.
In principle this recursive embedding could be nested arbitrarily deeply, limited only by the amount of storage the decoder has available for partially-received payloads and their associated decoding state.
In practice, Minion limits RECOBS embedding to four levels (the base level plus three levels of nested interruption) to establish a defined upper bound on the amount of storage required by a decoder.
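The save-and-resume behaviour of the decoder can be sketched in Python as a stack of partially received payloads. This sketch works at the framing level only: it does not interpret RECOBS code bytes, and it returns the still-encoded body of each payload. It assumes that 00 and FF never occur inside an encoded body, and it ignores gaps in the stream. The class name is hypothetical.

```python
FRAME_START = 0x00
FRAME_END = 0xFF
MAX_DEPTH = 4   # base level plus three levels of nested interruption


class NestedFrameSplitter:
    """Separate nested 00 ... FF frames from a byte stream."""

    def __init__(self):
        self._stack = []    # partially received bodies, innermost last

    def feed(self, data: bytes) -> list:
        """Consume bytes; return bodies of frames completed by this call."""
        completed = []
        for byte in data:
            if byte == FRAME_START:
                if len(self._stack) == MAX_DEPTH:
                    raise ValueError("nesting exceeds four levels")
                # Any interrupted body stays on the stack for later resumption.
                self._stack.append(bytearray())
            elif not self._stack:
                raise ValueError("data outside any frame")
            elif byte == FRAME_END:
                completed.append(bytes(self._stack.pop()))
            else:
                self._stack[-1].append(byte)
        return completed
```

An embedded payload is delivered before the payload it interrupted, and the fixed depth limit bounds the number of partial bodies the decoder ever has to hold.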
TCP [RFC0793] implements flow control in the form of the advertised receive window. This is to prevent a faster sender from overwhelming a slower receiver. Minion requires similar protection to prevent a slower receiver from running out of memory while trying to buffer messages that arrive faster than it can handle them.
For a pure user-level library implementation of Minion, this is achieved by having the library set an upper bound on the amount of memory it will use for storing received messages that have not yet been handled by the client. Once this limit is met, the library ceases reading TCP data from the kernel, which causes the TCP receive window to fill up, which causes the sender to stop sending. Once the client consumes some messages, the library then reads more data from the kernel, the TCP receive window opens up, and the sender is permitted to send more data.
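A minimal Python sketch of this user-level approach follows. The names ReceiveBudget and read_if_allowed are hypothetical, the accounting is in raw bytes rather than decoded messages, and sock is assumed to be any object with a socket-style recv method. When the budget is exhausted the library simply does not call recv, leaving the data in the kernel so that the TCP receive window fills.

```python
class ReceiveBudget:
    """Tracks bytes held by the library but not yet consumed by the client."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.buffered = 0

    def may_read(self) -> bool:
        return self.buffered < self.limit

    def on_buffered(self, size: int) -> None:
        self.buffered += size

    def on_consumed(self, size: int) -> None:
        self.buffered -= size


def read_if_allowed(sock, budget: ReceiveBudget, max_read: int = 4096):
    """Read from sock only while under budget; return None when paused."""
    if not budget.may_read():
        return None     # data stays in the kernel; the sender is throttled
    data = sock.recv(max_read)
    budget.on_buffered(len(data))
    return data
```

Once the client consumes a message, calling on_consumed lowers the count, the next read_if_allowed call reads from the kernel again, and the TCP receive window reopens.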
However, this means that there is some duplication of buffering -- the TCP receive window in the kernel and additional buffering in the user-level library. For this reason a kernel extension is proposed where a client (the Minion library in this case) can read data from the connection *without* raising the TCP receive window. In a sense it is reading the data "secretly", without admitting to the sender at the other end that it has been read. Those bytes, even though read into user space, are still counted against the TCP receive window. Later, after the client application has actually consumed the message, another kernel call is made to acknowledge consumption of those bytes, and the TCP receive window is raised.
This mechanism integrates message-level flow control with TCP's byte-level flow control, rather than having two independent flow control mechanisms happening concurrently at different levels, in ways that might interact badly with each other.
Note that the Minion protocol design will have to consider possible deadlock situations. For example, suppose one Minion host is refusing to consume any more Minion Chunks because it wishes to send a Reject message for them, but it cannot, because the peer's receive window is closed. Suppose also that the reason the peer's receive window is closed is because the peer also is sitting on a pile of unwanted Minion Chunks that it refuses to consume until it can send a Reject message for them. Possible deadlocks such as these need to be considered, and mechanisms to avoid them created.
One of the main arguments that is often presented to justify why a particular application protocol is built on UDP instead of TCP is that, "UDP is better for 'real time' applications." The supporting reasoning for this is often that, "TCP insists on continuing to retransmit data long after the client doesn't need any more." In truth the real problem is not retransmission; it is that the conventional TCP APIs don't allow received data to be delivered out of order. If a TCP sender has 50 packets in flight at any given time (e.g., the bandwidth x delay product is 75 kB), then the loss of a single packet causes all 49 following packets to stall at the receiver, because the API doesn't allow them to be delivered to the client until the missing packet has been received.
Minion solves this problem by allowing data to be delivered as it arrives, even if there are gaps. But the argument still remains that even after removing the ordering requirement at the receiver, it may still be a waste of bandwidth to retransmit data that will arrive too late to be useful. And indeed, it is possible with TCP to fraudulently acknowledge segments that were in fact not received, and this will cause the sender to not retransmit those segments.
However, we chose not to use fraudulent acknowledgements to suppress retransmissions, because certain NATs, Firewalls and other middleboxes may block traffic if they observe implausible protocol actions which they find suspicious. One of the important goals of Minion is 100% compatibility with today's existing Internet devices, not 99% compatibility.
We expect packet loss to be about 1% (at most a few percent) in a functioning network, and the cost of retransmitting those lost packets, even in the extreme case where *all* the retransmissions turn out to be unnecessary, is an overhead of about 1%. We argue that an overhead of about 1% is an acceptable price to pay in exchange for 100% compatibility with existing NATs, Firewalls and other middleboxes.
While Minion can be implemented entirely as a user-level library built on top of existing standard networking APIs like BSD sockets, it can also benefit from some optional kernel extensions:
These optional kernel extensions are a key part of what makes Minion compelling. Minion can be adopted today by any application, using Minion as a purely user-space library. Such an application performs as well as any application can when it is built on top of standard TCP. However, unlike an application built on top of standard TCP, Minion offers the promise of future kernel support for even better performance. Any given application with its own application-specific protocol is unlikely to receive special kernel support to make just that one application work better. But when many applications all use the Minion protocol, it then becomes reasonable to add kernel support to improve all of those applications.
When implemented entirely as a user-level library, Minion naturally adheres to the TCP specifications (insofar as the underlying operating system adheres to the TCP specifications) because Minion is merely using the operating system's networking APIs.
When optional kernel extensions are in use, they may allow Minion to deviate from classical TCP protocol rules. One such instance of this deviation has already been identified. The TCP protocol rules allow a sender to send a FIN to end a connection, and then follow it with additional data bytes (with higher TCP sequence numbers, so that they fall later in the data stream) which the receiver is expected to discard because it recognizes that they fall after the FIN in the data stream. When out-of-order delivery is enabled, it's possible that if the TCP segment containing the FIN is lost or delayed, then subsequent TCP segments containing data bytes could be incorrectly delivered to the client application, when the TCP protocol rules dictate that they should have been discarded. The ability to send data following the FIN that the receiver is expected to discard is incompatible with out-of-order delivery. Note that this is referring to data that follows the FIN in TCP sequence number space, not data that follows the FIN in transmission order. If, after the FIN has been sent, previously transmitted data is lost and needs to be retransmitted, then this does not cause any problems; the bytes in such retransmitted TCP segments fall *before* the FIN in TCP sequence number space, not after. As a result of this observation, TCP's protocol rules, when used with Minion traffic, are effectively modified as follows:
In reality we do not expect this to be a major burden to TCP implementations. We are not aware of TCP implementations that send data after a connection is closed and then rely on the receiver to discard that data.
No IANA actions are required by this document.
We take security seriously. As this work develops, this section will contain details of any known security issues and possible mitigations.
Many thanks to Bryan Ford, Padma Bhooma and Anumita Biswas for their contributions to the development of Minion.
Thanks to Joe Touch for pointing out that Minion restricts TCP's ability to send data, after a connection is closed, that will then be ignored by the receiver.
[RFC0793] | Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981. |
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |
[RFC6347] | Rescorla, E. and N. Modadugu, "Datagram Transport Layer Security Version 1.2", RFC 6347, January 2012. |
[minserv] | Iyengar, J., "Minion - Service Model and Conceptual API", Internet-Draft draft-iyengar-minion-concept-00, June 2013. |
[COBS] | Cheshire, S. and M. Baker, "Consistent Overhead Byte Stuffing", September 1997. |
[Study] | Hatonen, S., Nyrhinen, A., Eggert, L., Strowes, S., Sarolahti, P. and M. Kojo, "An Experimental Study of Home Gateway Characteristics", November 2010. |
[RFC0768] | Postel, J., "User Datagram Protocol", STD 6, RFC 768, August 1980. |
[RFC5128] | Srisuresh, P., Ford, B. and D. Kegel, "State of Peer-to-Peer (P2P) Communication across Network Address Translators (NATs)", RFC 5128, March 2008. |
[RFC5245] | Rosenberg, J., "Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols", RFC 5245, April 2010. |
[RFC6887] | Wing, D., Cheshire, S., Boucadair, M., Penno, R. and P. Selkirk, "Port Control Protocol (PCP)", RFC 6887, April 2013. |
[TCPoUDP] | Cheshire, S., Graessley, J. and R. McGuire, "Encapsulation of TCP and other Transport Protocols over UDP", Internet-Draft draft-cheshire-tcp-over-udp-00, June 2013. |