Internet Engineering Task Force                                A. Agache
Internet-Draft                                                 C. Raiciu
Intended status: Experimental        University Politehnica of Bucharest
Expires: January 21, 2016                                  July 20, 2015

TCP Sendbuffer Advertising
draft-agache-tcpm-sndbufadv-00

Abstract

Network operators have difficulty understanding the end-to-end performance of TCP connections through their networks. By observing packets at different vantage points on their path and maintaining per-flow state, network operators can detect packet losses and retransmissions and estimate RTTs, among other metrics. A key piece of information needed by networks is whether a connection is limited by the network or by the application. This information is very difficult to infer accurately from passive measurements.

We propose to advertise sendbuffer occupancy in TCP: each segment will carry the amount of backlogged data present in the sender's buffer. This information allows networks to discern between application-limited, network-limited and flow-control limited flows, creating new avenues of network optimization.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 21, 2016.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.



1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

2. Introduction

Aggregate link statistics, such as packet and loss counts, are easily available in modern networks, but they convey a fairly limited picture of network performance. In many cases, the network needs information about individual flows' demand for bandwidth to take the appropriate resource allocation decisions.

One example is a mobile phone streaming audio or video over a WiFi connection. The default strategy is to always stick to WiFi when available, even though performance may be terrible and seriously impair the user experience. If the mobile network knew that the multimedia stream needs more bandwidth, it could fire up the cellular connection and migrate traffic to it by using mobile client offloading software relying on Multipath TCP [NSDI-12] or Mobile IP [RFC5944].

Another example is in datacenters with Clos topologies (such as the popular FatTree topology [FatTree]), where elephant flows are randomly placed on paths with flow-level Equal Cost Multipath Routing; when two or more elephant flows are placed on the same link, performance degrades despite spare capacity elsewhere in the network. The network can reroute such flows by using tunnels or programmable switches (e.g. OpenFlow), but the one thing missing is the information regarding which flows could utilize more capacity if given a better path.

Determining if a TCP connection is network limited or not is difficult to do by passive monitoring. The network needs to keep per-flow state, to estimate the sender congestion window and to accurately monitor flight-size. When flight-size is smaller than the congestion window and the receive window, the connection is limited by the application and does not need more capacity.
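The passive inference described above can be sketched as follows. This is only an illustration: the function name and its parameters are hypothetical, and it assumes the observer has already produced estimates of flight-size, the sender's congestion window and the receive window. The first rule comes from the text; the split between the other two cases (whichever window is smaller is taken to be the limit) is an assumption of the sketch.

```python
def classify_flow(flight_size, est_cwnd, rcv_wnd):
    """Classify a flow from passively estimated quantities (all in bytes).

    flight_size, est_cwnd and rcv_wnd are estimates made by an on-path
    observer; obtaining them accurately is the hard part and is not
    modelled here.
    """
    # Rule from the text: flight-size below both windows means the
    # application is not supplying enough data.
    if flight_size < min(est_cwnd, rcv_wnd):
        return "application-limited"
    # Assumption of this sketch: the smaller window is the binding limit.
    if rcv_wnd <= est_cwnd:
        return "flow-control-limited"
    return "network-limited"


if __name__ == "__main__":
    print(classify_flow(10_000, 30_000, 65_535))
    print(classify_flow(30_000, 30_000, 65_535))
    print(classify_flow(20_000, 30_000, 20_000))
```

Every input to this function is itself an estimate that requires per-flow state, which is why the result is unreliable in practice.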

We propose that each TCP segment should also encode the amount of backlogged data in the TCP sendbuffer. This information enables network boxes and receivers to easily identify connections that need more capacity. Our goal is to have this extension "always on", and it is therefore very important to reduce its overhead. Next, we discuss how to compute and report the amount of backlogged data. We follow with a discussion of signaling options for conveying sendbuffer information.

3. TCP Sendbuffer Structure

           

                     1          2     
            ---|----------|----------|--->
            SND.UNA    SND.NXT   WRITE.SEQ
                                     

        1 - sequence numbers of unacknowledged, in flight data            
        2 - sequence numbers of backlogged data. 
         
                 Anatomy of the TCP Sendbuffer
            

The figure above shows the anatomy of the TCP sendbuffer. SND.UNA represents the oldest sequence number sent but not yet acknowledged. At the other end there is WRITE.SEQ, the tail sequence number of data held in the sendbuffer. Somewhere in-between we have SND.NXT, the sequence number of the next byte to be sent. From SND.NXT to WRITE.SEQ we have backlogged data, written by the application but not yet transmitted.

SND.NXT is constrained by both the receive window and the congestion window as follows:

           
        SND.NXT <= SND.UNA + min(SND.WND, SND.CWND)
            

As long as the receive window is not a bottleneck, and in the absence of hardware issues or software bugs, having SND.NXT smaller than WRITE.SEQ indicates that the congestion window is not large enough, so the connection is network limited at that point in time. The easiest way to implement sendbuffer advertising is to simply copy the amount of backlogged data (WRITE.SEQ-SND.NXT) into the segment when it leaves the TCP stack. However, this will result in non-zero sendbuffer advertisement when the connection is application-limited but the application writes bursts of a few packets. These packets will be sent out immediately on the wire, yet the first packets in the burst will report that the application is backlogged, when in fact it isn't.
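The problem with the naive approach can be illustrated with a small simulation. The sketch below is hypothetical: the function name and the 1460-byte segment size are assumptions, and it assumes the backlog is sampled after SND.NXT has been advanced past the departing segment.

```python
MSS = 1460  # assumed segment size in bytes


def naive_advertisements(snd_nxt, write_seq, mss=MSS):
    """Return the naive backlog (WRITE.SEQ - SND.NXT) carried by each
    segment of a burst that is sent out immediately."""
    adverts = []
    while snd_nxt < write_seq:
        seg_len = min(mss, write_seq - snd_nxt)
        snd_nxt += seg_len                    # the segment leaves the stack
        adverts.append(write_seq - snd_nxt)   # data still queued behind it
    return adverts


if __name__ == "__main__":
    # An application-limited sender writes a burst of three segments.
    print(naive_advertisements(snd_nxt=0, write_seq=3 * MSS))
    # -> [2920, 1460, 0]
```

The first two segments report a backlog even though the whole burst fits in the window and leaves immediately.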

To correctly implement sendbuffer advertisement, the sender MUST advertise the amount of backlogged data according to the formula below:

           
        SEG.SNDBUF = WRITE.SEQ-SND.UNA - min(SND.WND, SND.CWND), 
                      if WRITE.SEQ > SND.UNA + min(SND.WND, SND.CWND)

        SEG.SNDBUF = 0, otherwise
            

This formula ensures that if an application write fits in the current receive and congestion windows, all the resulting segments will advertise zero backlog data.
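A minimal sketch of the formula above is given below. The function and argument names are illustrative, and the sketch uses plain integers, ignoring 32-bit sequence number wraparound, which a real stack would have to handle.

```python
def seg_sndbuf(write_seq, snd_una, snd_wnd, snd_cwnd):
    """Compute SEG.SNDBUF in bytes from the sendbuffer state.

    Assumption: sequence numbers are plain integers that never wrap.
    """
    window = min(snd_wnd, snd_cwnd)
    if write_seq > snd_una + window:
        # Data that does not fit in the current window is backlogged.
        return write_seq - snd_una - window
    return 0


if __name__ == "__main__":
    MSS = 1460  # assumed segment size in bytes
    # A 3-segment burst fits in a 10-segment congestion window.
    print(seg_sndbuf(3 * MSS, 0, 65_535, 10 * MSS))   # -> 0
    # A 50-segment write exceeds the same window by 40 segments.
    print(seg_sndbuf(50 * MSS, 0, 65_535, 10 * MSS))  # -> 58400
```

Because the value depends only on WRITE.SEQ, SND.UNA and the two windows, every segment of a burst that fits in the window carries zero, regardless of where SND.NXT is when the segment is built.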

4. Negotiating sendbuffer advertising

The standard way to extend TCP is to negotiate the extension during the three-way handshake. The TCP option space, however, is already very crowded in the SYN exchange. Until solutions that extend the TCP option space are standardized, negotiation in the SYN exchange is, in our view, not a feasible option for sendbuffer advertising.

Fortunately, sendbuffer advertising is a sender-side-only modification to TCP, and the information it makes available can be used by anyone that understands it, be it the network or the receiver. This implies that we can simply bypass the three-way handshake as long as the actual encoding of the sendbuffer information in TCP segments does not have negative effects on legacy routers, middleboxes and TCP receivers. We discuss encoding in the next section.

TCP sendbuffer advertising will therefore be a simple sender-only enhancement to the TCP stack that can be enabled by using system-wide configuration (e.g. sysctl in Linux).

5. Encoding sendbuffer information

In this section we discuss two encoding alternatives for sendbuffer information: as a new TCP option, or in the acknowledgement and receive-window fields of data segments.

The first solution is to simply encode sendbuffer information in a new TCP option on every segment carrying data in a TCP connection, without negotiating this extension in the three-way handshake. This adds only 6 bytes of overhead to each TCP segment. This option is feasible only when there is sufficient space in the TCP option field of the corresponding data segment.
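One possible layout for such a 6-byte option is sketched below. This document does not define the option format, so everything here is an assumption: a 1-byte kind, a 1-byte length and a 4-byte backlog value in network byte order, the use of experimental kind 253 as a placeholder, and saturation of values that do not fit in 32 bits.

```python
import struct

SNDBUF_OPT_KIND = 253   # placeholder: an experimental option kind, not assigned
SNDBUF_OPT_LEN = 6      # kind (1) + length (1) + backlog value (4)


def encode_sndbuf_option(backlog):
    """Pack the backlog (bytes) into a hypothetical 6-byte TCP option."""
    value = max(0, min(backlog, 0xFFFFFFFF))  # saturate to 32 bits
    return struct.pack("!BBI", SNDBUF_OPT_KIND, SNDBUF_OPT_LEN, value)


def decode_sndbuf_option(raw):
    """Return the advertised backlog, or None if raw is not this option."""
    if len(raw) != SNDBUF_OPT_LEN:
        return None
    kind, length, value = struct.unpack("!BBI", raw)
    if kind != SNDBUF_OPT_KIND or length != SNDBUF_OPT_LEN:
        return None
    return value


if __name__ == "__main__":
    opt = encode_sndbuf_option(58_400)
    print(opt.hex(), decode_sndbuf_option(opt))
```

A fixed-size value keeps parsing trivial for on-path observers; a scaled or variable-length encoding would be an alternative design if larger backlogs need to be represented.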

Avoiding the option negotiation will work well in datacenters, where it can be ensured out-of-band that all machines either understand sendbuffer advertising or are unaffected by segments carrying new options. In the Internet, before advertising sendbuffer information in new TCP options we need to ensure that: a) existing TCP stacks are robust to unknown options, simply ignoring them, and b) middleboxes do not drop segments carrying unknown options. Existing studies [IMC-11] indicate that the vast majority of network paths either allow unknown options or strip the options while allowing the segments through. Only a very small fraction of paths drop segments with unknown options. To cope with such cases, the implementation MUST NOT include sendbuffer information on retransmitted packets, to ensure that the connection makes some progress even in the presence of such middleboxes.
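The sender-side conditions for attaching the option (a data segment, not a retransmission, and enough free option space) can be combined into a single check. The sketch below is illustrative; the function and parameter names are hypothetical, and the 40-byte limit follows from the TCP header's 4-bit data offset field.

```python
MAX_TCP_OPTION_BYTES = 40   # 60-byte maximum header minus the 20-byte fixed part
SNDBUF_OPTION_BYTES = 6     # overhead of the sendbuffer option


def should_attach_sndbuf_option(payload_len, is_retransmission,
                                other_option_bytes):
    """Decide whether an outgoing segment may carry the sendbuffer option."""
    if payload_len == 0:
        return False   # only segments carrying data advertise the backlog
    if is_retransmission:
        return False   # MUST NOT: lets the connection progress on paths
                       # that drop segments with unknown options
    # Attach only if the option fits next to the options already present.
    return other_option_bytes + SNDBUF_OPTION_BYTES <= MAX_TCP_OPTION_BYTES


if __name__ == "__main__":
    # A data segment with a 12-byte padded timestamp option has room.
    print(should_attach_sndbuf_option(1460, False, 12))
    # The same segment, retransmitted, goes out without the option.
    print(should_attach_sndbuf_option(1460, True, 12))
```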

Our second solution is based on the observation that while TCP itself is bidirectional, most connections in practice transfer data in only one direction most of the time. The endpoints can be either data senders or receivers at different moments, but they rarely act as both at the same time. When traffic is unidirectional, the sender sends the same value for the acknowledgement number and receive window field over and over again.

We propose to reuse one or both of these fields to advertise sendbuffer information instead when traffic is unidirectional. To detect unidirectional traffic, the sender will maintain a state variable called SND.NUM_SEG that is initially set to zero, and is reset to zero whenever a segment with a valid ACK field is sent out. SND.NUM_SEG will be incremented whenever a segment is received. A sendbuffer advertisement SHOULD be encoded in outgoing segments only when SND.NUM_SEG = 0.
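The SND.NUM_SEG bookkeeping can be sketched as a small state holder. The class and method names are hypothetical. The text does not say whether pure ACKs count as received segments; this sketch follows the text literally and counts every received segment.

```python
class SndNumSegTracker:
    """Tracks SND.NUM_SEG to decide when the ACK field may be reused."""

    def __init__(self):
        self.num_seg = 0   # SND.NUM_SEG, initially zero

    def on_segment_received(self):
        # Something arrived since our last valid ACK went out.
        self.num_seg += 1

    def on_segment_sent_with_valid_ack(self):
        # The peer has now been sent an up-to-date acknowledgement.
        self.num_seg = 0

    def may_advertise(self):
        # Advertise only when no received segment is awaiting an ACK.
        return self.num_seg == 0


if __name__ == "__main__":
    t = SndNumSegTracker()
    print(t.may_advertise())   # True: nothing received yet
    t.on_segment_received()
    print(t.may_advertise())   # False: next segment must carry a valid ACK
    t.on_segment_sent_with_valid_ack()
    print(t.may_advertise())   # True again
```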

Sendbuffer advertising will encode the proper value in the ACK field and will not set the ACK flag. This ensures that the receiver and other on-path hosts will ignore the field altogether. We still need, however, to inform parties interested in sendbuffer information that they can use the value of the ACK field.

In datacenters, we can simply define one of the reserved TCP flags as the sendbuffer advertisement flag. When this flag is set, the sendbuffer value is encoded in the ACK field. The sendbuffer advertisement flag and the ACK flag MUST NOT be set simultaneously.

In the Internet, redefining the meaning of one of the reserved flags will simply not work through existing middleboxes; additionally, certain middleboxes may zero the ACK field when the ACK flag is not set. In this context, we propose to use the receive window field in segments carrying sendbuffer information to encode a checksum of this information. Interested parties will: a) scan for data segments with the ACK flag not set, and b) compute a 1's complement checksum of the ACK field and check it against the receive window field. In case of a match, the sendbuffer information can be used. To understand the feasibility of this encoding, however, tests must be conducted to check the behaviour of middleboxes when the ACK flag is not set.
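The observer-side check can be sketched as follows. The exact checksum is not specified in this document; the sketch assumes an Internet-checksum-style computation, that is, the complement of the 16-bit 1's complement sum of the two halves of the 32-bit ACK field. All function names are hypothetical.

```python
def ack_field_checksum(ack_field):
    """16-bit 1's complement checksum of the 32-bit ACK field (assumed form)."""
    total = ((ack_field >> 16) & 0xFFFF) + (ack_field & 0xFFFF)
    total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return ~total & 0xFFFF


def extract_sndbuf(ack_flag_set, payload_len, ack_field, window_field):
    """Return the advertised backlog, or None if the segment carries none."""
    # a) only data segments with the ACK flag cleared are candidates
    if ack_flag_set or payload_len == 0:
        return None
    # b) the receive window field must match the checksum of the ACK field
    if ack_field_checksum(ack_field) != window_field:
        return None
    return ack_field


if __name__ == "__main__":
    backlog = 58_400
    window = ack_field_checksum(backlog)
    print(hex(window))                                   # -> 0x1bdf
    print(extract_sndbuf(False, 1460, backlog, window))  # -> 58400
    # A middlebox that zeroed the ACK field breaks the match.
    print(extract_sndbuf(False, 1460, 0, window))        # -> None
```

A 16-bit check leaves a small chance that an unrelated segment matches by accident, so an observer may want to require several consistent segments before trusting the value.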

6. References

6.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

6.2. Informative References

[FatTree] Al-Fares, M., Loukissas, A. and A. Vahdat, "A scalable, commodity data center network architecture", 2008.
[IMC-11] Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A., Handley, M. and H. Tokuda, "Is it still possible to extend tcp?", 2011.
[NSDI-12] Raiciu, C., Paasch, C., Barre, S., Ford, A., Honda, M., Duchene, F., Bonaventure, O. and M. Handley, "How hard can it be? designing and implementing a deployable multipath tcp", 2012.
[RFC5944] Perkins, C., "IP Mobility Support for IPv4, Revised", RFC 5944, November 2010.

Authors' Addresses

Alexandru Agache
University Politehnica of Bucharest
Splaiul Independentei 313
Bucharest, Romania
EMail: alexandru.agache@cs.pub.ro

Costin Raiciu
University Politehnica of Bucharest
Splaiul Independentei 313
Bucharest, Romania
EMail: costin.raiciu@cs.pub.ro