Network Working Group L. Deng
Internet-Draft China Mobile
Intended status: Informational February 14, 2014
Expires: August 18, 2014

End Point Properties for Peer Selection
draft-deng-taps-datacenter-00.txt

Abstract

Within a data center, the traffic pattern and the performance goals for the transport layer differ markedly from those on the Internet. This draft discusses use cases for transport APIs from the perspective of an application running in a data center environment and proposes potential requirements for the design of such APIs.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on August 18, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.



1. Introduction

The traffic pattern in a data center is quite different from that of the Internet. First, almost all traffic in a data center (over 90%) is carried by TCP. Second, there is extreme deviation among TCP flows in terms of data volume and duration: while most flows are very short and complete within 2-3 round trips, most of the traffic volume belongs to a few long-lasting flows. ToR switches are highly multiplexed, carrying tens of concurrent TCP flows most of the time.

This traffic pattern results from a combination of the following types of data traffic:

(1) highly delay-sensitive short flows resulting from the distributed computing model pervasively employed for interactive Internet applications (web search, social networking);

(2) highly delay-sensitive short flows for cluster control/management; and

(3) delay-tolerant background flows for backup/synchronization with considerably large data volume.

2. Terminology

DC: Data Center, a facility used to house computer systems and associated components, such as telecommunications and storage systems.

ToR: Top of Rack switch, which usually sits on top of a rack of servers, interconnects the local servers within the rack, and serves as their entrance to the rest of the data center network.

VM: Virtual Machine, a software implementation of a machine (i.e., a computer) that executes programs like a physical machine.

VM Migration: Virtual Machine Migration, the process of moving a running virtual machine or application between different physical machines.

NIC: Network Interface Controller, a computer hardware component that connects a computer to a computer network.

DCB: Data Center Bridging, a set of enhancements to Ethernet local area networks for use in data center environments, such as lossless Ethernet.

3. Use Cases

In addition to the web search/query example described in the Introduction, other use cases for optimized data delivery within a DC are presented below.

3.1. VM Related Traffic

In virtualized data centers, to cope with the reliability concerns arising from relatively unreliable commodity hardware platforms, it is common practice to keep several identical VM instances running on different physical servers as backups for one another. In such cases, TCP flows for VM backup or migration, although considerably larger in data volume and longer in duration than typical user traffic, are also delay sensitive.

3.2. Application Priorities

For a data center accommodating multiple applications, differentiated resource provisioning in case of congestion is desirable, according to the DC operator's provisioning policy or the applications' own characteristics.

For instance, if the physical resources in a data center are shared between a delay-sensitive web search engine and a relatively delay-tolerant document/music sharing application, both applications' data traffic, from load balancers to servers and from servers to databases, is multiplexed on the links of the internal DC network.
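
As a purely illustrative sketch (not a proposed API), an application could already express such a preference today by marking each connection's traffic with a different DSCP code point, provided that the DC operator's switches are configured to honor the marks. The code points, addresses, and helper function below are assumptions chosen for the example.

   import socket

   # Assumed DSCP choices; actual values follow the DC operator's policy.
   DSCP_SEARCH  = 46   # e.g. EF for the delay-sensitive search engine
   DSCP_SHARING = 10   # e.g. AF11 for the delay-tolerant sharing service

   def open_marked_connection(addr, dscp):
       # IP_TOS carries the DSCP value in its upper six bits.
       s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
       s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)
       s.connect(addr)
       return s

   search_conn  = open_marked_connection(("10.0.0.10", 80), DSCP_SEARCH)
   sharing_conn = open_marked_connection(("10.0.0.20", 80), DSCP_SHARING)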

3.3. Access Type differentiation

Given the various access types for a specific application, the DC operator may want to enforce different QoS policies for selected groups of users according to their access type. For instance, if the service provider is currently focusing on the mobile market, it could prioritize mobile traffic over fixed traffic.

Among potentially competing service providers, an operator may also want to prioritize traffic from its own applications over that of third-party users.

3.4. Delay Tolerant Traffic

Delay-tolerant traffic, including background software upgrades and other management traffic, such as active measurement traffic for performance monitoring and fault detection, should not impact productive traffic.

4. Transport Optimization in DC

To understand why the DC environment needs transport services different from those of the Internet, it is best to look first at what an optimized transport service would solve from the perspective of a DC application, beginning with the performance degradation issues it faces.

4.1. Performance degradation in DC

In particular, the following three issues affecting transport performance are identified in the DC environment.

4.1.1. Incast Collapse

For the sake of reduced CAPEX, cheap shallow-buffered ToR switches dominate in data centers today and will continue to do so. Hence, the buffer of the ToR switch in front of an aggregator (the server responsible for dividing a task into a group of subtasks and for collecting the responses from its worker servers for result aggregation) can easily be exhausted the instant the workers submit their subtask results over highly synchronized TCP flows, resulting in correlated packet loss across the affected flows. The resulting timeouts cause dramatic performance degradation, since the regular RTT in a data center (less than 10 ms) is orders of magnitude smaller than the traditional TCP RTO configuration (200 ms).
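
A back-of-the-envelope calculation with the figures above (an illustration, not a measurement) shows why a single timeout is so damaging to a short flow:

   rtt = 0.001        # 1 ms intra-DC round-trip time (text: "less than 10 ms")
   rto_min = 0.200    # traditional minimum retransmission timeout

   normal_fct = 3 * rtt                    # short flow finishing in 3 RTTs
   fct_with_timeout = normal_fct + rto_min # same flow hitting a single RTO

   print(fct_with_timeout / normal_fct)    # roughly 68x longer completion time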

4.1.2. Long tail of RTT

Due to the greedy nature of traditional TCP congestion control, large-volume long flows progressively build up queues in the buffer space of intermediaries along the path, resulting in considerable queuing delay at switches for short, delay-sensitive flows.
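
As a rough illustration with assumed numbers (a 1 Gbps ToR uplink and 500 queued full-size frames, neither taken from this draft), even a modest standing queue adds a delay of several intra-DC RTTs:

   queue_bytes = 500 * 1500       # 500 full-size Ethernet frames queued
   link_bps = 1_000_000_000       # assumed 1 Gbps ToR uplink

   queuing_delay = queue_bytes * 8 / link_bps
   print(queuing_delay * 1000)    # ~6 ms of added delay at this one switch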

4.1.3. Buffer Pressure

Another effect of long queues in the buffer space of intermediaries along the path is that they further reduce the buffer space actually available to accommodate bursty, delay-sensitive short flows, even if those flows do not arrive at the same time.

4.2. Transport Optimization Goals

Since both hardware and software are typically deployed and customized by a single DC operator, various private solutions to these issues have been proposed, including cross-layer and cross-boundary ones (requiring cooperation between network devices and end hosts).

In solving the above issues, the various proposals aim to meet some of the following optimization goals:

(1) Reduce loss/timeout occurrence: since TCP performance degradation is caused by packet losses and retransmission timeouts, it has been proposed that a finer-tuned RTO configuration built on a finer-grained timing framework can largely mitigate their impact [Pannas]. In parallel, work from the IEEE DCB family provides lossless Ethernet service at the link layer, which can be used to avoid packet loss as seen from the IP layer and has been demonstrated to be effective as part of a coupled solution for DC transport optimization [detail].

(2) Mitigate the impact of loss/timeout: delay-based congestion control algorithms are expected to be more robust to packet losses and timeouts, and thus to mitigate the incast collapse issue in the DC [vegas].

(3) Avoid lengthy buffer queues: as queuing delay substantially inflates the RTT in the DC environment, performance can be improved by keeping the buffer queues short [dctcp] or even empty [hull]. To do so, the sender may sense the queues at switches through explicit feedback (ECN [dctcp]) or implicit delay variation (Vegas [vegas]); a sketch of the ECN-based approach is given after this list.

(4) Delay-prioritized buffer queuing: during resource-bounded periods, it is essential to make efficient use of the limited resources to deliver the most desirable service, rather than fair-sharing among all competitors and ultimately failing them all. Proposals have been made to allow applications to explicitly indicate a flow's delivery preferences (either as absolute deadline information [d3] or as relative priorities [detail]) in order to improve the overall delivery success rate.

(5) Smooth traffic bursts: on the one hand, (distributed) applications can be refined to introduce random offsets that avoid peaks of concurrent short-flow submissions; on the other hand, random offsets can be introduced into the RTO backoff calculation to mitigate retransmission synchronization [Pannas]. Moreover, physical pacing at the NIC level has been proposed to counter the traffic bursts caused by general OS/server performance optimization techniques [d2tcp].
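
The sketch below illustrates the sender-side reaction behind goal (3), in the style of DCTCP [dctcp]: the sender keeps a moving estimate of the fraction of ECN-marked packets and scales its window reduction accordingly. It is a simplified illustration rather than the normative algorithm; the gain value and the window handling are assumptions.

   class DctcpStyleSender:
       """Simplified, DCTCP-style congestion window adjustment."""

       def __init__(self, cwnd_pkts, gain=1.0 / 16):
           self.cwnd = cwnd_pkts   # congestion window in packets
           self.alpha = 0.0        # moving estimate of the marked fraction
           self.gain = gain        # smoothing gain 'g' (assumed value)

       def on_window_acked(self, acked_pkts, marked_pkts):
           # F: fraction of packets ECN-marked in the last window of data.
           f = marked_pkts / max(acked_pkts, 1)
           # alpha <- (1 - g) * alpha + g * F
           self.alpha = (1 - self.gain) * self.alpha + self.gain * f
           if marked_pkts > 0:
               # Cut the window in proportion to the congestion extent:
               # cwnd <- cwnd * (1 - alpha / 2), never below one packet.
               self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
           else:
               self.cwnd += 1      # additive increase per window, as usual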

5. DC Transport API Considerations

Based on the above discussion, it is believed that the following information flows should be supported by optimized APIs between the application and the core transport service. A hypothetical sketch of how an application might supply these hints is given at the end of Section 5.1.

5.1. Information Flow From The Above

(1) Delivery related: information from the application about its expectations for data delivery. For example, an explicit performance expectation could be specified as:

(1.1) absolute delay requirement; or

(1.2) relative priority indication.

(2) Retransmission related: information from the application about how the transport should deal with packet losses. For example, the information could include:

(2.1) whether loss recovery is needed;

(2.2) if so, the preferred retransmission timeout granularity;

(3) Pacing related: information from the application about the flow's suitability for pacing. For example, the information could include:

(3.1) expected traffic duration, in case of a pace-long-flows-only policy;

(3.2) burstiness expectation.
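
As a purely hypothetical sketch (the structure, field names, and the commented-out open_connection() call are illustrative assumptions, not a proposed API), the hints above could be grouped and handed to the transport when a connection is opened:

   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class TransportHints:
       # (1) Delivery related
       deadline_ms: Optional[float] = None   # (1.1) absolute delay requirement
       priority: Optional[int] = None        # (1.2) relative priority indication
       # (2) Retransmission related
       loss_recovery: bool = True                  # (2.1) loss recovery needed or not
       rto_granularity_ms: Optional[float] = None  # (2.2) preferred RTO granularity
       # (3) Pacing related
       expected_duration_ms: Optional[float] = None  # (3.1) traffic duration
       bursty: Optional[bool] = None                 # (3.2) burstiness expectation

   # Hypothetical usage: a short, deadline-bound query flow.
   query_hints = TransportHints(deadline_ms=20.0, rto_granularity_ms=1.0,
                                expected_duration_ms=5.0, bursty=True)
   # conn = transport.open_connection(("10.0.0.10", 80), hints=query_hints)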

5.2. Information Flow From The Bottom

Congestion status: information from the network devices or the local transport layer about the congestion status of the current transport connection/path.
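
Continuing the hypothetical sketch from Section 5.1 (again, every name below is an illustrative assumption rather than a defined interface), such congestion status could be surfaced to the application as a simple upcall:

   from dataclasses import dataclass

   @dataclass
   class CongestionStatus:
       ecn_marked_fraction: float   # share of recently ECN-marked packets
       srtt_ms: float               # smoothed round-trip time on the path
       retransmissions: int         # retransmissions since the last report

   def on_congestion_status(status: CongestionStatus) -> None:
       # The application reacts, e.g. by deferring delay-tolerant transfers.
       if status.ecn_marked_fraction > 0.1 or status.retransmissions > 0:
           defer_background_transfers()

   def defer_background_transfers() -> None:
       ...  # application-specific policy (placeholder)

   # conn.set_status_callback(on_congestion_status)   # hypothetical registration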

6. Security Considerations

TBA.

7. IANA Considerations

TBA.

8. References

8.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

8.2. Informative References

[Pannas] Vasudevan, V., Phanishayee, A. and H. Shah, "Safe and effective fine-grained TCP retransmissions for datacenter communication", 2009.
[detail] Zats, D., Das, T. and P. Mohan, "DeTail: reducing the flow completion time tail in datacenter networks", 2012.
[vegas] Lee, C., Jang, K. and S. Moon, "Reviving delay-based TCP for data centers", 2012.
[dctcp] Alizadeh, M., Greenberg, A. and D. Maltz, "Data center TCP (DCTCP)", 2010.
[hull] Alizadeh, M., Kabbani, A. and T. Edsall, "Less is more: trading a little bandwidth for ultra-low latency in the data center", 2012.
[d2tcp] Vamanan, B., Hasan, J. and T. Vijaykumar, "Deadline-aware datacenter TCP (D2TCP)", 2012.
[d3] Wilson, C., Ballani, H. and T. Karagiannis, "Better never than late: meeting deadlines in datacenter networks", 2011.

Author's Address

Lingli Deng China Mobile EMail: denglingli@chinamobile.com