TAPS BoF                                                    Lingli Deng
INTERNET-DRAFT                                             China Mobile
Expires: September 3, 2014                                March 3, 2014
Considerations on Transport Services API for Data Centers
draft-deng-taps-datacenter-01
Abstract
Traffic within a data center has traffic patterns and transport-layer
performance goals that differ from those of general Internet traffic.
This draft discusses use cases for transport APIs from the
perspective of an application running in a data center environment,
and proposes potential requirements for the design of such APIs.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright and License Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
<Deng, et al.> Expires September 3, 2014 [Page 1]
INTERNET DRAFT <ISP TURN For WEBRTC> March 3, 2014
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1 Introduction
2 Terminology
3 Usecases
  3.1 Web Search
  3.2 VM Related Traffic
  3.3 Application Priorities
  3.4 Access Type Differentiation
  3.5 Delay Tolerant Traffic
4 Transport Optimization in DC
  4.1 Performance degradation in DC
    4.1.1 Incast Collapse
    4.1.2 Long tail of RTT
    4.1.3 Buffer Pressure
  4.2 Transport Optimization Goals
5 DC Transport API Considerations
  5.1 Information Flow From The Above
  5.2 Information Flow From The Bottom
6 Security Considerations
7 Acknowledgements
8 IANA Considerations
9 References
Authors' Addresses
1 Introduction
Traffic within a data center has traffic patterns and transport-layer
performance goals that differ from those of general Internet traffic.
This draft discusses use cases for transport APIs from the
perspective of an application running in a data center environment,
and proposes potential requirements for the design of such APIs.
2 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
DC: Data Center. A facility used to house computer systems and
associated components, such as telecommunications and storage
systems.
ToR: Top of Rack switch. A switch that usually sits at the top of a
rack of servers; it interconnects the local servers within the rack
and serves as their entrance to the rest of the data center network.
VM: Virtual Machine. A software implementation of a machine (i.e.,
a computer) that executes programs like a physical machine.
VM Migration: Virtual Machine Migration. The process of moving a
running virtual machine or application between different physical
machines.
NIC: Network Interface Controller. A computer hardware component
that connects a computer to a computer network.
DCB: Data Center Bridging. A set of enhancements to Ethernet local
area networks for use in data center environments, such as lossless
Ethernet.
3 Usecases
This section presents use cases for optimized data delivery within a
data center.
3.1 Web Search
Within a production data center hosting a web search engine, three
types of TCP traffic have been identified: (1) highly delay-sensitive
short flows resulting from the distributed computing model employed
pervasively for interactive Internet applications (web search/social
networking); (2) highly delay-sensitive short flows for cluster
control/management; and (3) delay-tolerant background flows for
backup/synchronization with considerably larger data volume.
3.2 VM Related Traffic
In virtualized data centers, general commodity hardware platforms are
relatively unreliable. To cope with the resulting reliability
concerns, it is common practice to keep several identical VM
instances running on different physical servers as backups for each
other. In such cases, TCP flows for VM backup or migration, although
considerably larger in data volume and longer in duration than
typical user traffic, are also delay sensitive.
3.3 Application Priorities
For a data center accommodating multiple applications, the operator
would prefer to differentiate resource provisioning in case of
congestion, according to the DC operator's provisioning policy or the
applications' own characteristics.
For instance, if the physical resources in a data center are shared
between a delay-sensitive web search engine and a relatively
delay-tolerant document/music sharing application, the data traffic
of both applications shares the links from load balancer to servers
and from servers to database, and is multiplexed on the internal DC
network.
3.4 Access Type Differentiation
Given various access types for a specific application, the DC
operator may want to enforce different QoS policies for selected
groups of users, according to their access type. For instance, if the
service provider is currently targeting the mobile market, it could
prioritize mobile traffic over fixed traffic.
Where service providers potentially compete, a provider may also want
to prioritize direct traffic from its own application over traffic
from third-party users.
3.5 Delay Tolerant Traffic
Delay-tolerant traffic, including background software upgrades and
other management traffic such as active measurement data for
performance monitoring/fault detection, should not impact any real
production traffic.
4 Transport Optimization in DC
To fully understand why the DC environment needs special transport
services as compared to the Internet, it is useful to look first at
what problems an optimized transport service would need to solve from
the perspective of a DC application, beginning with the issues it
faces in terms of performance degradation.
4.1 Performance degradation in DC
In particular, the following three issues are identified in the DC
environment in terms of transport performance.
4.1.1 Incast Collapse
For the sake of reduced CAPEX, cheap shallow-buffered ToR switches
currently dominate in data centers and will continue to do so. Hence
the buffer space of the ToR switch in front of an aggregator (the
server responsible for dividing a task into a group of subtasks and
collecting responses from its worker servers for result aggregation)
is easily exhausted at the instant that workers return their subtask
responses through highly synchronized TCP flows, resulting in
persistent packet loss over the affected flows. The resulting
timeouts cause a dramatic performance degradation, since the regular
RTT in a data center (less than 10ms) is orders of magnitude smaller
than the traditional TCP RTO configuration (200ms).
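
The cost of a single timeout can be illustrated with some rough
arithmetic. The sketch below uses the RTT and RTO figures quoted above;
the worker count, response size, and buffer size are hypothetical
values chosen only for illustration, and the overflow check ignores
any draining of the queue while the burst arrives.

```python
def rto_penalty(rtt_ms: float, rto_ms: float) -> float:
    """Number of round trips a flow sits idle while waiting for one RTO."""
    return rto_ms / rtt_ms


def overflows(workers: int, burst_bytes: int, buffer_bytes: int) -> bool:
    """True if synchronized bursts from all workers exceed the port buffer.

    Simplification: every burst arrives at the same instant and nothing
    is drained from the buffer in the meantime.
    """
    return workers * burst_bytes > buffer_bytes


# Figures from the text: RTT below 10 ms, RTO of 200 ms, so one timeout
# costs at least 20 round trips of idle time.
idle_rtts = rto_penalty(rtt_ms=10, rto_ms=200)

# Hypothetical incast: 40 workers each answer with 32 KiB into a 1 MiB buffer.
lossy = overflows(workers=40, burst_bytes=32 * 1024, buffer_bytes=1024 * 1024)
```

With these assumed numbers the aggregate burst (1280 KiB) does not fit
in the buffer, so some responses are dropped and their flows wait out
the RTO.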
4.1.2 Long tail of RTT
Unlike typical Internet traffic, the end-to-end RTT of a TCP flow
within a data center is close to zero in the absence of queuing, as
the propagation delay is trivial (in us) over the short distance
between any pair of servers (typically tens of meters). Due to the
greedy nature of traditional TCP algorithms, large-volume long flows
progressively build up queues in the buffer space of intermediaries
along the path, resulting in considerable queuing delay at switches
(in ms) for short delay-sensitive flows.
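
The gap between propagation delay and queuing delay can be sketched
numerically. The distance, queue depth, link rate, and signal speed
below are assumptions picked for illustration, not measurements from
any particular data center.

```python
def propagation_delay_us(distance_m: float, speed_m_per_s: float = 2e8) -> float:
    """One-way propagation delay in microseconds.

    The default speed (about two thirds of the speed of light) is a
    common rule of thumb for signals in cable, used here as an assumption.
    """
    return distance_m / speed_m_per_s * 1e6


def queuing_delay_ms(queued_bytes: int, link_gbps: float) -> float:
    """Time in milliseconds to drain a queue at the given link rate."""
    return queued_bytes * 8 / (link_gbps * 1e9) * 1e3


# Hypothetical path: 50 m of cable, 1 MB already queued on a 1 Gbps port.
prop = propagation_delay_us(50)            # about 0.25 us
queue = queuing_delay_ms(1_000_000, 1.0)   # about 8 ms
```

Under these assumptions a short flow arriving behind the queue waits
several orders of magnitude longer than it spends on the wire.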
4.1.3 Buffer Pressure
Another effect of long queues in the buffer space of intermediaries
along the path is that they further reduce the buffer space actually
available to accommodate bursty delay-sensitive short flows, even if
those flows are not submitted at the same time.
4.2 Transport Optimization Goals
Since both hardware and software devices are typically deployed and
customized by a single DC operator, various private solutions for
these issues have been proposed, including cross-layer and
cross-boundary ones (requiring cooperation between network devices
and end hosts).
In solving the above issues, these proposals aim to meet some of the
following optimization goals:
(1) Reduce loss/timeout occurrence: since TCP performance degradation
is caused by packet losses/retransmission timeouts, it is proposed
that a finer-tuned RTO configuration and a finer-grained timing
framework could largely mitigate the impact [Pannas]. In the
meantime, there is work from the IEEE DCB family providing lossless
Ethernet service at the link layer, which can be used to avoid packet
loss as seen from the IP layer and has been demonstrated to be
effective in a coupled solution for DC transport optimization
[detail].
(2) Mitigate impact from loss/timeout: delay-based CC algorithms are
expected to be more robust to packet losses/timeouts in mitigating
the incast collapse issue [vegas].
(3) Avoid lengthy buffer queues: as queuing delay substantially
impacts the RTT in the DC environment, performance can be improved by
keeping the buffer queues short [dctcp] or even empty [hull]. To do
so, the sender may sense the queue at switches through explicit
feedback (ECN in [dctcp]) or implicit delay variation (Vegas
[vegas]).
(4) Delay prioritized buffer queuing: during resource-bounded
periods, it is essential to use the limited resources efficiently to
deliver the most desirable service, rather than sharing them fairly
among all competitors and ultimately failing them all. Proposals have
been made to allow applications to explicitly indicate a flow's
delivery preferences (either by absolute deadline information [d3] or
by relative priorities [detail]), in order to improve the overall
delivery success rate.
(5) Smooth traffic bursts: on one hand, (distributed) applications
could be refined to introduce a random offset to avoid peaks of
concurrent short flow submission; on the other hand, a random offset
could be introduced into the RTO backoff calculation to mitigate
retransmission synchronization [Pannas]. Moreover, physical pacing at
the NIC level is proposed to counter the effect of traffic bursts
caused by general OS server performance optimization techniques
[d2tcp].
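
The random offset in the RTO backoff calculation mentioned under goal
(5) could look roughly like the following. This is a minimal sketch
and not the formula of any cited proposal: the function name, the
doubling backoff, and the jitter range are all assumptions made for
illustration.

```python
import random
from typing import Optional


def jittered_rto(base_rto_ms: float, attempt: int, jitter: float = 0.5,
                 rng: Optional[random.Random] = None) -> float:
    """Exponential backoff with a random offset (hypothetical helper).

    attempt 0 is the first retransmission. The result lies in
    [backoff, backoff * (1 + jitter)), where backoff doubles with each
    attempt, so senders that time out together are unlikely to
    retransmit at the same instant.
    """
    rng = rng or random.Random()
    backoff = base_rto_ms * (2 ** attempt)
    return backoff * (1.0 + jitter * rng.random())


# Two senders that lost packets in the same incast event draw different
# offsets (seeded here only to make the example repeatable).
sender_a = jittered_rto(1.0, attempt=0, rng=random.Random(1))
sender_b = jittered_rto(1.0, attempt=0, rng=random.Random(2))
```

The same idea applies on the application side, where a worker could
add a small random delay before submitting its response.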
5 DC Transport API Considerations
According to the above discussion, it is believed that the following
information flows should be supported by optimized APIs between the
application and the core transport service.
5.1 Information Flow From The Above
(1) Delivery related: refers to the information from the application
about its expectation on data delivery. For example, an explicit
performance expectation could be specified by (1.1) an absolute delay
requirement; or (1.2) a relative priority indication.
(2) Retransmission related: refers to the information from the
application about how the transport should deal with packet losses.
For example, the information could include: (2.1) whether loss
recovery is needed; (2.2) if so, the preferred retransmission timeout
granularity.
(3) Pacing related: refers to the information from the application
about the expected characteristics of the flow, which determine
whether pacing is applicable.
For example, the information could include: (3.1) traffic duration,
in case of a policy that paces long flows only; (3.2) expected
burstiness.
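
One possible shape for these downward hints is sketched below. It is
an illustration only: the `FlowHints` type, its field names, the units,
and the `should_pace` helper are all hypothetical and are not defined
by this draft or by any existing transport API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowHints:
    """Hypothetical per-flow hints passed from application to transport."""
    # (1) Delivery related
    deadline_ms: Optional[float] = None           # (1.1) absolute delay requirement
    priority: Optional[int] = None                # (1.2) relative priority
    # (2) Retransmission related
    loss_recovery: bool = True                    # (2.1) loss recovery needed or not
    rto_granularity_us: Optional[int] = None      # (2.2) preferred RTO granularity
    # (3) Pacing related
    expected_duration_ms: Optional[float] = None  # (3.1) traffic duration
    bursty: Optional[bool] = None                 # (3.2) burstiness expectation

    def __post_init__(self) -> None:
        if self.deadline_ms is not None and self.deadline_ms <= 0:
            raise ValueError("deadline_ms must be positive")
        if not self.loss_recovery and self.rto_granularity_us is not None:
            raise ValueError("rto_granularity_us requires loss_recovery")


def should_pace(hints: FlowHints, long_flow_ms: float = 1000.0) -> bool:
    """Pace only flows expected to last at least long_flow_ms (assumed policy)."""
    return (hints.expected_duration_ms is not None
            and hints.expected_duration_ms >= long_flow_ms)


# A short search query with a deadline versus a long background backup.
query = FlowHints(deadline_ms=20, expected_duration_ms=5, bursty=True)
backup = FlowHints(priority=0, expected_duration_ms=60_000)
```

Keeping every field optional lets an application state only what it
knows, leaving the transport to fall back to its defaults otherwise.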
5.2 Information Flow From The Bottom
Congestion status: refers to information from the network device or
local transport layer about the congestion status of the current
transport connection/path.
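
The upward flow could be exposed as a notification that the
application subscribes to. The sketch below is an assumption about
what such an interface might look like: the `CongestionStatus` fields,
the `Connection` stand-in, and the reaction in `choose_batch_size` are
hypothetical and only illustrate the direction of the information flow.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CongestionStatus:
    """Hypothetical snapshot of congestion on one connection/path."""
    ecn_marked_fraction: float  # share of recent packets that carried an ECN mark
    smoothed_rtt_us: float      # RTT estimate, which grows as queues build up


class Connection:
    """Stand-in for a transport connection that reports status upward."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[CongestionStatus], None]] = []

    def on_congestion(self, listener: Callable[[CongestionStatus], None]) -> None:
        self._listeners.append(listener)

    def report(self, status: CongestionStatus) -> None:
        # In a real stack the transport layer would trigger this itself.
        for listener in self._listeners:
            listener(status)


def choose_batch_size(status: CongestionStatus, normal: int = 64) -> int:
    """Example application reaction: halve the batch when marks are frequent."""
    return normal // 2 if status.ecn_marked_fraction > 0.5 else normal


conn = Connection()
chosen: List[int] = []
conn.on_congestion(lambda s: chosen.append(choose_batch_size(s)))
conn.report(CongestionStatus(ecn_marked_fraction=0.8, smoothed_rtt_us=900.0))
```

A callback keeps the application from polling, while a polling
interface would be an equally plausible design.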
6 Security Considerations
TBA.
7 Acknowledgements
The authors wish to thank Zhen Cao, Hui Deng and Michael Welzl for
providing comments, feedback, and improvement proposals on the
document.
8 IANA Considerations
There is no IANA action in this document.
9 References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[Pannas] Vasudevan V, Phanishayee A, Shah H, et al. Safe and
effective fine-grained TCP retransmissions for datacenter
communication[C]//ACM SIGCOMM computer communication
review. ACM, 2009, 39(4): 303-314.
[detail] Zats D, Das T, Mohan P, et al. DeTail: reducing the flow
completion time tail in datacenter networks[J]. ACM
SIGCOMM Computer Communication Review, 2012, 42(4): 139-
150.
[vegas] Lee C, Jang K, Moon S. Reviving delay-based TCP for data
centers[C]//Proceedings of the ACM SIGCOMM 2012 conference
on Applications, technologies, architectures, and
protocols for computer communication. ACM, 2012: 111-112.
[dctcp] Alizadeh M, Greenberg A, Maltz D A, et al. Data center tcp
(dctcp)[J]. ACM SIGCOMM computer communication review,
2011, 41(4): 63-74.
[hull] Alizadeh M, Kabbani A, Edsall T, et al. Less is more: trading
a little bandwidth for ultra-low latency in the data
center[C]//Proceedings of the 9th USENIX conference on
Networked Systems Design and Implementation. USENIX
Association, 2012: 19-19.
[d2tcp] Vamanan B, Hasan J, Vijaykumar T N. Deadline-aware datacenter
tcp (d2tcp)[J]. ACM SIGCOMM Computer Communication Review,
2012, 42(4): 115-126.
[d3] Wilson C, Ballani H, Karagiannis T, et al. Better never than
late: meeting deadlines in datacenter networks[C]//ACM
SIGCOMM Computer Communication Review. ACM, 2011, 41(4):
50-61.
Authors' Addresses
Lingli Deng
China Mobile
Email: denglingli@chinamobile.com
Zhen Cao
China Mobile
Email: caozhen@chinamobile.com
Hui Deng
China Mobile
Email: denghui@chinamobile.com