OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Experimental                   Brocade Communications
Expires: April 2013                                      B. Khasnabish
October 14, 2012                                        ZTE Corporation
                                                            A. Ghanwani
                                                                   Dell
Best Practices for Optimal LAG/ECMP Component Link Utilization in
Provider Backbone networks
draft-krishnan-large-flow-load-balancing-01.txt
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on April 14, 2013.
Copyright Notice
Copyright (c) 2012 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Abstract
The demands on networking infrastructure are growing exponentially,
driven by bandwidth-hungry rich media applications, inter-data-center
communications, and similar workloads. In this context, it is
important to make optimal use of the bandwidth in service provider
backbone networks, which extensively use LAG/ECMP techniques for
bandwidth scaling. This Internet-Draft describes the issues faced in
the service provider backbone in the context of LAG/ECMP and
formulates best practice recommendations for managing bandwidth
efficiently in the service provider backbone.
Table of Contents
1. Introduction
   1.1. Conventions used
2. Sub-optimal LAG/ECMP Component Link Utilization in the current
   framework
3. Best practices for optimal LAG/ECMP Component Link Utilization
   3.1. Long-lived Large Flow Identification
      3.1.1. sFlow/Netflow
      3.1.2. Automatic hardware identification
         3.1.2.1. Suggested Technique for Automatic Hardware
                  Identification
   3.2. Long-lived Large Flow Re-balancing
      3.2.1. No re-balancing of short-lived small flows
      3.2.2. Other Techniques
      3.2.3. Re-balancing of long-lived large flows and short-lived
             small flows - an example
4. Acknowledgements
5. IANA Considerations
6. Security Considerations
7. References
   7.1. Normative References
   7.2. Informative References
1. Introduction
Service provider backbone networks extensively use LAG/ECMP
techniques for bandwidth scaling. Network traffic can be broadly
categorized into two types: long-lived large flows and short-lived
small flows. Hashing techniques, which perform an approximate
distribution of these flows across the LAG/ECMP component links,
typically result in sub-optimal utilization of the component links.
Round-robin load-balancing techniques address this problem but have
the side effect of causing packet re-ordering. This Internet-Draft
recommends best practices for optimal LAG/ECMP component link
utilization while using hashing techniques. These best practices
comprise two parts: first, identification of long-lived large flows
in routers, and second, assignment of the long-lived large flows to
specific LAG/ECMP component links.
1.1. Conventions used
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
The following acronyms are used:
ECMP: Equal Cost Multi-path
LAG: Link Aggregation Group
QoS: Quality of Service
MPLS: Multiprotocol Label Switching
DoS: Denial of Service
PBR: Policy Based Routing
TCAM: Ternary Content Addressable Memory
2. Sub-optimal LAG/ECMP Component Link Utilization in the current framework
Hashing techniques, which perform an approximate distribution of
long-lived large flows and short-lived small flows across the
LAG/ECMP component links, typically result in sub-optimal utilization
of the component links. This is depicted in Figure 1 and described in
detail below.
. There is a LAG between two routers R1 and R2. This LAG has three
  component links (1), (2), (3).
. Component link (1) carries 2 short-lived small flows and 1 long-
  lived large flow, and the link capacity is optimally utilized.
. Component link (2) carries 3 short-lived small flows and no long-
  lived large flow, and the link capacity is sub-optimally utilized.
     o The absence of any long-lived large flow causes the
       component link under-utilization.
. Component link (3) carries 2 short-lived small flows and 2 long-
  lived large flows, and the link capacity is over-utilized.
     o The presence of 2 long-lived large flows causes the
       component link over-utilization.
|-----------| |-----------|
| | -> -> | |
| |=====> | |
| (1)|--/---/-|(1) |
| | | |
| | | |
| (R1) |-> -> ->| (R2) |
| (2)|--/---/-|(2) |
| | | |
| | -> -> | |
| |=====> | |
| |=====> | |
| (3)|--/---/-|(3) |
| | | |
|-----------| |-----------|
Figure 1: Long-lived Large Flows - uneven distribution across
LAG/ECMP component links
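As a non-normative illustration of how such an imbalance arises, the
following Python sketch hashes example flow 5-tuples onto the
component links; the flow tuples and rates are made-up values, not
measurements from any network.

   # Sketch of static hashing of flow tuples onto LAG/ECMP component
   # links. The flows and rates below are made-up examples.
   import zlib

   NUM_LINKS = 3

   def component_link(five_tuple):
       """Hash an IP 5-tuple onto one of the component links."""
       key = "|".join(str(field) for field in five_tuple).encode()
       return zlib.crc32(key) % NUM_LINKS

   flows = {
       # (proto, src_ip, dst_ip, src_port, dst_port): rate in Gbps
       (6, "10.0.0.1", "192.0.2.1", 5001, 80): 4.0,   # large flow
       (6, "10.0.0.2", "192.0.2.2", 5002, 80): 4.0,   # large flow
       (6, "10.0.0.3", "192.0.2.3", 5003, 80): 4.0,   # large flow
       (17, "10.0.1.1", "192.0.2.9", 7001, 53): 0.1,  # small flow
       (17, "10.0.1.2", "192.0.2.9", 7002, 53): 0.1,  # small flow
   }

   load = [0.0] * NUM_LINKS
   for five_tuple, rate in flows.items():
       load[component_link(five_tuple)] += rate

   # Because the hash ignores flow rates, two large flows may land on
   # the same component link, over-utilizing it while another link
   # remains under-utilized.
   print("per-link load (Gbps):", load)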
3. Best practices for optimal LAG/ECMP Component Link Utilization
The techniques suggested in this draft for optimal LAG/ECMP component
link utilization are meant to put forth a locally optimized solution,
i.e., local in the sense of both measuring and optimizing for
long-lived large flows at individual nodes in the network. This
approach would not yield a globally optimal placement of long-lived
large flows across several nodes in the network, which some networks
may desire or require. On the other hand, it may be adequate for some
operators for the following reasons: 1) different links in the
network experience different levels of utilization and, thus, a more
"targeted" solution is needed for those few hot-spots in the network;
2) some networks may lack end-to-end visibility.
The steps for achieving optimal LAG/ECMP component link utilization
in backbone networks are detailed below.
Step 1) Identify long-lived large flows in the egress processing
elements of routers; besides the flow parameters, this also involves
identifying the egress component link the flow is using. The
identification of long-lived large flows is explained in detail in
Section 3.1.
Step 2) The egress component links are periodically scanned for link
utilization. If the egress component link utilization exceeds a pre-
programmed threshold, an operator alert is generated. The long-lived
large flows mapping to the congested egress component link are
exported to a central management entity. The IETF could potentially
consider a standards-based activity around, say, a data model used to
move this information from the router to the central management
entity.
Step 3) On receiving the alert about the congested component link,
the operator, through a central management entity, finds the long-
lived large flows mapping to the component link and the LAG/ECMP
group to which the component link belongs.
Step 4) The operator can choose to rebalance the long-lived large
flows onto lightly loaded component links of the LAG/ECMP group.
Through a central management entity, the operator can either 1)
indicate specific long-lived large flows to rebalance, or 2) let the
router decide the best long-lived large flows to rebalance. The
central management entity conveys this information to the router. The
IETF could
potentially consider a standards-based activity around, say, a data
model used to move this information from the central management
entity to the router. The re-balancing of long-lived large flows is
explained in detail in Section 3.2.
Optionally, steps 2) to 4) could be automated, resulting in automatic
rebalancing of long-lived large flows; a sketch of such an automated
control loop is given below.
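The following is a non-normative sketch, in Python, of how steps 2)
to 4) might be automated on a single router for one LAG/ECMP group.
The helpers get_link_utilization, get_large_flows, and move_flow are
hypothetical placeholders for platform-specific mechanisms; they are
not defined by this draft.

   # Hypothetical sketch of automating steps 2) to 4) for one LAG/ECMP
   # group. The three callables are placeholders for vendor- or
   # operator-specific mechanisms; they are not defined here.

   UTIL_THRESHOLD = 0.90  # alert when a link exceeds 90% utilization
   SCAN_INTERVAL = 10     # seconds between scans of the component links

   def rebalance_group(links, get_link_utilization, get_large_flows,
                       move_flow):
       """One scan pass over the component links of a LAG/ECMP group."""
       util = {link: get_link_utilization(link) for link in links}
       for link, u in util.items():
           if u <= UTIL_THRESHOLD:
               continue
           # Step 2: congested link found; its long-lived large flows
           # would be exported to the central management entity.
           large_flows = get_large_flows(link)
           # Steps 3/4: pick the least-utilized member link as target.
           target = min(links, key=lambda l: util[l])
           if target == link or not large_flows:
               continue
           # Move one large flow per pass to avoid oscillation.
           move_flow(large_flows[0], from_link=link, to_link=target)

   # A deployment would invoke rebalance_group(...) once every
   # SCAN_INTERVAL seconds for each LAG/ECMP group.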
3.1. Long-lived Large Flow Identification
A flow (long-lived large flow or short-lived small flow) can be
defined using one of the following suggested formats:
. IP 5 tuple: IP Protocol, IP source address, IP destination
address, TCP/UDP source port, TCP/UDP destination port
. IP 3 tuple: IP Protocol, IP source address, IP destination
address
. MPLS Labels
. VXLAN, NVGRE
. Other formats
The best practices described in this document are agnostic to the
format of the flow.
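As a non-normative illustration, a flow key in either of the IP tuple
formats above could be represented as follows in Python; the class
and field names are arbitrary examples, not identifiers defined by
this draft.

   # Illustrative flow-key definitions for the IP 5-tuple and IP
   # 3-tuple formats. The field names are arbitrary examples.
   from dataclasses import dataclass

   @dataclass(frozen=True)
   class IPFiveTuple:
       protocol: int   # IP protocol number, e.g. 6 for TCP, 17 for UDP
       src_ip: str
       dst_ip: str
       src_port: int   # TCP/UDP source port
       dst_port: int   # TCP/UDP destination port

   @dataclass(frozen=True)
   class IPThreeTuple:
       protocol: int
       src_ip: str
       dst_ip: str

   # Both classes are hashable, so either can be used directly as a
   # key into per-flow counters, whichever format an implementation
   # picks.
   flow = IPFiveTuple(6, "10.0.0.1", "192.0.2.1", 5001, 80)
   counters = {flow: 0}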
3.1.1. sFlow/Netflow
sFlow/Netflow sampling is enabled on all the egress ports of the
routers. Through sFlow processing in an sFlow collector, an
approximate indication of the large flows mapping to each of the
component links in each LAG/ECMP group is available. The advantages
and disadvantages of sFlow/Netflow are detailed below.
Advantages of sFlow/Netflow:
. Supported in most routers.
. Requires minimal router resources.
Disadvantages of sFlow/Netflow:
. Approximate identification of long-lived large flows.
. Non-real-time identification of long-lived large flows, based on
  historical analysis.
The time taken to determine a candidate long-lived large flow depends
on the number of sFlow samples being generated and the processing
power of the external sFlow collector; this is under further study.
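As a rough, non-normative sketch, a collector could estimate per-flow
rates from packet samples as follows. The record format, sampling
rate, and threshold used here are illustrative assumptions and are
not specified by sFlow, Netflow, or this draft.

   # Rough sketch of large-flow detection from packet samples at an
   # external collector. The sampling rate, interval and threshold are
   # illustrative assumptions only.
   from collections import defaultdict

   SAMPLING_RATE = 4096     # 1-in-N packet sampling on the router
   INTERVAL_SECONDS = 60    # measurement interval
   LARGE_FLOW_BPS = 10**9   # flag flows estimated above ~1 Gbps

   def find_large_flows(samples):
       """samples: iterable of (flow_key, sampled_packet_len_bytes)."""
       est_bytes = defaultdict(int)
       for flow_key, length in samples:
           # Scale each sampled packet by the sampling rate to estimate
           # the bytes the flow actually sent during the interval.
           est_bytes[flow_key] += length * SAMPLING_RATE
       return [
           flow for flow, nbytes in est_bytes.items()
           if nbytes * 8 / INTERVAL_SECONDS >= LARGE_FLOW_BPS
       ]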
3.1.2. Automatic hardware identification
Implementations may choose to implement automatic identification of
long-lived large flows in hardware in the egress processing elements
of routers. The characteristics of such an implementation would be:
. Inline solution.
. Minimal system resources.
. Maintain line-rate performance.
. Perform accounting of long-lived large flows with a high degree
  of accuracy.
Using automatic hardware identification of long-lived large flows, an
accurate indication of large flows mapping to each of the component
links in a LAG/ECMP group is available. The advantages and
disadvantages of automatic hardware identification are detailed
below.
Advantages of Automatic Hardware Identification:
. Accurate identification of long-lived large flows.
. Real-time identification of long-lived large flows.
Disadvantages of Automatic Hardware Identification:
. Not supported in many routers.
The measurement interval for determining a candidate long-lived large
flow and the minimum bandwidth of the long-lived large flow would be
programmable parameters in the router; this is under further study.
The implementation of automatic hardware identification of long-lived
large flows is vendor dependent. Below is a suggested technique.
3.1.2.1. Suggested Technique for Automatic Hardware Identification
There are multiple hash tables, each with a different hash function.
Each hash table entry has an associated counter. On packet arrival,
the flow is looked up in parallel in all the hash tables and the
corresponding counter in each table is incremented. If the counter
exceeds a programmed threshold within a given time interval in all of
the hash table entries, a candidate long-lived large flow is learned
and programmed into a hardware table resource such as a TCAM. There
may be some false positives due to multiple short-lived small flows
masquerading as a long-lived large flow; the number of false
positives is reduced by the parallel hashing. A non-normative
software sketch of this technique is given below.
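The following Python sketch models the multi-hash-table technique in
software, purely for illustration; a real implementation would use
hardware counters and a TCAM, and the table sizes, threshold, and
hash seeding shown here are assumptions.

   # Non-normative software model of the parallel hash-table counting
   # technique. Table sizes, threshold and seeds are illustrative.
   import zlib

   NUM_TABLES = 4          # parallel hash tables, each hashed differently
   TABLE_SIZE = 1024       # counters per table
   THRESHOLD = 10_000_000  # bytes per interval to qualify as "large"

   tables = [[0] * TABLE_SIZE for _ in range(NUM_TABLES)]

   def _index(flow_key, table_id):
       # A different seed per table approximates independent hash
       # functions.
       return zlib.crc32(f"{table_id}|{flow_key}".encode()) % TABLE_SIZE

   def update(flow_key, packet_len):
       """Count a packet; return True if the flow becomes a candidate."""
       counts = []
       for t in range(NUM_TABLES):
           idx = _index(flow_key, t)
           tables[t][idx] += packet_len
           counts.append(tables[t][idx])
       # Only flows whose counters exceed the threshold in ALL tables
       # are reported, so a collision in a single table does not by
       # itself produce a false positive. A real device would install
       # the candidate in a TCAM-like table at this point.
       return min(counts) >= THRESHOLD

   def reset_interval():
       """Clear all counters at the end of each measurement interval."""
       for t in range(NUM_TABLES):
           tables[t] = [0] * TABLE_SIZE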
3.2. Long-lived Large Flow Re-balancing
Suggested techniques for long-lived large flow re-balancing are
described below. Router vendors should ideally implement all of these
techniques so that the operator can choose the right technique for
the application at hand. Perfect re-balancing of long-lived large
flows may not be possible, since flows can arrive and depart at
different times.
3.2.1. No re-balancing of short-lived small flows
In the LAG/ECMP group, choose the other member component links with
the least average port utilization. Move the long-lived large flow(s)
from the heavily loaded component link to those member component
links using a Policy Based Routing (PBR) rule in the ingress
processing element(s) of the routers; a non-normative sketch of this
selection logic appears below, after the lists of benefits and
disadvantages.
The benefits of this algorithm are:
. Short-lived small flows are not subjected to flow re-ordering.
. Only certain long-lived large flows are subjected to flow re-
  ordering.
The disadvantages of this algorithm are:
. There may be a Quality of Service (QoS) impact on the existing
  short-lived small flows.
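The following is a minimal sketch of the selection-and-move step,
assuming hypothetical helpers for reading link utilization and
installing a PBR rule (get_utilization, install_pbr_rule); it is not
a vendor implementation.

   # Minimal sketch of moving long-lived large flows off a congested
   # component link. get_utilization and install_pbr_rule are
   # hypothetical placeholders for platform-specific mechanisms.
   def rebalance_large_flows(group_links, congested_link, large_flows,
                             get_utilization, install_pbr_rule):
       """large_flows: list of (flow_key, estimated_rate) tuples."""
       candidates = [l for l in group_links if l != congested_link]
       if not candidates:
           return
       # Track estimated utilization so successive moves spread load.
       est_util = {l: get_utilization(l) for l in candidates}
       for flow_key, rate in large_flows:
           target = min(est_util, key=est_util.get)
           # The PBR rule pins this flow to the chosen component link
           # in the ingress processing element(s); short-lived small
           # flows keep using the normal hash-based selection and are
           # not re-ordered.
           install_pbr_rule(flow_key, egress_link=target)
           est_util[target] += rate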
3.2.2. Other Techniques
It is possible to use other algorithms, for example, removing a
member component link from the LAG/ECMP group and using it only for
long-lived large flows.
3.2.3. Re-balancing of long-lived large flows and short-lived small
flows - an example
Optimal LAG/ECMP component link utilization for the use case in
Figure 1 is depicted below in Figure 2. This is achieved as follows:
Step 1) Long-lived large flows are identified in the egress
processing elements of router R1 using the techniques suggested in
Section 3.1.
Step 2) An operator alert is generated indicating that egress
component link (3) in router R1 is congested. The long-lived large
flows mapping to the congested egress component link are exported
from the router to a central management entity.
Step 3) On receiving the alert about the congested component link
(3), the operator, through a central management entity, finds the
long-lived large flows mapping to the component link and the LAG/ECMP
group to which the component link belongs.
Step 4) The operator, through a central management entity, can choose
to rebalance the long-lived large flows onto lightly loaded component
links of the LAG/ECMP group using the techniques suggested in Section
3.2. In the router, a long-lived large flow is moved from component
link (3) to component link (2) using a PBR rule in the ingress
processing element(s) of the router.
A detailed description of Figure 2 follows:
. There is a LAG between two routers R1 and R2. This LAG has three
  component links (1), (2), (3).
. Component link (1) carries 2 short-lived small flows and 1 long-
  lived large flow, and the link capacity is optimally utilized.
. Component link (2) carries 3 short-lived small flows and 1 long-
  lived large flow, and the link capacity is optimally utilized.
. Component link (3) carries 2 short-lived small flows and 1 long-
  lived large flow, and the link capacity is optimally utilized.
|-----------| |-----------|
| | -> -> | |
| |=====> | |
| (1)|--/---/-|(1) |
| | | |
| |=====> | |
| (R1) |-> -> ->| (R2) |
| (2)|--/---/-|(2) |
| | | |
| | | |
| | -> -> | |
| |=====> | |
| (3)|--/---/-|(3) |
| | | |
|-----------| |-----------|
Figure 2: Long-lived Large Flows - even distribution across
LAG/ECMP component links
4. Acknowledgements
The authors would like to thank Shane Amante for all the support and
valuable input. The authors would also like to thank Fred Baker for
his input.
5. IANA Considerations
This memo includes no request to IANA.
6. Security Considerations
This document does not directly impact the security of the Internet
infrastructure or its applications. In fact, the techniques described
here could help mitigate a DoS attack pattern that causes a hash
imbalance, resulting in long-lived large flows heavily overloading
certain LAG/ECMP component links.
7. References
7.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
          Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2234] Crocker, D. and Overell, P. (Editors), "Augmented BNF for
          Syntax Specifications: ABNF", RFC 2234, November 1997.
7.2. Informative References
[I-D.ietf-rtgwg-cl-requirement] Villamizar, C., et al., "Requirements
          for MPLS Over a Composite Link", Work in Progress, June
          2012.
[I-D.ietf-mpls-entropy-label] Kompella, K., et al., "The Use of
          Entropy Labels in MPLS Forwarding", Work in Progress, July
          2012.
[I-D.kj-nvo3-pion-architecture] Jin, L. and Khasnabish, B.,
          "Architecture of PSN Independent Overlay Network (PION)",
          Work in Progress, May 2012.
[RFC2991] Thaler, D. and Hopps, C., "Multipath Issues in Unicast and
          Multicast Next-Hop Selection", RFC 2991, November 2000.
[RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
          Algorithm", RFC 2992, November 2000.
[RFC4814] Newman, D. and Player, T., "Hash and Stuffing: Overlooked
          Factors in Network Device Benchmarking", RFC 4814, March
          2007.
Authors' Addresses
Ram Krishnan
Brocade Communications
San Jose, 95134, USA
Phone: +001-408-406-7890
Email: ramk@brocade.com
Sanjay Khanna
Brocade Communications
San Jose, 95134, USA
Phone: +001-408-333-4850
Email: skhanna@brocade.com
Anoop Ghanwani
Dell
San Jose, CA 95134
Phone: (408) 571-3228
Email: anoop@alumni.duke.edu
Bhumip Khasnabish
ZTE Corporation
New Jersey, 07960, USA
Phone: +001-781-752-8003
Email: bhumip.khasnabish@zteusa.com