OPSAWG                                                      R. Krishnan
Internet Draft                                                S. Khanna
Intended status: Experimental                    Brocade Communications
Expires: April 2013                                       B. Khasnabish
October 14, 2012                                        ZTE Corporation
                                                            A. Ghanwani
                                                                   Dell

   Best Practices for Optimal LAG/ECMP Component Link Utilization in
   Provider Backbone networks

                  draft-krishnan-large-flow-load-balancing-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, and it may not be
   published except as an Internet-Draft.

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.





Krishnan                Expires April 14, 2013                 [Page 1]

Internet-Draft   Best Practices for Optimal LAG/ECMP Component Link
          Utilization in Provider Backbone networks        October 2012


   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 14, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Abstract

   The demands on networking infrastructure are growing exponentially,
   driven by bandwidth-hungry rich media applications,
   inter-data-center communications, and similar workloads. In this
   context, it is important to make optimal use of the bandwidth in
   service provider backbone networks, which rely extensively on
   LAG/ECMP techniques for bandwidth scaling. This Internet-Draft
   describes the issues faced in the service provider backbone in the
   context of LAG/ECMP and formulates best practice recommendations for
   managing bandwidth efficiently in the service provider backbone.

Table of Contents


   1. Introduction
      1.1. Conventions used
   2. Sub-optimal LAG/ECMP Component Link Utilization in the current
      framework
   3. Best practices for optimal LAG/ECMP Component Link Utilization
      3.1. Long-lived Large Flow Identification
         3.1.1. sFlow/Netflow
         3.1.2. Automatic hardware identification
            3.1.2.1. Suggested Technique for Automatic Hardware
                     Identification
      3.2. Long-lived Large Flow Re-balancing
         3.2.1. No re-balancing of short-lived small flows
         3.2.2. Other Techniques
         3.2.3. Re-balancing of long-lived large flows and short-lived
                small flows - an example
   4. Acknowledgements
   5. IANA Considerations
   6. Security Considerations
   7. References
      7.1. Normative References
      7.2. Informative References

1. Introduction

   Service provider backbone networks extensively use LAG/ECMP
   techniques for bandwidth scaling. Network traffic can be
   predominantly categorized into two traffic types: long-lived large
   flows and short-lived small flows. Hashing techniques, which perform
   an approximate distribution of these flows across the LAG/ECMP
   component links, typically result in sub-optimal utilization of the
   component links. Round-robin load-balancing techniques address this
   problem but have the side effect of causing packet re-ordering.
   This Internet-Draft recommends best practices for optimal LAG/ECMP
   component link utilization while using hashing techniques. These
   best practices comprise two steps: first, identifying long-lived
   large flows in routers; second, assigning the long-lived large flows
   to specific LAG/ECMP component links.

1.1. Conventions used

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].


   The following acronyms are used:


   ECMP: Equal Cost Multi-path

   LAG: Link Aggregation Group

   QoS: Quality of Service

   MPLS: Multiprotocol Label Switching

   DoS: Denial of Service

2. Sub-optimal LAG/ECMP Component Link Utilization in the current framework

   Hashing techniques, which perform an approximate distribution of
   long-lived large flows and short-lived small flows across the
   LAG/ECMP component links, typically result in sub-optimal
   utilization of the component links. This is depicted in Figure 1
   and described in detail below.

     .  There is a LAG between 2 routers R1 and R2. This LAG has 3
        component links (1), (2), (3)

     .  Component link (1) has 2 short-lived small flows and 1 long-
        lived large flow and the link capacity is optimally utilized

     .  Component link (2) has 3 short-lived small flows and no long-
        lived large flow and the link capacity is sub-optimally utilized

          o The absence of any long-lived large flow causes the
             component link under-utilization

     .  Component link (3) has 2 short-lived small flows and 2 long-
        lived large flows and the link capacity is over-utilized.

          o The presence of 2 long-lived large flows causes the
             component link over-utilization
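
   The imbalance described above follows from the fact that a hash
   selects a component link from the flow key alone and never looks at
   the flow's rate. The following is a minimal sketch of that
   behaviour, not a description of any router's hash. The function
   names pick_link and link_loads, the use of CRC32 as the hash, and
   the traffic mix (7 small flows at 5 Mb/s, 3 large flows at 4000
   Mb/s) are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def pick_link(flow_key, num_links):
    """Map a flow key to a component link index with a stateless hash.

    CRC32 is only a stand-in here; real devices use vendor-specific hashes.
    """
    digest = zlib.crc32(repr(flow_key).encode("utf-8"))
    return digest % num_links


def link_loads(flows, num_links):
    """Sum the offered rate (Mb/s) that hashing places on each link."""
    loads = defaultdict(float)
    for flow_key, rate_mbps in flows.items():
        loads[pick_link(flow_key, num_links)] += rate_mbps
    return [loads[i] for i in range(num_links)]


# Hypothetical traffic mix: keys are (src, dst, protocol, sport, dport).
flows = {
    ("10.0.0.%d" % i, "10.1.0.1", 6, 1000 + i, 80): 5.0 for i in range(7)
}
flows.update({
    ("10.2.0.%d" % i, "10.3.0.1", 6, 2000 + i, 443): 4000.0 for i in range(3)
})

loads = link_loads(flows, num_links=3)
for index, load in enumerate(loads, start=1):
    print("component link (%d): %.0f Mb/s" % (index, load))
```

   Because the rate plays no part in pick_link, nothing prevents two
   or more large flows from landing on the same component link while
   another link carries none.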

                  |-----------|        |-----------|
                  |           | -> ->  |           |
                  |           |=====>  |           |
                  |        (1)|--/---/-|(1)        |
                  |           |        |           |
                  |           |        |           |
                  |   (R1)    |-> -> ->|   (R2)    |
                  |        (2)|--/---/-|(2)        |
                  |           |        |           |
                  |           | -> ->  |           |
                  |           |=====>  |           |
                  |           |=====>  |           |
                  |        (3)|--/---/-|(3)        |
                  |           |        |           |
                  |-----------|        |-----------|

       Figure 1: Long-lived Large Flows - uneven distribution across
                         LAG/ECMP component links

3. Best practices for optimal LAG/ECMP Component Link Utilization

   The techniques suggested in this draft for optimal LAG/ECMP
   component link utilization are meant to put forth a locally
   optimized solution, i.e., local in the sense of both measuring and
   optimizing for long-lived large flows at individual nodes in the
   network. This approach does not yield a globally optimal placement
   of a long-lived large flow across several nodes in the network,
   which some networks may desire or require. On the other hand, it may
   be adequate for some operators for the following reasons: 1)
   different links in the network experience different levels of
   utilization and, thus, a more "targeted" solution is needed for
   those few hot-spots in the network; 2) some networks may lack
   end-to-end visibility.

   The steps in achieving optimal LAG/ECMP component link utilization
   in backbone networks are detailed below.

   Step 1) Long-lived large flows are identified in the egress
   processing elements in routers; besides the flow parameters, this
   also involves identifying the egress component link the flow is
   using. The identification of long-lived large flows is explained in
   detail in Section 3.1.

   Step 2) The egress component links are periodically scanned for link
   utilization. If the egress component link utilization exceeds a pre-
   programmed threshold, an operator alert is generated. The long-lived
   large flows mapping to the congested egress component link are
   exported to a central management entity. IETF could potentially
   consider a standards-based activity around, say, a data-model used to
   move this information from the router to the central management
   entity.

   Step 3) On receiving the alert about the congested component link,
   the operator, through a central management entity, finds out which
   long-lived large flows map to the component link and which LAG/ECMP
   group the component link belongs to.

   Step 4) The operator can choose to rebalance the long-lived large
   flows onto lightly loaded component links of the LAG/ECMP group. The
   operator, through a central management entity, can either 1)
   indicate specific long-lived large flows to rebalance or 2) let the
   router decide the best long-lived large flows to rebalance. The
   central management entity conveys this information to the router.
   IETF could potentially consider a standards-based activity around,
   say, a data-model used to move this information from the central
   management entity to the router. The re-balancing of long-lived
   large flows is explained in detail in Section 3.2.

   Optionally, steps 2) to 4) could be automated, resulting in
   automatic rebalancing of long-lived large flows.
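
   The periodic scan in Step 2 could be sketched as follows. This is an
   illustration under stated assumptions, not a definitive
   implementation: the ComponentLink class, the scan_links function,
   the alert record layout, and all rates and the 0.8 threshold are
   hypothetical and do not come from any router or management system.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentLink:
    link_id: int
    capacity_mbps: float
    # Large flows currently mapped to this link: flow id -> rate in Mb/s.
    large_flows: dict = field(default_factory=dict)
    # Aggregate rate of all short-lived small flows on this link.
    small_flow_load_mbps: float = 0.0

    def utilization(self):
        """Return the link load as a fraction of link capacity."""
        used = self.small_flow_load_mbps + sum(self.large_flows.values())
        return used / self.capacity_mbps


def scan_links(links, threshold):
    """Return one alert record for every link above the threshold.

    Each record carries the large flows mapped to the congested link,
    which is the information exported to the management entity.
    """
    alerts = []
    for link in links:
        if link.utilization() > threshold:
            alerts.append({
                "link_id": link.link_id,
                "utilization": link.utilization(),
                "large_flows": dict(link.large_flows),
            })
    return alerts


# Hypothetical LAG with three 10 Gb/s component links.
links = [
    ComponentLink(1, 10000.0, {"f1": 4000.0}, 500.0),
    ComponentLink(2, 10000.0, {}, 700.0),
    ComponentLink(3, 10000.0, {"f2": 5000.0, "f3": 4500.0}, 600.0),
]
alerts = scan_links(links, threshold=0.8)
```

   With these assumed numbers only component link (3) is reported,
   together with the two large flows mapped to it.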

3.1. Long-lived Large Flow Identification

   A flow (long-lived large flow or short-lived small flow) can be
   defined using one of the following suggested formats:

     .  IP 5 tuple: IP Protocol, IP source address, IP destination
        address, TCP/UDP source port, TCP/UDP destination port

     .  IP 3 tuple: IP Protocol,  IP source address, IP destination
        address

     .  MPLS Labels

     .  VXLAN, NVGRE

     .  Other formats

   The best practices described in this document are agnostic to the
   format of the flow.
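
   As an illustration of the first two formats, a flow key can be
   modelled as an immutable tuple that is usable as a dictionary key.
   The type names FiveTuple and ThreeTuple, the helper to_three_tuple,
   and the example addresses are assumptions made for this sketch.

```python
from typing import NamedTuple


class FiveTuple(NamedTuple):
    protocol: int
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int


class ThreeTuple(NamedTuple):
    protocol: int
    src_addr: str
    dst_addr: str


def to_three_tuple(key: FiveTuple) -> ThreeTuple:
    """Coarsen a 5-tuple key by dropping the transport ports."""
    return ThreeTuple(key.protocol, key.src_addr, key.dst_addr)


# Two TCP connections between the same hosts (documentation addresses).
a = FiveTuple(6, "192.0.2.1", "198.51.100.7", 49152, 443)
b = FiveTuple(6, "192.0.2.1", "198.51.100.7", 49153, 443)
```

   Under the 5-tuple format, a and b are two distinct flows; under the
   3-tuple format they are accounted as a single flow, so the choice of
   format determines the granularity at which flows are measured and
   moved.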

3.1.1. sFlow/Netflow

   Enable sFlow/Netflow sampling on all the egress ports in the routers.
   Through sFlow processing in an sFlow collector, an approximate
   indication of the large flows mapping to each of the component links
   in each LAG/ECMP group is available. The advantages and disadvantages
   of sFlow/Netflow are detailed below.

   Advantages of sFlow/Netflow

     .  Supported in most routers

     .  Minimal router resources

   Disadvantages of sFlow/Netflow

     .  Approximate identification of long-lived large flows

     .  Non-real-time identification of long-lived large flows, based on
        historical analysis

   The time taken to determine a candidate long-lived large flow
   depends on the number of sFlow samples being generated and the
   processing power of the external sFlow collector; this is under
   further study.
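
   The approximation involved can be illustrated with a collector-side
   estimate. The sketch below assumes 1-in-N packet sampling and
   scales the sampled bytes of each flow by N; the function names, the
   sample format, and all numbers are hypothetical, and this is not
   the processing performed by any particular sFlow collector.

```python
from collections import defaultdict


def estimate_rates_bps(samples, sampling_rate, interval_s):
    """Scale sampled bytes per flow up to an estimated rate in bit/s.

    samples: iterable of (flow_key, packet_length_bytes) pairs collected
             during one measurement interval.
    sampling_rate: N, for 1-in-N packet sampling.
    """
    sampled_bytes = defaultdict(int)
    for flow_key, length in samples:
        sampled_bytes[flow_key] += length
    return {
        key: total * sampling_rate * 8 / interval_s
        for key, total in sampled_bytes.items()
    }


def large_flow_candidates(rates_bps, min_rate_bps):
    """Return the flows whose estimated rate reaches the minimum rate."""
    return sorted(k for k, r in rates_bps.items() if r >= min_rate_bps)


# Hypothetical samples seen in a 10 second interval at 1-in-1000 sampling.
samples = [("A", 1500)] * 40 + [("B", 1500)] * 2 + [("C", 64)]
rates = estimate_rates_bps(samples, sampling_rate=1000, interval_s=10)
candidates = large_flow_candidates(rates, min_rate_bps=10_000_000)
```

   Flow A is estimated at 48 Mb/s from 40 samples, whereas the
   estimates for B and C rest on two samples and one sample
   respectively, which is why identification by sampling is only
   approximate for flows that contribute few samples.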

3.1.2. Automatic hardware identification

   Implementations may choose to implement automatic identification of
   long-lived large flows in hardware in egress processing elements of
   routers. The characteristics of such an implementation would be

     .  Inline solution

     .  Minimal system resources

     .  Maintain line-rate performance

     .  Perform accounting of long-lived large flows with a high degree
        of accuracy

   Using automatic hardware identification of long-lived large flows, an
   accurate indication of large flows mapping to each of the component
   links in a LAG/ECMP group is available. The advantages and
   disadvantages of automatic hardware identification are detailed
   below.

   Advantages of Automatic Hardware Identification

     .  Accurate identification of long-lived large flows

     .  Real-time identification of long-lived large flows

   Disadvantages of Automatic Hardware Identification

     .  Not supported in many routers

   The measurement interval for determining a candidate long-lived large
   flow and the minimum bandwidth of the long-lived large flow would be
   programmable parameters in the router; this is under further study.

   The implementation of automatic hardware identification of long-lived
   large flows is vendor-dependent. A suggested technique follows.

3.1.2.1. Suggested Technique for Automatic Hardware Identification

   There are multiple hash tables, each with a different hash function.
   Each hash table entry has an associated counter. On packet arrival,
   the flow is looked up in parallel in all the hash tables and the
   corresponding counter in each table is incremented. If the counters
   in all the matching hash table entries exceed a programmed threshold
   within a given time interval, a candidate long-lived large flow is
   learnt and programmed in a hardware table resource such as a TCAM.
   There may be some false positives due to multiple short-lived small
   flows masquerading as a long-lived large flow; the number of false
   positives is reduced by parallel hashing.
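
   A software model of this technique is sketched below. It is an
   illustration only: the class name ParallelHashDetector, the use of
   salted SHA-256 to obtain a different hash function per table, the
   choice to count bytes rather than packets, and the use of a Python
   set in place of the TCAM-like table are all assumptions, and a
   hardware implementation would differ.

```python
import hashlib


class ParallelHashDetector:
    def __init__(self, num_tables=4, table_size=1024,
                 threshold_bytes=1_000_000):
        self.num_tables = num_tables
        self.table_size = table_size
        self.threshold_bytes = threshold_bytes
        self.tables = [[0] * table_size for _ in range(num_tables)]
        self.large_flows = set()  # stands in for the TCAM-like flow table

    def _index(self, table_no, flow_key):
        # A different salt per table gives each table its own hash function.
        data = ("%d|%r" % (table_no, flow_key)).encode("utf-8")
        digest = hashlib.sha256(data).digest()
        return int.from_bytes(digest[:8], "big") % self.table_size

    def observe(self, flow_key, packet_bytes):
        """Count one packet; return True once the flow is a candidate."""
        if flow_key in self.large_flows:
            return True
        over = True
        for t in range(self.num_tables):
            i = self._index(t, flow_key)
            self.tables[t][i] += packet_bytes
            # The flow qualifies only if every table's counter is over.
            if self.tables[t][i] <= self.threshold_bytes:
                over = False
        if over:
            self.large_flows.add(flow_key)
        return over

    def start_interval(self):
        """Clear the counters at the start of each measurement interval."""
        self.tables = [[0] * self.table_size
                       for _ in range(self.num_tables)]


detector = ParallelHashDetector(num_tables=3, table_size=64,
                                threshold_bytes=10_000)
results = [detector.observe("big", 1500) for _ in range(10)]
```

   A small flow is reported falsely only if its entries collide with
   heavily loaded entries in every table at once, which becomes less
   likely as the number of tables grows.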

3.2. Long-lived Large Flow Re-balancing

   Below are suggested techniques for long-lived large flow
   re-balancing. Our suggestion is that router vendors implement all of
   these techniques and that the operator choose the appropriate
   technique based on application needs. Perfect re-balancing of
   long-lived large flows may not be possible since flows can arrive
   and depart at different times.

3.2.1. No re-balancing of short-lived small flows

   In the LAG/ECMP group, choose the other member component links with
   the least average port utilization. Move the long-lived large flow(s)
   from the heavily loaded component link to the new member component
   links using a policy-based routing (PBR) rule in the ingress
   processing element(s) in the routers.
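
   The selection logic described above could be sketched as follows.
   The function name rebalance, its data layout, the 0.8 threshold,
   and the rule of trying the largest flow first are assumptions made
   for illustration; the returned moves merely stand in for the PBR
   rules that would be installed.

```python
def rebalance(link_load, large_flows, congested, threshold):
    """Move large flows off `congested` until it is at or below threshold.

    link_load: dict link_id -> utilization as a fraction of capacity
    large_flows: dict flow_id -> (link_id, flow's share of capacity)
    Returns a list of (flow_id, old_link, new_link) moves.
    """
    moves = []
    others = [link for link in link_load if link != congested]
    if not others:
        return moves  # a single-link group offers no alternative
    # Try the largest flows on the congested link first.
    candidates = sorted(
        (f for f, (link, _) in large_flows.items() if link == congested),
        key=lambda f: large_flows[f][1],
        reverse=True,
    )
    for flow_id in candidates:
        if link_load[congested] <= threshold:
            break
        share = large_flows[flow_id][1]
        # Pick the member link with the least utilization.
        target = min(others, key=link_load.get)
        if link_load[target] + share > threshold:
            continue  # moving this flow would only shift the congestion
        link_load[congested] -= share
        link_load[target] += share
        large_flows[flow_id] = (target, share)
        moves.append((flow_id, congested, target))
    return moves


# Hypothetical utilization figures for a three-link LAG.
link_load = {1: 0.45, 2: 0.07, 3: 1.01}
large_flows = {"f1": (1, 0.40), "f2": (3, 0.50), "f3": (3, 0.45)}
moves = rebalance(link_load, large_flows, congested=3, threshold=0.8)
```

   With these assumed numbers a single move, flow f2 from component
   link (3) to component link (2), brings every link under the
   threshold, and no short-lived small flow is touched.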

   The benefits of this algorithm are

     .  Short-lived small flows are not subjected to flow re-ordering

     .  Only certain long-lived large flows are subjected to flow re-
        ordering

   The disadvantage of this algorithm is

     .  There may be a Quality of Service (QoS) impact on the existing
        short-lived small flows

3.2.2. Other Techniques

   It is possible to use other algorithms, for example, removing a
   member component link from the LAG/ECMP group and using it only for
   long-lived large flows.

3.2.3. Re-balancing of long-lived large flows and short-lived small
   flows - an example

   Optimal LAG/ECMP component link utilization for the use case in
   Figure 1 is depicted below in Figure 2. This is achieved as follows.

   Step 1) Long-lived large flows are identified in the egress
   processing elements of router R1 using the techniques suggested in
   Section 3.1.

   Step 2) An operator alert is generated indicating that egress
   component link (3) in router R1 is congested. The long-lived large
   flows mapping to the congested egress component link are exported
   from the router to a central management entity.

   Step 3) On receiving the alert about the congested component link
   (3), the operator, through a central management entity, finds out
   which long-lived large flows map to the component link and which
   LAG/ECMP group the component link belongs to.

   Step 4) The operator, through a central management entity, can choose
   to rebalance the long-lived large flows onto lightly loaded component
   links of the LAG/ECMP group using the techniques suggested in Section
   3.2. In the router, a long-lived large flow is moved from component
   link (3) to component link (2) by using a PBR rule in the ingress
   processing element(s) in the routers.

   A detailed description of Figure 2 follows.

     .  There is a LAG between 2 routers R1 and R2. This LAG has 3
        component links (1), (2), (3)

     .  Component link (1) has 2 short-lived small flows and 1 long-
        lived large flow and the link capacity is optimally utilized

     .  Component link (2) has 3 short-lived small flows and 1 long-
        lived large flow and the link capacity is optimally utilized

     .  Component link (3) has 2 short-lived small flows and 1 long-
        lived large flow and the link capacity is optimally utilized
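
   The change from the placement of Figure 1 to that of Figure 2 can be
   checked with a small worked example. The dictionary layout and the
   helper move_large_flow are hypothetical; the flow counts are those
   given in the descriptions of the two figures.

```python
# Flow counts per component link as described for Figure 1.
figure1 = {
    1: {"small": 2, "large": 1},
    2: {"small": 3, "large": 0},
    3: {"small": 2, "large": 2},
}


def move_large_flow(placement, src, dst):
    """Return a new placement with one large flow moved from src to dst."""
    if placement[src]["large"] < 1:
        raise ValueError("no large flow on source link")
    # Copy the per-link counts so the input placement is left unchanged.
    result = {link: dict(counts) for link, counts in placement.items()}
    result[src]["large"] -= 1
    result[dst]["large"] += 1
    return result


# Step 4: move one large flow from component link (3) to link (2).
figure2 = move_large_flow(figure1, src=3, dst=2)
```

   After the move each component link carries exactly one long-lived
   large flow, and the short-lived small flow counts are unchanged.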


                  |-----------|        |-----------|
                  |           | -> ->  |           |
                  |           |=====>  |           |
                  |        (1)|--/---/-|(1)        |
                  |           |        |           |
                  |           |=====>  |           |
                  |   (R1)    |-> -> ->|   (R2)    |
                  |        (2)|--/---/-|(2)        |
                  |           |        |           |
                  |           |        |           |
                  |           | -> ->  |           |
                  |           |=====>  |           |
                  |        (3)|--/---/-|(3)        |
                  |           |        |           |
                  |-----------|        |-----------|

        Figure 2: Long-lived Large Flows - even distribution across
                         LAG/ECMP component links

4. Acknowledgements

   The authors would like to thank Shane Amante for all the support and
   valuable input. The authors would also like to thank Fred Baker for
   his input.

5. IANA Considerations

   This memo includes no request to IANA.



6. Security Considerations

   This document does not directly impact the security of the Internet
   infrastructure or its applications. In fact, it could help in the
   case of a DoS attack pattern that causes a hash imbalance, resulting
   in heavy overloading of certain LAG/ECMP component links with
   long-lived large flows.

7. References

7.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2234] Crocker, D. and Overell, P.(Editors), "Augmented BNF for
             Syntax Specifications: ABNF", RFC 2234, Internet Mail
             Consortium and Demon Internet Ltd., November 1997.

7.2. Informative References

   [I-D.ietf-rtgwg-cl-requirement] C. Villamizar et al., "Requirements
   for MPLS Over a Composite Link", June 2012.

   [I-D.ietf-mpls-entropy-label] K. Kompella et al., "The Use of
   Entropy Labels in MPLS Forwarding", July 2012.

   [I-D.kj-nvo3-pion-architecture] L. Jin and B. Khasnabish,
   "Architecture of PSN Independent Overlay Network (PION)", May 2012.

   [RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and
   Multicast Next-Hop Selection", RFC 2991, November 2000.

   [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path
   Algorithm", RFC 2992, November 2000.

   [RFC4814] Newman, D. and T. Player, "Hash and Stuffing: Overlooked
   Factors in Network Device Benchmarking", RFC 4814, March 2007.



Authors' Addresses

   Ram Krishnan
   Brocade Communications
   San Jose, 95134, USA

   Phone: +001-408-406-7890
   Email: ramk@brocade.com


   Sanjay Khanna
   Brocade Communications
   San Jose, 95134, USA

   Phone: +001-408-333-4850
   Email: skhanna@brocade.com


   Anoop Ghanwani
   Dell
   San Jose, CA 95134

   Phone: (408) 571-3228
   Email: anoop@alumni.duke.edu



   Bhumip Khasnabish
   ZTE Corporation
   New Jersey, 07960, USA

   Phone: +001-781-752-8003
   Email: bhumip.khasnabish@zteusa.com