Network Working Group                                L. Dunbar
Internet Draft                                          Huawei
Intended status: Informational                       W. Kumari
Expires: November 2014                                 Google
                                                Igor Gashinsky
                                                        Yahoo
                                                May 7, 2014
         Practices for scaling ARP and ND for large data centers
              draft-dunbar-armd-arp-nd-scaling-practices-08
Status of this Memo
   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.
   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."
   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.
   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.
   This Internet-Draft will expire on November 7, 2014.
Copyright Notice
   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
Dunbar-Kumari-Gashinsky  Expires November 7, 2014 [Page 1]
Internet-Draft   Practices to scale ARP/ND in large DC
Abstract
   This draft documents some operational practices that allow ARP/ND
   to scale in data center environments.
Table of Contents
   1. Introduction
   2. Terminology
   3. Common DC network Designs
   4. Layer 3 to Access Switches
   5. Layer 2 practices to scale ARP/ND
      5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary
      routers
         5.1.1. Communicating with a peer in a different subnet
         5.1.2. L2/L3 boundary router processing of inbound traffic
         5.1.3. Inter subnets communications
      5.2. Static ARP/ND entries on switches
      5.3. ARP/ND Proxy approaches
      5.4. Multicast Scaling Issues
   6. Practices to scale ARP/ND in Overlay models
   7. Summary and Recommendations
   8. Security Considerations
   9. IANA Considerations
   10. Acknowledgements
   11. References
      11.1. Normative References
      11.2. Informative References
   Authors' Addresses
1. Introduction
   This draft documents some operational practices that allow ARP/ND
   to scale in data center environments.
   As described in [RFC6820], the increasing trend of rapid workload
   shifting and server virtualization in modern data centers requires
   servers to be loaded (or re-loaded) with different VMs or
   applications at different times. Different VMs residing on one
   physical server may have different IP addresses, or may even be in
   different IP subnets.
   In order to allow a physical server to be loaded with VMs in
   different subnets, or VMs to be moved to different server racks
   without IP address re-configuration, the networks need to enable
   multiple broadcast domains (many VLANs) on the interfaces of L2/L3
   boundary routers and ToR switches and allow some subnets to span
   across multiple router ports.
   Note: The L2/L3 boundary routers in this draft are capable of
   forwarding IEEE 802.1 Ethernet frames (layer 2) without MAC header
   change. When subnets span across multiple ports of those routers,
   they still fall under the category of single link, specifically the
   multi-access link model recommended by [RFC4903]. They are
   different from the "multi-link" subnets described in [Multi-Link]
   and [RFC4903], which refer to different physical media with the
   same prefix connected to one router. Within the "multi-link" subnet
   described in [RFC4903], layer 2 frames from one port cannot be
   natively forwarded to another port without a header change.
   Unfortunately, when the combined number of VMs (or hosts) in all
   those subnets is large, this can lead to address resolution (i.e.
   IPv4 ARP and IPv6 ND) scaling issues. There are three major issues
   associated with ARP/ND address resolution protocols when subnets
   span across multiple L2/L3 boundary router ports:
   1) ARP/ND messages are flooded to many physical link segments,
     which can reduce the bandwidth available for user traffic;
   2) the ARP/ND processing load on the L2/L3 boundary routers
     increases;
   3) in IPv4, every end station in a subnet receives ARP broadcast
     messages from all other end stations in the subnet. IPv6 ND has
     eliminated this issue by using multicast.
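
   To illustrate why multicast helps with issue 3, the following
   Python sketch derives the solicited-node multicast group to which
   a Neighbor Solicitation for a given target is addressed, and the
   Ethernet multicast address that carries it. Only stations whose
   addresses share the same low-order 24 bits need to listen to that
   group. The function names are illustrative only and do not come
   from this draft.

```python
import ipaddress


def solicited_node_multicast(target: str) -> ipaddress.IPv6Address:
    """Return ff02::1:ffXX:XXXX built from the low 24 bits of the target."""
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Map an IPv6 multicast group to its 33:33:xx:xx:xx:xx Ethernet address."""
    low32 = int(group) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{octet:02x}" for octet in octets)


if __name__ == "__main__":
    group = solicited_node_multicast("2001:db8::1:2:3")
    print(group, multicast_mac(group))
```

   An ARP request, by contrast, is always sent to the Ethernet
   broadcast address and is therefore delivered to every end station
   in the subnet.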
   Since the majority of data center servers are moving towards 1G or
   10G ports, the bandwidth taken by ARP/ND messages, even when
   flooded to all physical links, becomes negligible compared to the
   link bandwidth. In addition, IGMP/MLD snooping [RFC4541] can
   further reduce the ND multicast traffic on some physical link
   segments.
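
   As a rough back-of-the-envelope check of the bandwidth claim, the
   Python sketch below estimates the share of a link consumed by ARP
   broadcasts. It assumes each ARP message occupies a minimum-size
   (64-byte) untagged Ethernet frame plus 20 bytes of preamble and
   inter-frame gap. The 2000 ARP/s rate used in the example is an
   assumption chosen for illustration, not a measured flooding rate.

```python
ARP_FRAME_BYTES = 64      # minimum Ethernet frame size, including FCS
WIRE_OVERHEAD_BYTES = 20  # preamble + SFD (8 bytes) and inter-frame gap (12)


def arp_link_share(arp_per_second: int, link_bps: float) -> float:
    """Fraction of the link bandwidth consumed by ARP frames on the wire."""
    bits_per_second = arp_per_second * (ARP_FRAME_BYTES + WIRE_OVERHEAD_BYTES) * 8
    return bits_per_second / link_bps


if __name__ == "__main__":
    for name, bps in (("1G", 1e9), ("10G", 1e10)):
        print(f"{name}: {arp_link_share(2000, bps):.4%} of link bandwidth")
```

   Under these assumptions the ARP load is roughly 1.3 Mbit/s, i.e.
   well under one percent of a 1G link.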
   As modern servers' computing power increases, the processing taken
   by a large number of ARP broadcast messages becomes less
   significant to servers. For example, lab testing shows that 2000
   ARP/s only takes 2% of the CPU of a single-core server. Therefore,
   the impact of ARP broadcasts on end stations is not significant
   on today's servers.
   Statistics gathered by Merit Network [ARMD-Statistics] have shown
   that the major impact of a large number of mobile VMs in a data
   center is on the L2/L3 boundary routers, i.e. issue 2 above.
   This draft documents some simple practices which can scale ARP/ND
   in a data center environment, especially by reducing the processing
   load on L2/L3 boundary routers.
2. Terminology
   This document reuses much of the terminology from [RFC6820]. Many
   of the definitions are presented here to aid the reader.
   ARP:     IPv4 Address Resolution Protocol [RFC826]
   Aggregation Switch: A Layer 2 switch interconnecting ToR switches
   Bridge:  IEEE802.1Q compliant device. In this draft, Bridge is used
            interchangeably with Layer 2 switch.
   DC:      Data Center
   DA:      Destination Address
   End Station:   VM or physical server, whose address is either the
            destination or the source of a data frame.
   EOR:     End of Row switches in data center.
   NA:      IPv6's Neighbor Advertisement
   ND:      IPv6's Neighbor Discovery [RFC4861]
   NS:      IPv6's Neighbor Solicitation
   SA:      Source Address
   ToR:     Top of Rack Switch (also known as access switch).
   UNA:     IPv6's Unsolicited Neighbor Advertisement
   VM:      Virtual Machine
   Subnet:  Refers to the multi-access link subnet described in
            [RFC4903]
3. Common DC network Designs
   Some common network designs for a data center include:
     1) Layer 3 connectivity to the access switch,
     2) Large Layer 2, and
     3) Overlay models.
   There is no single network design that fits all cases.  The
   following sections document some of the common practices to scale
   Address Resolution under each network design.
4. Layer 3 to Access Switches
   This network design configures Layer 3 to the access switches,
   effectively making the access switches the L2/L3 boundary routers
   for the attached VMs.
   As described in [RFC6820], many data centers are architected so
   that ARP/ND broadcast/multicast messages are confined to a few
   ports (interfaces) of the access switches (i.e. ToR switches).
   Another variant of the Layer 3 solution is Layer 3 infrastructure
   configured all the way to servers (or even to the VMs), which
   confines the ARP/ND broadcast/multicast messages to the small
   number of VMs within the server.
   Advantage: Both ARP and ND scale well. There is no address
   resolution issue in this design.
   Disadvantage: The main disadvantage to this network design occurs
   during VM movement.  During VM movement, either VMs need an address
   change or switches/routers need a configuration change when the VMs
   are moved to different locations.
   Summary: This solution is more suitable to data centers which have
   a static workload and/or network operators who can re-configure IP
   addresses/subnets on switches before any workload change.  No
   protocol changes are suggested.
5. Layer 2 practices to scale ARP/ND
  5.1. Practices to alleviate ARP/ND burden on L2/L3 boundary routers
   The ARP/ND broadcast/multicast messages in a Layer 2 domain can
   negatively affect the L2/L3 boundary routers, especially with a
   large number of VMs and subnets. This section describes some
   commonly used practices for reducing the ARP/ND processing required
   on L2/L3 boundary routers.
   5.1.1. Communicating with a peer in a different subnet
   Scenario: When the originating end station doesn't have its default
   gateway MAC address in its ARP/ND cache and needs to communicate
   with a peer in a different subnet, it needs to send ARP/ND requests
   to its default gateway router to resolve the router's MAC address.
   If there are many subnets on the gateway router and a large number
   of end stations in those subnets that don't have gateway MAC in
   their ARP/ND caches, the gateway router has to process a very large
   number of ARP/ND requests. This is often CPU intensive as ARP/ND
   messages are usually processed by the CPU (and not in hardware).
   Note: Any centralized configuration which pre-loads the default
   gateway MAC addresses is not included in this scenario.
   Solution: For IPv4 networks, a practice to alleviate this problem
   is to have the L2/L3 boundary router send periodic gratuitous ARP
   [GratuitousARP] messages, so that all the connected end stations
   can refresh their ARP caches. As a result, most (if not all) end
   stations will not need to send ARP requests for the gateway routers
   when they need to communicate with external peers.
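
   As a minimal sketch of what such a gratuitous ARP looks like on the
   wire, the Python function below builds an untagged Ethernet frame
   carrying an ARP announcement in which the sender and target
   protocol addresses are both the router's own IP address. It uses
   the request form of the announcement; the function name and the
   example addresses are hypothetical, and actually transmitting the
   frame (and choosing the repeat interval) is left to the operator's
   platform.

```python
import socket
import struct


def gratuitous_arp(router_mac: str, router_ip: str) -> bytes:
    """Build a broadcast Ethernet frame carrying a gratuitous ARP request."""
    mac = bytes.fromhex(router_mac.replace(":", ""))
    ip = socket.inet_aton(router_ip)
    # Ethernet header: broadcast destination, router source, EtherType ARP.
    ether = b"\xff" * 6 + mac + struct.pack("!H", 0x0806)
    # ARP header: Ethernet (1), IPv4 (0x0800), hlen 6, plen 4, opcode 1.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, zeroed target MAC, target IP equal to sender IP.
    arp += mac + ip + b"\x00" * 6 + ip
    return ether + arp


if __name__ == "__main__":
    frame = gratuitous_arp("00:11:22:33:44:55", "192.0.2.1")
    print(len(frame), frame.hex())
```

   End stations that already hold an entry for the router's IP address
   can refresh it from the sender fields of this frame without sending
   a request of their own.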
   For the above scenario, IPv6 end stations are still required to
   send unicast ND messages to their default gateway router (even with
   those routers periodically sending Unsolicited Neighbor
   Advertisements) because IPv6 requires bi-directional path
   validation.
   Advantage: Reduction of ARP requests to be processed by L2/L3
   boundary router for IPv4.
   Disadvantage: This practice doesn't reduce ND processing on the
   L2/L3 boundary router for IPv6 traffic.
   Recommendation: If the network is an IPv4-only network, then this
   approach can be used. For an IPv6 network, operators should
   consider the work in progress described in [Impatient-NUD]. Note:
   ND and SEND [RFC3971] use the bi-directional nature of queries to
   detect and prevent security attacks.
   5.1.2. L2/L3 boundary router processing of inbound traffic
   Scenario: When an L2/L3 boundary router receives a data frame
   destined for a local subnet and the destination is not in the
   router's ARP/ND cache, some routers hold the packet and trigger an
   ARP/ND request to resolve the L2 address. The router may need to
   send multiple ARP/ND requests until either a timeout is reached or
   an ARP/ND reply is received before forwarding the data packets
   towards the target's MAC address. This process is not only CPU
   intensive but also buffer intensive.
   Solution: To protect a router from being overburdened by resolving
   target MAC addresses, one solution is for the router to limit the
   rate of resolving target MAC addresses for inbound traffic whose
   target is not in the router's ARP/ND cache. When the rate is
   exceeded, the incoming traffic whose target is not in the ARP/ND
   cache is dropped.
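
   The rate-limiting behavior can be sketched with a token bucket, as
   in the Python example below. This is only one possible way to
   implement such a limit; the class name, the function name, and the
   rate and burst parameters are assumptions made for illustration and
   do not describe any particular router implementation.

```python
class ResolutionLimiter:
    """Token bucket limiting how many address resolutions may be started."""

    def __init__(self, rate_per_second: float, burst: int) -> None:
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill tokens for the elapsed time, then try to spend one."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle_inbound(dst_ip: str, cache: dict, limiter: ResolutionLimiter,
                   now: float) -> str:
    """Decide what to do with an inbound packet destined for dst_ip."""
    if dst_ip in cache:
        return "forward"   # target already resolved
    if limiter.allow(now):
        return "resolve"   # hold the packet and send an ARP/ND request
    return "drop"          # resolution rate exceeded
```

   With a rate of 1 resolution per second and a burst of 2, three
   cache misses arriving at the same instant yield "resolve",
   "resolve", "drop"; a further miss one second later is resolved
   again.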
   For an IPv4 network, another common practice to alleviate pain
   caused by this problem is for the router to snoop ARP messages
   between other hosts, so that its ARP cache can be refreshed with
   active addresses in the L2 domain. As a result, there is an
   increased likelihood of the router's ARP cache having the IP-MAC
   entry when it receives data frames from external peers. [RFC6820]
   section 7.1 provides a full description of this problem.
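
   A minimal sketch of the snooping practice, in Python, is shown
   below. It assumes the router can observe the sender IP and MAC
   fields of ARP messages exchanged between other hosts; the function
   names and the max_age parameter are hypothetical and are not taken
   from any router implementation.

```python
from typing import Dict, Optional, Tuple

ArpCache = Dict[str, Tuple[str, float]]  # IP -> (MAC, time last seen)


def snoop_arp(cache: ArpCache, sender_ip: str, sender_mac: str,
              now: float) -> None:
    """Refresh the router's cache from the sender fields of a snooped ARP."""
    if sender_ip == "0.0.0.0":
        return  # ARP probes carry no usable sender binding
    cache[sender_ip] = (sender_mac, now)


def lookup(cache: ArpCache, ip: str, now: float,
           max_age: float) -> Optional[str]:
    """Return the cached MAC for ip, or None if absent or older than max_age."""
    entry = cache.get(ip)
    if entry is None or now - entry[1] > max_age:
        return None
    return entry[0]
```

   A station that stays silent for longer than max_age still falls
   out of the cache, so the router has to resolve it again when
   traffic for it arrives.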
   For IPv6 end stations, routers are supposed to send Router
   Advertisements (RAs) unicast even if they have snooped UNA/NS/NA
   messages from those stations. Therefore, this practice allows an
   L2/L3 boundary router to send unicast RAs to the target instead of
   multicast RAs. [RFC6820] section 7.2 has a full description of
   this problem.
   Advantage: Reduction of the number of ARP requests that routers
   have to send upon receiving IPv4 packets and the number of IPv4
   data frames from external peers that routers have to hold due to
   targets not in ARP cache.
   Disadvantage: The amount of ND processing on routers for IPv6
   traffic is not reduced. IPv4 routers still need to hold data
   packets from external peers and trigger ARP requests if the targets
   of the data packets either don't exist or are not very active. In
   this case, neither IPv4 processing nor IPv4 buffer usage is
   reduced.
   Recommendation: If there is a higher chance of routers receiving
   data packets that are destined for non-existing or inactive
   targets, alternative approaches should be considered.
   5.1.3. Inter subnets communications
   The router could be hit with ARP/ND twice when the originating and
   destination stations are in different subnets attached to the same
   router and those hosts don't communicate with external peers often
   enough. The first hit is when the originating station in subnet-A
   initiates an ARP/ND request to the L2/L3 boundary router if the
   router's MAC is not in the host's cache (5.1.1 above); and the
   second hit is when the L2/L3 boundary router initiates ARP/ND
   requests to the target in subnet-B if the target is not in router's
   ARP/ND cache (5.1.2 above).
   Again, practices described in 5.1.1 and 5.1.2 can alleviate some
   problems in some IPv4 networks.
   For IPv6 traffic, the practices don't reduce the ND processing on
   L2/L3 boundary routers.
   Recommendation: Consider the recommended approaches described in
   5.1.1 and 5.1.2. However, any solution that relaxes the
   bi-directional requirement of IPv6 ND disables the security that
   the two-way ND exchange provides.
  5.2. Static ARP/ND entries on switches
   In a data center environment, the placement of L2 and L3 addressing
   may be orchestrated by Server (or VM) Management System(s).
   Therefore, it may be possible for static ARP/ND entries to be
   configured on routers and/or servers.
   Advantage: This methodology has been used to reduce ARP/ND
   fluctuations in large scale data center networks.
   Disadvantage: When some VMs are added, deleted, or moved, many
   switches' static entries need to be updated. In a data center with
   virtualized servers, those events can happen frequently. For
   example, when one VM is added to one server, if the subnet of this
   VM spans across 15 access switches, all of them need to be updated.
   Network management (SNMP, NETCONF, or proprietary) mechanisms are
   available to provide updates or incremental updates. However, there
   is no well-defined approach for switches to synchronize their
   content with the management system for efficient incremental
   updates.
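
   To make the notion of an incremental update concrete, the Python
   sketch below computes the difference between the static entries
   currently installed on one switch and the entries the management
   system wants installed. It is an illustration only: the function
   name and the dictionary representation are assumptions, and the
   sketch does not describe how the difference would be transported
   to the switch.

```python
from typing import Dict, List, Tuple


def incremental_update(installed: Dict[str, str],
                       desired: Dict[str, str]
                       ) -> Tuple[Dict[str, str], List[str]]:
    """Return (entries to add or change, IP addresses to remove)."""
    add_or_change = {ip: mac for ip, mac in desired.items()
                     if installed.get(ip) != mac}
    remove = sorted(ip for ip in installed if ip not in desired)
    return add_or_change, remove


if __name__ == "__main__":
    installed = {"10.1.0.5": "00:00:5e:00:53:01",
                 "10.1.0.6": "00:00:5e:00:53:02"}
    desired = {"10.1.0.5": "00:00:5e:00:53:01",
               "10.1.0.7": "00:00:5e:00:53:03"}
    print(incremental_update(installed, desired))
```

   In the 15-switch example above, only the single new entry would
   need to be pushed to each switch, rather than the full table.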
   Recommendation: Additional work may be needed within IETF (e.g.
   NETCONF, NVO3, I2RS, etc.) to get prompt incremental updates of
   static ARP/ND entries when changes occur.
  5.3. ARP/ND Proxy approaches
   RFC1027 [RFC1027] specifies one ARP proxy approach. However,
   RFC1027 is not a scaling mechanism. Since the publication of
   RFC1027 in 1987 many variants of ARP proxy have been deployed.
   RFC1027's ARP Proxy is for a gateway to return its own MAC address
   on behalf of the target station.
   [ARP_Reduction] describes a type of "ARP Proxy" which is for a ToR
   switch to snoop ARP requests and return the target station's MAC if
   the ToR has the information in its cache. However, [RFC4903]
   doesn't recommend the caching approach described in [ARP_Reduction]
   because such a cache prevents any type of fast mobility between
   layer 2 ports, and breaks Secure Neighbor Discovery [RFC3971].
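
   The difference between the two proxy styles described above can be
   sketched in Python as follows. Both functions are simplified
   illustrations with hypothetical names; they ignore cache aging,
   VLANs, and the other details a real implementation would need.

```python
from typing import Dict, Optional, Set


def rfc1027_style_reply(target_ip: str, gateway_mac: str,
                        reachable_via_gateway: Set[str]) -> Optional[str]:
    """Gateway answers with its OWN MAC on behalf of the target."""
    if target_ip in reachable_via_gateway:
        return gateway_mac
    return None  # not proxied; the request is left unanswered


def cached_proxy_reply(target_ip: str,
                       tor_cache: Dict[str, str]) -> Optional[str]:
    """ToR answers with the TARGET's MAC if it has snooped it earlier."""
    return tor_cache.get(target_ip)  # None means: flood the request as usual
```

   In the cached variant, a stale entry keeps being returned after the
   target has moved to another port, which is the mobility concern
   noted above.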
   IPv6 ND Proxy [RFC4389] specifies a proxy used between an Ethernet
   segment and other segments, such as wireless or PPP segments. ND
   Proxy [RFC4389] doesn't allow a proxy to send NA on behalf of the
   target, to ensure that the proxy does not interfere with hosts
   moving from one segment to another. Therefore, ND Proxy [RFC4389]
   doesn't reduce the number of ND messages sent to the L2/L3 boundary
   router.
   Bottom line, the term "ARP/ND Proxy" has different interpretations
   depending on vendors and/or environments.
   Recommendation: For IPv4, even though those Proxy ARP variants (not
   RFC1027) have been used to reduce ARP traffic in various
   environments, there are many issues with caching.
   IETF should consider making proxy recommendations for the data
   center environment as a transition issue to help DC operators
   transitioning to IPv6. The "Guideline for proxy developers"
   [RFC4389] should be considered when developing any new proxy
   protocols to scale ARP.
  5.4. Multicast Scaling Issues
   Multicast snooping (IGMP/MLD) has different implementations and
   scaling issues. [RFC4541] notes that multicast IGMPv2/v3 snooping
   has trouble with subnets that include both IGMPv2 and IGMPv3.
   [RFC4541] also notes that MLDv2 snooping requires use of either
   DMAC address filtering or deeper inspection of frames/packets to
   allow for scaling.
   MLDv2 snooping needs to be re-examined for scaling within the DC.
   Efforts such as IGMP/MLD explicit tracking [IGMP-MLD-tracking] for
   downstream hosts need to provide better scaling than IGMP/MLDv2
   snooping.
6. Practices to scale ARP/ND in Overlay models
   There are several drafts on using overlay networks to scale large
   layer 2 networks (or avoid the need for large L2 networks) and
   enable mobility (e.g. draft-wkumari-dcops-l3-vmmobility-00,
   draft-mahalingam-dutt-dcops-vxlan-00). TRILL and IEEE 802.1ah
   (MAC-in-MAC) are other types of overlay networks used to scale
   Layer 2.
   Overlay networks hide the VMs' addresses from the interior switches
   and routers, thereby greatly reducing the number of addresses
   exposed to the interior switches and routers. The Overlay Edge
   nodes that perform the network address encapsulation/decapsulation
   still handle all remote stations' addresses that communicate with
   the locally attached end stations.
   For a large data center with many applications, these applications'
   IP addresses need to be reachable by external peers. Therefore, the
   overlay network may have a bottleneck at the Gateway node(s) in
   resolving target stations' physical addresses (MAC or IP) and the
   overlay edge addresses within the data center.
   Here are some approaches that can be used to minimize the problem:
     1. Use static mapping as described in Section 5.2.
     2. Have multiple L2/L3 boundary nodes (i.e. routers), with each
        handling a subset of the station addresses which are visible
        to external peers (e.g. Gateway #1 handles one set of
        prefixes, Gateway #2 handles another set of prefixes, etc.).
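
   The second approach can be sketched as a longest-prefix lookup from
   a station address to the gateway responsible for it. The Python
   example below is a minimal illustration; the function name, the
   gateway labels, and the documentation prefixes are assumptions and
   do not represent a recommended address plan.

```python
import ipaddress
from typing import Dict, Optional


def gateway_for(address: str, assignments: Dict[str, str]) -> Optional[str]:
    """Return the gateway whose longest matching prefix covers the address."""
    ip = ipaddress.ip_address(address)
    best = None
    for prefix, gateway in assignments.items():
        net = ipaddress.ip_network(prefix)
        if ip.version != net.version or ip not in net:
            continue
        if best is None or net.prefixlen > best[0].prefixlen:
            best = (net, gateway)
    return best[1] if best else None


if __name__ == "__main__":
    assignments = {
        "192.0.2.0/25": "gateway-1",
        "192.0.2.128/25": "gateway-2",
        "2001:db8::/32": "gateway-3",
    }
    print(gateway_for("192.0.2.200", assignments))
```

   Each gateway then only resolves and caches the stations within its
   own prefixes, which divides the address resolution load among the
   gateways.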
7. Summary and Recommendations
   This memo describes some common practices which can alleviate the
   impact of address resolution on L2/L3 gateway routers.
   In data centers, no single solution fits all deployments. This
   memo has summarized some practices in various scenarios and the
   advantages and disadvantages of each of these practices.
   In some of these scenarios, the common practices could be improved
   by creating and/or extending existing IETF protocols. These
   protocol change recommendations are:
      . Relax some bi-directional requirements of IPv6 ND in some
        environments. However, other issues will be introduced when
        the bi-directional requirement of ND is relaxed. Therefore,
        a comprehensive study is necessary before making those
        changes.
      . Create incremental "update" schemes for efficient maintenance
        of static ARP/ND entries.
      . Develop IPv4 ARP/IPv6 ND Proxy standards for use in the data
        center. The "Guideline for proxy developers" [RFC4389] should
        be considered when developing any new proxy protocols to
        scale ARP/ND.
      . Consider scaling issues with IGMP/MLD snooping to determine
        if new alternatives can provide better scaling.
8. Security Considerations
   This draft documents existing solutions and proposes additional
   work that could be initiated to extend various IETF protocols to
   better scale ARP/ND for the data center environment.
   Security is a major issue in data center environments.
   Therefore, security should be seriously considered when developing
   any future protocol extension.
9. IANA Considerations
   This document does not request any action from IANA.
10. Acknowledgements
   We want to acknowledge the ARMD WG and the following people for
   their valuable inputs to this draft: Joel Jaeggli, Dave Thaler,
   Susan Hares, Benson Schliesser, T. Sridhar, Ron Bonica, Kireeti
   Kompella, and K.K. Ramakrishnan.
11. References
  11.1. Normative References
   [GratuitousARP] S. Cheshire, "IPv4 Address Conflict Detection", RFC
             5227, July 2008.
   [IGMP-MLD-tracking] H. Asaeda, and N. Leymann, "IGMP/MLD-Based
             Explicit Membership Tracking Function for Multicast
             Routers"
             (http://tools.ietf.org/html/draft-ietf-pim-explicit-tracking-02),
             Oct, 2012.
   [RFC826] D.C. Plummer, "An Ethernet address resolution protocol."
             RFC826, Nov 1982.
   [RFC1027] Mitchell, et al, "Using ARP to Implement Transparent
             Subnet Gateways"
             (http://datatracker.ietf.org/doc/rfc1027/)
   [RFC3971] Arkko, et al, "Secure Neighbor Discovery (SEND)",
             RFC3971, March 2005
   [RFC4389] Thaler, et al, "Neighbor Discovery Proxies (ND Proxy)",
             RFC4389, April 2006
   [RFC4541] Christensen, et al, "Considerations for Internet Group
             Management Protocol (IGMP) and Multicast Listener
             Discovery (MLD) Snooping Switches", RFC 4541, May 2006
   [RFC4861] Narten, et al, "Neighbor Discovery for IP version 6
             (IPv6)", RFC4861, Sept 2007
   [RFC4903] Thaler, "Multilink Subnet Issues", RFC4903, July 2007
   [RFC6820] Narten, et al, "Address Resolution Problems in Large Data
             Center Networks", RFC6820, Jan 2013
  11.2. Informative References
    [Impatient-NUD] E. Nordmark, I. Gashinsky,
             "draft-ietf-6man-impatient-nud"
    [ARMD-Statistics] M. Karir, J. Rees, "Address Resolution
             Statistics", draft-karir-armd-statistics-01.txt
             (expired), July 2011.
             https://datatracker.ietf.org/doc/draft-karir-armd-statistics/
    [ARP_Reduction] Shah, et al, "ARP Broadcast Reduction for Large
             Data Centers", draft-shah-armd-arp-reduction-02.txt
             (expired), Oct 2011.
             https://datatracker.ietf.org/doc/draft-shah-armd-arp-reduction/
    [Multi-Link] Thaler, et al, "Multi-link Subnet Support in IPv6",
             draft-ietf-ipv6-multilink-subnets-00.txt (expired), Dec
             2002.
             https://datatracker.ietf.org/doc/draft-ietf-ipv6-multilink-subnets/
Authors' Addresses
   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Phone: (469) 277 5840
   Email: ldunbar@huawei.com
   Warren Kumari
   Google
   1600 Amphitheatre Parkway
   Mountain View, CA 94043
   US
   Email: warren@kumari.net
   Igor Gashinsky
   Yahoo
   45 West 18th Street 6th floor
   New York, NY 10011
   Email: igor@yahoo-inc.com