Internet Engineering Task Force                         R. Szarecki, Ed.
Internet-Draft                                          K. Vairavakkalai
Intended status: Informational                           N. Venkataraman
Expires: August 10, 2019                           Juniper Networks Inc.
                                                        February 6, 2019


          Use of Abstract NH in Scale-Out peering architecture
          draft-szarecki-grow-abstract-nh-scaleout-peering-00

Abstract

   Many large-scale service provider networks use some form of scale-out
   architecture at peering sites.  In such an architecture, each
   participating Autonomous System (AS) deploys multiple independent
   Autonomous System Border Routers (ASBRs) for peering, and Equal Cost
   Multi-Path (ECMP) load balancing is used between them.  There are
   numerous benefits to this architecture, including but not limited to
   N+1 redundancy and the ability to flexibly increase capacity as
   needed.  A cost of this architecture is an increase in the amount of
   state in both the control and data planes.  This has negative
   consequences for network convergence time and scale.

   In this document we describe how to mitigate these negative
   consequences through configuration of the routing protocols, both BGP
   and IGP, to utilize what we term the "Abstract Next-Hop" (ANH).  Use
   of ANH allows us to both reduce the number of BGP paths in the
   control plane and enable rapid path invalidation (hence, network
   convergence and traffic restoration).  We require no new protocol
   features to achieve these benefits.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 10, 2019.




Szarecki, et al.         Expires August 10, 2019                [Page 1]

Internet-Draft      Abstract NH in scale-out peering       February 2019


Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
     1.1.  Scale-Out peering . . . . . . . . . . . . . . . . . . . .   4
       1.1.1.  Low latency . . . . . . . . . . . . . . . . . . . . .   4
       1.1.2.  All equal cost paths utilization  . . . . . . . . . .   4
       1.1.3.  Summary . . . . . . . . . . . . . . . . . . . . . . .   5
     1.2.  Common BGP Deployment Configurations  . . . . . . . . . .   7
       1.2.1.  IBGP with Next-Hop Unchanged  . . . . . . . . . . . .   7
         1.2.1.1.  Example . . . . . . . . . . . . . . . . . . . . .   7
       1.2.2.  IBGP with Next-Hop-Self . . . . . . . . . . . . . . .   8
   2.  The BGP Abstract Next-Hop . . . . . . . . . . . . . . . . . .   8
   3.  Use of Abstract Next-Hop in scale-out peering design  . . . .   9
     3.1.  Egress ASBR-Peer AS Abstract Next Hop (AP-ANH)  . . . . .  10
     3.2.  The Site-Peer AS Abstract Next Hop (SP-ANH) . . . . . . .  11
     3.3.  Assignment of Abstract Next Hops  . . . . . . . . . . . .  14
       3.3.1.  Native IP Networks  . . . . . . . . . . . . . . . . .  14
       3.3.2.  MPLS  . . . . . . . . . . . . . . . . . . . . . . . .  14
         3.3.2.1.  Identical BGP address space and paths received on
                   all ASBRs . . . . . . . . . . . . . . . . . . . .  14
         3.3.2.2.  Different address space sets or paths received on
                   different ASBRs . . . . . . . . . . . . . . . . .  14
       3.3.3.  SPRING  . . . . . . . . . . . . . . . . . . . . . . .  15
         3.3.3.1.  Identical BGP address space and path received on
                   all ASBRs . . . . . . . . . . . . . . . . . . . .  15
         3.3.3.2.  Different address space sets or paths received on
                   different ASBRs . . . . . . . . . . . . . . . . .  15
   4.  Worked Examples . . . . . . . . . . . . . . . . . . . . . . .  16
     4.1.  Failure of a proper subset of EBGP sessions with a given
           peer AS on a single ASBR  . . . . . . . . . . . . . . . .  16
     4.2.  Failure of a proper subset of EBGP sessions with a given
           peer AS on each ASBR of a given site  . . . . . . . . . .  16
     4.3.  Failure of all EBGP sessions with a given peer AS on
           single ASBR; Failure of a single ASBR . . . . . . . . . .  17
     4.4.  All EBGP sessions with a given peer AS on all ASBRs . . .  17
   5.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  18
   6.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  18
   7.  Security Considerations . . . . . . . . . . . . . . . . . . .  18
   8.  Informative References  . . . . . . . . . . . . . . . . . . .  18
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  20

1.  Introduction

   Common to all large Internet networks are the requirements for large
   aggregate bandwidth and low latency.  As network sizes and traffic
   volumes have increased, it has become common to use scale-out
   architectures to satisfy these requirements.  Use of these techniques
   within individual networks is well-known.  Here, we explore a scale-
   out architecture for interconnecting different Autonomous Systems
   (ASes).

   Below, we show an example topology.  Content is hosted within AS 2,
   and consumers connect via the various ISP Metro ASes.


   +---------------+      +----------------+       +---------------+
   |               |      |                +-------+               |
   |               +------+                +-------+      AS 30    |
   |               +------+                |       |   ISP Metro   |
   |               +------+                |  /----+               |
   |               |      |                | //----+               |
   |      AS 2     |      |       AS 1     |//     +---------------+
   |    Content    |      |  ISP BackBone  X/
   |   provider    +------+                X\
   |               +------+                |\\     +---------------+
   |               |      |                | \\----+               |
   |               |      |                |  \----+     AS 31     |
   |               +------+                |       |   ISP Metro   |
   |               +------+                +-------+               |
   |               +------+                +-------+               |
   +---------------+      +----------------+       +---------------+


                                 Figure 1

   ASes 1 and 2 are connected at multiple, geographically diverse,
   sites.  Geographic diversity is required for reasons including
   resiliency, minimization of latency, and minimization of cost
   associated with long-distance data transmission.

1.1.  Scale-Out peering

   The same trends that have driven the use of scale-out architectures
   within ASes drive interest in using them at peering sites.  In such
   an architecture, each AS at the peering site deploys multiple
   independent Autonomous System Border Routers (ASBRs).  Benefits that
   can be realized include N+1 redundancy and the ability to flexibly
   increase capacity as needed.  The ASBRs are often connected to the
   rest of their AS in a leaf-spine topology through core routers, and
   augmented with a per-site pair of BGP route reflectors (RRs).  See
   for example SITE1 in Figure 2, below.

   The fundamental requirements in this architecture are:

   a.  Keep traffic on a path that has low latency.

   b.  Utilize all peering links that offer low latency.

   c.  In the event of failure, minimize the time needed to restore
       service.

1.1.1.  Low latency

   BGP, the Border Gateway Protocol, does not directly carry delay
   information.  We make the general assumption in this document that
   paths selected by the BGP best path algorithm [RFC4271] will provide
   lower latency than those not selected.  This assumption is not
   guaranteed to be true, but lacking special arrangements between
   peering ASes, it is what the protocol is able to provide.

1.1.2.  All equal cost paths utilization

   In order to use all links between peering ASes that provide the same
   BGP path costs to the destination prefix, at a minimum BGP speakers
   need to be enabled for multi-path operation.  Additionally, all AS
   ingress BGP speakers need to know at least all equal and best paths
   to the destination via multiple ASBRs.  If a full IBGP mesh is used,
   this happens naturally.  However, IBGP full meshes are uncommon in
   large networks and are even more impractical in scale-out
   architectures due to the high total number of ASBRs.

   The well-known techniques to deal with full-mesh scale challenges -
   Route Reflection [RFC4456] and Confederations [RFC5065] - hide
   redundant paths, as they advertise only a single selected path to
   their clients.  While this helps keep path and session scale
   manageable, it makes BGP multipath unusable.  We overcome this by
   using BGP ADD-PATH [RFC7911] between the RR and its clients (or among
   sub-ASes).
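
   As a rough illustration of the path hiding described above, the
   following Python sketch contrasts what a route reflector sends to a
   client with and without ADD-PATH.  It is a simplification under
   stated assumptions: the names Path and reflect are hypothetical,
   best-path selection is reduced to a LOCAL_PREF comparison rather
   than the full algorithm of [RFC4271], and sending every equally
   preferred path is only one possible ADD-PATH selection policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One BGP path for a prefix (hypothetical, heavily reduced model)."""
    prefix: str
    next_hop: str
    local_pref: int = 100


def reflect(paths, add_path=False):
    """Return the paths an RR sends to a client for a single prefix.

    Assumption: "best" means highest local_pref, ties broken by the
    lowest next-hop string.  Real BGP best-path selection has more steps.
    """
    ranked = sorted(paths, key=lambda p: (-p.local_pref, p.next_hop))
    if not ranked:
        return []
    if not add_path:
        # Classic route reflection: only the single best path is sent.
        return ranked[:1]
    # With ADD-PATH (as modeled here): send every equally preferred path.
    best = ranked[0].local_pref
    return [p for p in ranked if p.local_pref == best]


# Example: two equal paths via site 1 ASBRs and a less preferred one.
learned = [
    Path("203.0.113.0/24", "BR_1.2"),
    Path("203.0.113.0/24", "BR_1.1"),
    Path("203.0.113.0/24", "BR_2.1", local_pref=90),
]
without_add_path = reflect(learned)
with_add_path = reflect(learned, add_path=True)
```

   Without ADD-PATH the client learns a single next hop and cannot
   build an ECMP set; with it, both equally preferred paths are
   available for multipath.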




1.1.3.  Summary

   In summary, for a scale-out peering architecture:

   o  BGP multipath needs to be enabled on all IBGP sessions inside the
      AS.

   o  BGP multipath needs to be enabled on all EBGP sessions of each
      ASBR.

   o  BGP ADD-PATH needs to be enabled on all IBGP sessions.

      *  RRs need to be able to send multiple paths per prefix.  The
         upper limit depends on:

         +  The maximum number of ASBRs per site (say N).

         +  Possibly also on the maximum number of EBGP sessions held by
            a single ASBR with a single peer AS (say M), depending on
            the BGP next-hop attribute (BGP-NH) configuration.

      *  RR clients/ASBRs may need to be able to send multiple paths per
         prefix if the BGP-NH configuration is "next hop unchanged".
         The upper limit depends on the maximum number of EBGP sessions
         held by a single ASBR with a single peer AS (say M).

   For further consideration the following network diagram will be used
   for reference:

   +------------------------------------------------------------------+
   |    AS 1                                    +--------------------+|
   |  +----------------------------------+      |+------+ SITE3 o--o ||
   |  | SITE1         +-------- Cost 10 -+------+|CR_3.1|--+  o-|RR| ||
   |  |   o------o    |                  |      |+------+  |  |Ro--o ||
   |  | O-|RR_1.1|    |                  |      |+------+  |  o--o   ||
   |  | |Ro------O    |          +--- Cost 10 --+|CR_3.K|  |+-------+||
   |  | O------O  +------+       |       |      |+---+--+  ||BR_3.N"|||
   |  |           |CR_1.1|-------+- Cost 10 -+  |    |     |+-------+||
   |  |           +------+       |       |   |  +----+-----+---------+|
   |  |           / / \      +------+    |   |    Cost 15 Cost 15     |
   |  |          / /   \     |CR_1.K|--Cost+ |  +----+-----+---------+|
   |  |         /  |    \    +------+   10 | |  |+---+--+  |   SITE2 ||
   |  |        /   |     \    /  | \     | | +--+|CR_2.K|  |  o--o   ||
   |  |       /    |      \--X-\ /  \    | |    |+------+  |  |RR|-o ||
   |  |      /  /--+--------/   X   |    | |    |+------+  |  o--oR| ||
   |  |     /  /   |   /-------/ \  \    | +----+|CR_2.1|<-+    o--o ||
   |  |    /  /    |  /           \  \   |      |+------+            ||
   |  | +------+ +------+       +------+ |      |+------+   +-------+||
   |  | |BR_1.1| |BR_1.2|-  -  -|BR_1.N| |      ||BR_2.1|   |BR_2.N'|||
   |  | +X----X+ +-X---X+       +-X---X+ |      |+-+--+-+   +-+---+-+||
   |  +---X----X----X---X--------X-----X-+      +--+--+-------+---+--+|
   +-------X----X----X---X-------+------X----------+--+-------+---+---+
            \    \   |    \      |       \----\    |  |       |   |
    BR_1.1   \    \  |     \-----+----------\  \   |  |       \   |
     ^        \-\  \-+-----------+-------\   \  \  \  \        \  \
     X BR_1.2    \   |           |        \   \  \  \  \        \  \
     X  ^         \  |           /         \   \ |   \  \       |  |
     X  X BR_1.N   \ \   /------/           \  | |    \  \      |  |
     X  X  ^        \ \ /                    \ | |     \  \     |  |
     X  X  X        | | |   ^ ^ ^            | | |      \  \    |  |
     X  X  X        | | |   | | |            | | |       \  \   |  |
   +---------+ +----+-+-+---+-+-+------------+-+-+--------X--X--+--+--+
   |         | |    | | |   | | |            | | |         \  \ |  |  |
   |         | |  +-+-+-++ ++-+-+-+        +------+       +------+ |  |
   |         | |  |PR_2.1| |PR_2.2|-  -  - |PR_2.M|       |PR_2.P+--+ |
   |         | |  +------+ +------+        +------+       +--+---+.T| |
   |         | |                                             +------+ |
   |   AS 3  | |                            AS 2                      |
   +---------+ +------------------------------------------------------+
   |==================================================================|
   |CR - Core Router                                                  |
   |BR - ASBR and/or Customer Edge in AS1                             |
   |PR - ASBR in peering ASes                                         |
   |==================================================================|

                                 Figure 2

1.2.  Common BGP Deployment Configurations

1.2.1.  IBGP with Next-Hop Unchanged

   In one standard BGP configuration, an ASBR, when it advertises an
   externally learned prefix into IBGP, does not modify the BGP-NH.  So,
   the BGP-NH is set to the IP address of an interface on the external
   peering router.  The strength of this technique is the short time
   needed to restore connectivity with all equal cost multi-path (ECMP)
   paths in use and with traffic on low latency paths.  The drawback is
   extremely high BGP Routing Information Base (RIB) scale -
   proportional to the number of inter-AS links.

1.2.1.1.  Example

   Let's assume that in the network of Figure 2, all PR_2.x of AS2
   advertise the same set of prefixes on all sessions to AS1.

   If BR_1.1-BR_1.N and BR_2.1-BR_2.N' each advertise only one path per
   prefix to their respective RRs, then as the result of ADD-PATH among
   RRs, BRs and CRs, at site 3 the BRs and CRs will learn N+N' paths per
   prefix learned from AS2.  This is sufficient to equally distribute
   load among all N ASBRs on site 1 (note the IGP cost between site 2
   and site 3).

   However, when interfaces over which all BR_1.1-BR_1.N learned their
   best path become unavailable (say interfaces to PR_2.1 in all cases,
   as a result of the failure of PR_2.1), the route to the BGP-NH -
   that is, the IP address of the PR_2.1 interface - is removed from the
   IGP.  BGP speakers at other sites (BR_3.x) will react by temporarily
   directing traffic to site 2 (BR_2.1-BR_2.N').  This switchover may
   happen in sub-second time, in a prefix-scale-independent manner,
   thanks to techniques commonly known as BGP PIC Edge
   [I-D.ietf-rtgwg-bgp-pic].  As a result, traffic is on a path other
   than the lowest cost path, as the connection from site 1 to AS2 is
   not entirely broken (links to PR_2.2-PR_2.M are operational).

   Subsequently, all BR_1.x will update their RRs with a new best path
   (say for PR_2.2) for each prefix (for example, 100,000 of them),
   triggering global convergence.  Such a convergence, for a large
   number of prefixes, may take many minutes.

   In the above example, BRs, RRs, and possibly CRs keep N+N' paths per
   prefix (N from site 1, and N' from site 2).  Provided N=N'=4, this
   makes 8 paths per prefix.

   The solution for sub-optimal routing right after the failure would be
   to enable each BR to advertise multiple paths to its RRs, and for
   them in turn to propagate them to all other RRs and hence BRs.  So,
   each BR_1.x at site 1 will advertise M paths (from PR_2.1-PR_2.M),
   and RR_1.x will have N*M ECMP best paths and advertise them to other
   sites (site 3).  As a result, BGP speakers at other sites (BR_3.x at
   site 3) are provided with N*M paths per prefix from site 1 and N'*M'
   from site 2.  Therefore, to achieve optimal routing immediately
   after failure, a considerably higher scale of BGP paths needs to be
   handled.  If M=N=N'=M'=4 then for each prefix we have 16 best paths
   and 16 non-best, a total of 32.  If AS2 advertises 100,000 prefixes,
   this becomes 3.2M paths.
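
   The path counts used in this example can be reproduced with a few
   lines of Python.  This is only a sketch of the arithmetic; the
   function name paths_per_prefix and its parameters are hypothetical,
   and it assumes that every ASBR of a site holds the same number of
   EBGP sessions with the peer AS.

```python
def paths_per_prefix(sites, all_sessions):
    """Count BGP paths per prefix seen at a remote site.

    sites: list of (asbr_count, ebgp_sessions_per_asbr) tuples, one per
           peering site, e.g. (N, M) for site 1 and (N', M') for site 2.
    all_sessions: False if each ASBR advertises one path per prefix,
                  True if it advertises one path per EBGP session.
    """
    return sum(n * m if all_sessions else n for n, m in sites)


SITES = [(4, 4), (4, 4)]      # N = M = N' = M' = 4, as in the text
PREFIXES = 100_000            # prefixes advertised by AS2

one_path_each = paths_per_prefix(SITES, all_sessions=False)   # N + N'
every_session = paths_per_prefix(SITES, all_sessions=True)    # N*M + N'*M'
total_paths = every_session * PREFIXES
```

   With these inputs, one_path_each is 8, every_session is 32, and
   total_paths is 3,200,000, matching the figures above.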

   Although this solution provides a means of fast,
   prefix-scale-independent traffic switchover, it does so only if an
   ASBR external interface goes down, which triggers an IGP event.  If
   an EBGP session fails but the underlying interface remains up
   (because of misconfiguration, a software defect, etc.), recovery
   still requires per-prefix withdrawal/update that could take many
   minutes at high scale.

1.2.2.  IBGP with Next-Hop-Self

   The other common technique is to modify BGP-NH to "self" (a local IP
   address, typically a loopback) when the BR advertises an externally
   learned path into IBGP.  This technique reduces the number of paths
   per prefix while keeping optimal forwarding - least cost and ECMP -
   in the failure case discussed above (e.g., PR_2.1 node failure).  In
   fact, because the BGP-NH addresses seen by other BGP speakers do not
   change in response to external failure events, and remain resolvable
   by the IGP, there is no need to reprogram the Forwarding Information
   Base (FIB) at all.  Unfortunately, other failures, such as loss of
   all connectivity between a single BR (say BR_1.1) and a peer AS (all
   PRs in AS2), would not be handled quickly.  As the BGP-NH advertised
   by BR_1.1 is not changed and is reachable by the IGP, BGP speakers in
   AS1 (BRs, CRs) will keep BR_1.1 as a feasible exit point until they
   receive BGP withdrawals on a prefix-by-prefix basis.  This is a
   global convergence process that at high scale can take minutes,
   during which time packets may be discarded or loop.

2.  The BGP Abstract Next-Hop

   The Abstract Next Hop (ANH) concept presented below does not require
   any changes to the BGP protocol itself.  It is an architectural
   approach to network configuration that uses existing protocol
   capabilities to achieve higher scale and faster routing convergence
   where scale-out peering sites exist.

   When a BGP speaker advertises a path to its IBGP peer, it modifies
   the Protocol Next-Hop to be the ANH value.  The ANH is just an IP
   address that identifies the BGP session or a set of BGP sessions.
   The set of BGP sessions is defined by the operator in local
   configuration, according to network design needs.  For example, an
   ANH might identify:

   o  a set of BGP sessions with the same peer AS and handled by a given
      single ASBR

   o  a set of BGP sessions with the same peer AS and handled by one or
      more ASBRs at a given site

   o  a set of BGP sessions with any upstream provider AS

   o  a set of BGP sessions with a given peer device and handled by one
      or more ASBRs of the local AS

   A host route to the ANH is installed in the relevant RIB and
   redistributed into the IGP.  BGP maintains the ANH host route based
   on the state of the associated group of BGP sessions:

   o  As soon as all BGP sessions in the set go down, the ANH route is
      removed.

   o  When at least one BGP session in the set comes up, the ANH
      route is created only after initial route convergence is complete
      for the peer (End-of-RIB (EoR) [RFC4724] is received).

   Taken together, these procedures ensure that as soon as the final
   session in the set goes down, ingress routers will see the associated
   ANH withdrawn from the IGP.  Since the ANH is used to resolve the
   associated BGP next hops, the ingress routers are triggered to
   converge to send traffic to their alternate (new best) route.  They
   also ensure that as soon as one session in the set comes up and is
   synchronized (that is, the EoR is received), ingress routers will see
   the ANH advertised in the IGP and will be able to reconverge to use
   routes that are associated with that next hop.
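
   The procedure above can be sketched as a small state tracker.  This
   is a minimal illustration, not an implementation: the class name
   AnhTracker and its methods are hypothetical, and it adopts the
   interpretation (made explicit in Section 3) that the ANH host route
   is present if, and only if, at least one session in the set is both
   up and has delivered its End-of-RIB marker.  How the route is then
   installed in the RIB and exported to the IGP is outside the sketch.

```python
class AnhTracker:
    """Tracks whether the host route for one ANH should be in the IGP."""

    def __init__(self, anh, sessions):
        self.anh = anh
        # Per-session flags: is it ESTABLISHED, and has EoR been received?
        self._state = {s: {"up": False, "eor": False} for s in sessions}

    def session_up(self, session):
        # A new session starts unsynchronized; EoR is still pending.
        self._state[session] = {"up": True, "eor": False}

    def eor_received(self, session):
        # EoR only counts on a session that is currently up.
        if self._state[session]["up"]:
            self._state[session]["eor"] = True

    def session_down(self, session):
        self._state[session] = {"up": False, "eor": False}

    @property
    def advertised(self):
        """True if the ANH host route should currently be advertised."""
        return any(s["up"] and s["eor"] for s in self._state.values())


# Hypothetical ANH address and session names, for illustration only.
tracker = AnhTracker("192.0.2.1", ["PR_2.1", "PR_2.2"])
tracker.session_up("PR_2.1")       # up, but not yet synchronized
tracker.eor_received("PR_2.1")     # now tracker.advertised is True
```

   Keying the decision on EoR rather than on session establishment
   alone avoids attracting traffic to an ASBR that has not yet learned
   the peer's routes.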

   The ANH can be any IP address that the router is eligible to
   advertise according to the local network's IP address management
   scheme.  More details are given in Section 3.3.

3.  Use of Abstract Next-Hop in scale-out peering design

   In traditional configurations as described in Section 1.2 the meaning
   of the BGP-NH is either:

   o  An egress interface in the case of next-hop-unchanged
      configuration, or

   o  An egress ASBR in the case of next-hop-self configuration.

   The meaning of Abstract Next Hop is more context-dependent.  This
   document describes network configurations when the BGP-NH identifies:

   a.  An (egress ASBR, peer AS) pair.  The ANH should be advertised
       into the IGP if, and only if, the given egress ASBR has at least
       one EBGP session in the ESTABLISHED state with the given peer AS,
       and the EoR marker has been received on that session.  We call
       this the ASBR-Peer AS Abstract Next Hop (AP-ANH).

   b.  An (egress site in local AS, peer AS) pair, where a "site" may
       include multiple ASBRs.  The ANH should be advertised into the
       IGP if, and only if, at least one ASBR of the given site has at
       least one EBGP session in the ESTABLISHED state with the given
       peer AS, and the EoR marker has been received on this session.
       We call this the Site-Peer AS Abstract Next Hop (SP-ANH).

   Note that reachability of the ANH address in the IGP depends on EBGP
   session state and not inter-AS interface state, although of course,
   interface state may impact session state.  How the IP route to the
   ANH address is instantiated on an ASBR and inserted into the IGP on
   a particular device is a matter of local implementation.

3.1.  Egress ASBR-Peer AS Abstract Next Hop (AP-ANH)

   The AP-ANH is unique to an ASBR and its peer AS.  For example, in the
   network of Figure 2, BR_1.1 would have two AP-ANHs assigned - one for
   its peering with AS2 and the other for AS3.  Similarly, BR_1.2 would
   have two AP-ANHs, one per peer AS, with values different from the
   AP-ANHs of BR_1.1, and so on.  All AP-ANHs are exported into the IGP
   by their ASBRs.  Each ASBR advertises only one path per prefix to its
   RR, with the BGP-NH set to the appropriate AP-ANH.  The RR will
   propagate it through the entire AS by means of IBGP ADD-PATH.  As a
   consequence, the number of paths learned per prefix is equal to the
   number of ASBRs servicing a given peer AS.  In the network of
   Figure 2, for AS2 prefixes, this would be N+N' (from site 1 + from
   site 2) paths per prefix.  This sets the scale requirements of this
   solution to be on par with Next-Hop-Self (Section 1.2.2).  However,
   thanks to the properties of ANH, more failures are covered by prefix-
   independent techniques, as withdrawal of the ANH from the IGP makes
   the BGP-NH unresolvable.
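
   To illustrate why withdrawal of an AP-ANH invalidates paths without
   any per-prefix BGP update, consider the following sketch.  The
   addresses, prefix labels, and the function name resolve are
   hypothetical.  The sketch recomputes next hops prefix by prefix for
   readability; real forwarding planes typically share the next-hop
   indirection among prefixes, which is what makes the switchover
   independent of prefix scale.

```python
def resolve(bgp_paths, igp_routes):
    """Return, per prefix, the BGP next hops that resolve in the IGP."""
    return {
        prefix: [nh for nh in next_hops if nh in igp_routes]
        for prefix, next_hops in bgp_paths.items()
    }


# Hypothetical AP-ANH host addresses for (BR_1.1, AS2) and (BR_1.2, AS2).
AP_ANH_BR11 = "192.0.2.11"
AP_ANH_BR12 = "192.0.2.12"

# The BGP RIB: every AS2 prefix has one path per ASBR (placeholder names).
bgp_paths = {f"prefix-{i}": [AP_ANH_BR11, AP_ANH_BR12] for i in range(1000)}
igp_routes = {AP_ANH_BR11, AP_ANH_BR12}

before = resolve(bgp_paths, igp_routes)

# BR_1.1 loses all EBGP sessions with AS2: one IGP withdrawal, while
# the BGP RIB contents stay untouched.
igp_routes.discard(AP_ANH_BR11)
after = resolve(bgp_paths, igp_routes)
```

   After the single IGP change, every prefix resolves only via the
   AP-ANH of BR_1.2, although bgp_paths itself was never modified.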

   Provided that all ASBRs in a given site (site1 in Figure 2) receive
   the same routing information from their peer AS (AS2), in non-faulty
   conditions, one could consider setting the ANH value on all ASBRs the
   same.  However, failure(s) can create situations when multiple ASBRs
   will have a session in ESTABLISHED state with a given peer AS, but
   some prefixes would be learned from EBGP only on a subset of these
   ASBRs.  To prevent problems from arising in this situation, the per-
   ASBR AP-ANH needs to be advertised into the IGP and ASBRs need to set
   it as the BGP-NH when advertising routes to the site's Route
   Reflectors.  However, for IBGP path advertisements propagated
   beyond the site (into the RR mesh), the BGP-NH may be replaced by
   another ANH value, the Site-Peer AS ANH.

3.2.  The Site-Peer AS Abstract Next Hop (SP-ANH)

   The AP-ANH works on an ASBR level.  From a given local AS
   perspective, the number of ANHs is proportional to the number of
   pairs of ASBRs and ASes each of them peers with.  With hundreds of
   peer ASes, tens of sites and ~10 ASBRs per site, the number of
   AP-ANHs may scale into the thousands.  At the same time, it may not
   be necessary or even desirable for every BGP speaker in the network
   to have visibility of every path down to individual egress ASBR
   granularity.  With symmetrical multiplane backbone and/or leaf-spine
   designs, it is sufficient that BGP speakers on other sites have
   information that a given site (site1 in Figure 2) has at least one
   ASBR with an ESTABLISHED session to the peer AS (AS2).  For example,
   in the network of Figure 2, even if BR_3.1 has only one path with its
   BGP-NH equal to the ANH of BR_1.1, BR_3.1 resolves the BGP-NH in the
   IGP and spreads traffic among all CRs on site 3.  Thus, traffic will
   be delivered to CR_1.x at site 1.  As long as CR_1.x has visibility
   of all paths, traffic will be distributed equally to all site 1
   ASBRs.

   At the same time, when multiple paths are available on BGP speakers,
   every change is propagated, with consequent transmission and
   processing costs on all BGP speakers across the network.  This will
   be true even if the route change doesn't impact the forwarding plane.
   For example, in the network of Figure 2, even if BR_3.1 has N paths
   with BGP-NHs set to the ANHs of BR_1.1 through BR_1.N, BR_3.1 will
   resolve those BGP-NHs in the IGP and spread traffic among all CRs of
   site 3.  When one of the egress ASBRs (say BR_1.2) loses its
   connectivity to the peer AS, the affected BGP routes (those with
   BGP-NH equal to the AP-ANH of BR_1.2) are withdrawn from all BGP
   speakers (e.g., BR_3.1) of the network.  All BGP speakers perform
   path selection and possibly update their forwarding data structures.
   Since the actual forwarding paths do not change, all this work
   represents unnecessary churn.

   To avoid the above drawbacks, the RR of a given site (site1 in
   Figure 2), when re-advertising a BGP path learned from its ASBR
   client, modifies the BGP-NH to another abstract value - the Site-Peer
   AS Abstract NH (SP-ANH).  This value is unique per (site, peer AS)
   pair, and is shared by all RRs of a given site.  With this
   modification, it is sufficient that inter-site IBGP sessions carry
   only one path per prefix (no ADD-PATH needed).  Consequently, BGP RIB
   scale is reduced significantly.  This frees up memory, reduces the
   amount of data RRs need to exchange, and mitigates churn.  The BGP
   speakers in other sites of AS 1 need to resolve the SP-ANH in order
   to build their local FIBs.  Therefore, the SP-ANH has to be present
   in the IGP: some router(s) in the local site (RR, ASBR or CR) need to
   inject it into the IGP.  The choice of which role is responsible for
   SP-ANH injection is discussed below; in any case, the SP-ANH should
   be reachable in the IGP if, and only if, at least one AP-ANH (for the
   same peer AS and an ASBR belonging to the given site) is reachable.
   Figure 3 illustrates routing information flow in a
   network such as that of Figure 2:

                       +------------------------------------------------
                       |                            +----->IBGP to SITE2
                       |                  AS 1      | +--->IBGP to SITE3
   /=============================\                  | |
   |a.a.a.a/a                    |----------------->| |  SP-ANH
   |  as-path   "^2 .*"          |                  | |      (SITE1&AS2)
   |  BGP-NH    SP-ANH(SITE1&AS2)|                  | |  IP/32 into IGP
   \=============================/                  | |            ^
                       |                            | |            |
                       |  +-------------------------+-+------------+---+
   /==============================\    o------o   o-+-+--o             |
   |ADD-PATH                      |    |RR_1.2|   |RR_1.1|  SITE1      |
   |a.a.a.a/a                     |    o------O   o----X-O             |
   |  as-path   "^2 .*"           |                ^ ^  \              |
   |  BGP-NH    AP-ANH(BR_1.1&AS2)|               / /    \             |
   |a.a.a.a/a                     |--------------X-X---->|             |
   |  as-path   "^2 .*"           |             /  |     |             |
   |  BGP-NH    AP-ANH(BR_1.2&AS2)|            /   |     |             |
   \==============================/           /    |     |             |
   /==============================\          /     |     \             |
   |a.a.a.a/a                     |          |     |      \            |
   |  as-path   "^2 .*"           |--------->/     |       v           |
   |  BGP-NH    AP-ANH(BR_1.1&AS2)|         /      |   +------+        |
   \==============================/        /       |   |CR_1.1+--+     |
   /==============================\       /        /   +--+---+.1+-+   |
   |a.a.a.a/a                     |------X------->/       +-+----+X|   |
   |  as-path   "^2 .*"           |     /        /          +------+   |
   |  BGP-NH    AP-ANH(BR_1.2&AS2)| +------+ +------+       +------+   |
   \==============================/ |BR_1.1| |BR_1.2|-  -  -|BR_1.N|   |
                       |  |         +------+ +------+       +------+   |
                       |  |           ^  ^                             |
                       |  |            \  \                            |
                       |  +-------------X--X---------------------------+
   /======================\--------------X--X---------------------------
   |a.a.a.a/a             |               \  \
   |  as-path   "^2 .*"   |--------------->\  \---------\
   \======================/                 \            \
   /======================\                  \            \
   |a.a.a.a/a             |-------------------X----------->\
   |  as-path   "^2 .*"   |----------------+ +-X------------X-----------
   \======================/                | | +X-----+   +--X---+     +
                       |        AS 3       | | |PR_2.1|   |PR_2.2|- - -|
                       |                   | | +------+   +------+     +
                       |                   | |        AS 2
                       +-------------------+ +----a.a.a.a/a network-----


                                 Figure 3



3.3.  Assignment of Abstract Next Hops

   In the following subsections we provide more details of how abstract
   next hops can be injected in several different common network
   architectures.

3.3.1.  Native IP Networks

   In a native IP network every router, including core routers, has full
   BGP routing information and forwards each packet based on a
   destination IP lookup.  Provided that all routers at an egress site
   receive multiple paths with BGP-NH set to AP-ANH (and not SP-ANH),
   the operator is free to choose which node (RR, ASBR or CR) injects
   the SP-ANH route into the IGP.  One may argue that injection of
   SP-ANH by ASBRs is simpler, as it uses the same procedure and policy
   as injection of AP-ANH.  Others may prefer injection at the RR, as it
   limits the number of configuration touch-points.

3.3.2.  MPLS

3.3.2.1.  Identical BGP address space and paths received on all ASBRs

   In an MPLS network, since traffic is carried over LSP tunnels, the
   SP-ANH needs to be injected into the IGP by a node that has the
   ability to perform an IP lookup.  This eliminates the RR, and
   possibly the CRs (in "BGP-free core" architectures).  Instead, all
   ASBRs are used to insert SP-ANH addresses into the IGP.  In LDP-based
   networks this is sufficient: the CR will create an ECMP forwarding
   structure for labels of the SP-ANH FEC coming from other sites.  In
   RSVP-TE based networks, ECMP needs to happen on the ingress LSR;
   therefore every BGP speaker needs to establish an LSP to every ASBR,
   and the SP-ANH address needs to be part of the FEC for its respective
   LSP.  If the SP-ANH is used as an RSVP (signaling) destination, some
   other means (such as affinity groups) need to be used to ensure the
   desired 1:1 mapping of LSPs to egress ASBRs.

3.3.2.2.  Different address space sets or paths received on different
          ASBRs

   In the case when the set of prefixes received from a given peer AS by
   one ASBR is different from the set received by another one, a
   combination of SP-ANH and MPLS-based load balancing on a CR may lead
   to a situation where an IP packet will be directed to an ASBR that
   lacks external routing information and hence can't forward traffic
   directly out of the AS.  Similarly, if path attributes for a given
   prefix received by one ASBR are different from those received by
   another, again packets can be directed to the "wrong" ASBR.  In this
   case the ASBR would use the IBGP route it learned from another ASBR
   of the same site (via RR, with AP-ANH) and forward traffic over an
   LSP to the "correct" ASBR.  This extra hop constitutes a sub-optimal
   traffic path through the network.

   For example, in the network of Figure 2, let's assume that prefix P2
   is advertised by AS2 to BR1.2-BR1.N but not to BR1.1.  BR3.1 has a
   BGP best route to P2 with its BGP-NH set to the SP-ANH of (site1,
   AS2).  It resolves this next hop by ECMP over N MPLS LSPs terminating
   on BR1.1-BR1.N.  So, some packets are forwarded by BR3.1 over an LSP
   that runs via CR1.x and terminates on BR1.1.  BR1.1 has no external
   route to P2, but it has (N-1) IBGP routes to P2 with BGP-NHs equal to
   the AP-ANHs of BR1.2-BR1.N.  Therefore BR1.1 performs an IP lookup
   and forwards the packet over LSPs that run via CR1.x and terminate on
   BR1.2-BR1.N.  Traffic is U-turned on BR1.1 and traverses the CRs at
   site 1 twice.
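
   The U-turn in this example can be reproduced with a small model.  The
   sketch below is illustrative only: N = 4, the prefix name P1, and all
   helper names are assumptions, and the IBGP fallback simply picks the
   first eligible ASBR where a real router would load-balance.

```python
# Illustrative model of the P2 example; N and all helper names are assumed.
N = 4
ASBRS = [f"BR1.{i}" for i in range(1, N + 1)]

# Prefixes each site-1 ASBR learned over EBGP from AS2.
# P2 is advertised to BR1.2-BR1.N but not to BR1.1 (P1 goes to all).
external = {br: {"P1", "P2"} for br in ASBRS}
external["BR1.1"].discard("P2")


def asbr_hops(prefix, first_asbr):
    """Return the site-1 ASBRs a packet visits before leaving AS1."""
    if prefix in external[first_asbr]:
        return [first_asbr]  # forwarded directly to the peer AS
    # No external route: use an IBGP path whose BGP-NH is the AP-ANH of an
    # ASBR that does hold the external route (first one, for simplicity).
    others = [br for br in ASBRS if prefix in external[br]]
    if not others:
        raise LookupError(f"no ASBR at the site has a route to {prefix}")
    return [first_asbr, others[0]]


# BR3.1 resolves the SP-ANH by ECMP over N LSPs, one per ASBR.
paths = {br: asbr_hops("P2", br) for br in ASBRS}
u_turned = [br for br, hops in paths.items() if len(hops) > 1]
u_turn_share = len(u_turned) / N  # fraction of LSPs that end in a U-turn
```

   Under the further assumption that flows hash evenly over the N LSPs,
   1/N of the traffic toward P2 lands on BR1.1 and is redirected.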

   Such asymmetry may be considered acceptable by the provider, as long
   as it's a transient condition.  However, in the general case such a
   situation could be persistent, as the result of intentional
   configuration on the peer AS's ASBRs.  Therefore the better solution
   would be to insert the SP-ANH into the IGP on CRs.  In this case, CRs
   need to perform forwarding based on destination IP lookup.  Therefore
   CRs would have to be able to learn and handle large IP routing and
   forwarding tables - at least all prefixes learned from peer ASes by
   the local ASBRs.

3.3.3.  SPRING

3.3.3.1.  Identical BGP address space and paths received on all ASBRs

   For SPRING based networks, we can take advantage of the unique
   capability of the Anycast-SID [RFC8402].  The ASBRs of a single site
   allocate an Anycast-SID for each SP-ANH address.  An ingress BGP
   speaker can use this SID as the only SID.  Alternatively, if a
   TE-routed path is desired, the TE controller can, depending on the TE
   constraints, provision a SPRING path with the Anycast-SID at the end,
   instructing the CR to perform load balancing among the connected
   ASBRs.
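
   The two options can be sketched as follows.  The SID values and the
   function name are hypothetical and chosen only for illustration.

```python
# Hypothetical SID value; real values come from the operator's SID plan.
ANYCAST_SID_SITE1_AS2 = 16101  # Anycast-SID for SP-ANH(site1, AS2)


def build_sid_list(anycast_sid, te_segments=()):
    """Return the segment list an ingress BGP speaker would impose.

    With no TE segments the Anycast-SID is the only SID.  Otherwise the
    TE-computed segments come first and the Anycast-SID ends the list,
    leaving the final choice among the ASBRs to the CR.
    """
    return [*te_segments, anycast_sid]


shortest_path = build_sid_list(ANYCAST_SID_SITE1_AS2)
te_path = build_sid_list(ANYCAST_SID_SITE1_AS2, te_segments=[16011, 16023])
```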

3.3.3.2.  Different address space sets or paths received on different
          ASBRs

   As in a classic MPLS environment, such a situation may lead to
   suboptimal routing (redirecting from one ASBR to another), or may
   require the CR (instead of the ASBR) to insert the SP-ANH into the
   IGP and generate a Prefix-SID (or Anycast-SID if there is more than
   one CR) for it.






4.  Worked Examples

   Below we illustrate the proposal by working through its operation
   under several different types of failures.  Here, we assume that each
   ASBR in a given site of the local AS (site 1 of AS1 in Figure 2) that
   has an EBGP session with the given peer AS (AS2 in Figure 2) receives
   from its peer routers (PR2.x) routes to exactly the same address
   space on each session.

4.1.  Failure of a proper subset of EBGP sessions with a given peer AS
      on a single ASBR

   o  The impacted ASBR keeps advertising the AP-ANH into the IGP, as at
      least one session to the peer AS remains in the ESTABLISHED state.

   o  The impacted ASBR may send UPDATEs to RRs; however, the BGP-NH
      remains the same and equal to the pre-failure AP-ANH.

   o  The RRs may send UPDATEs to their clients (CRs, BRs) and to RRs in
      other sites; however, the BGP-NH remains the same as its
      pre-failure value: AP-ANH and SP-ANH respectively.

   o  As the BGP-NH does not change, there are no changes in forwarding
      data structures (FIB) on any BGP speaker across the network,
      except possibly the ASBR that holds the impacted session.

4.2.  Failure of a proper subset of EBGP sessions with a given peer AS
      on each ASBR of a given site

   o  The impacted ASBRs keep advertising the AP-ANH into the IGP, as at
      least one session to the peer AS remains in the ESTABLISHED state
      on each ASBR.

   o  The impacted ASBRs may send UPDATEs to RRs; however, the BGP-NH
      remains the same and equal to the pre-failure AP-ANH.

   o  The RRs may send UPDATEs to their clients (CRs, BRs) and to RRs in
      other sites; however, the BGP-NH remains the same and equal to its
      pre-failure value: AP-ANH and SP-ANH respectively.

   o  As the BGP-NH does not change, there are no changes in forwarding
      data structures (FIB) on any BGP speaker across the network,
      except possibly the ASBRs that hold the impacted sessions.
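
   The rule shared by Sections 4.1 and 4.2 can be written as a predicate
   over an ASBR's EBGP session states.  This is a minimal sketch; the
   function name, the state strings, and the assumption that the ASBR
   has one session to each of PR2.1 and PR2.2 are illustrative.

```python
def advertises_ap_anh(session_states):
    """True while at least one EBGP session to the peer AS is ESTABLISHED."""
    return any(state == "ESTABLISHED" for state in session_states.values())


# Assumed: the ASBR has two sessions to AS2, one per peer router.
before = {"PR2.1": "ESTABLISHED", "PR2.2": "ESTABLISHED"}
after_partial = {"PR2.1": "IDLE", "PR2.2": "ESTABLISHED"}  # proper subset fails
after_total = {"PR2.1": "IDLE", "PR2.2": "IDLE"}           # all sessions fail

# After a partial failure the AP-ANH stays in the IGP, so the BGP-NH seen
# by other speakers, and therefore their FIBs, do not change.
still_advertised = advertises_ap_anh(after_partial)
```

   Only when the last session is lost (after_total above) does the
   predicate become false and the AP-ANH leave the IGP.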








4.3.  Failure of all EBGP sessions with a given peer AS on a single
      ASBR; Failure of a single ASBR

   o  The impacted ASBR stops advertising the AP-ANH into the IGP, as it
      has lost all sessions with the given peer AS.

   o  The SP-ANH is kept reachable in the IGP.

   o  All other BGP speakers at the impacted site invalidate all paths
      with BGP-NH equal to the AP-ANH.  This may trigger prefix-
      independent FIB data-structure patching/temporary fixing for sub-
      second traffic restoration.

   o  The impacted ASBR sends WITHDRAWs to its RRs.

   o  Each RR:

      *  Sends WITHDRAWs to its clients at the local site (CRs, BRs) for
         paths from the impacted ASBR.  As these sessions support ADD-
         PATH, paths from other ASBRs will remain.  Other BGP speakers
         at this site have to modify their FIBs.

      *  May send UPDATEs to RRs in other sites; however, the BGP-NH
         remains the same, equal to the pre-failure SP-ANH.  As the
         BGP-NH does not change, there are no changes in forwarding data
         structures (FIB) on any BGP speaker across the network, except
         those at the impacted site.

   o  Routing churn is in many cases confined to a single peering site
      and does not propagate across the network.  FIB changes are
      likewise limited to that peering site.
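
   The difference between the impacted site and the rest of the network
   can be sketched by resolving BGP next hops against the set of
   addresses reachable in the IGP.  The sketch assumes two ASBRs at site
   1 and that the SP-ANH stays in the IGP while at least one AP-ANH
   does; all function and variable names are illustrative.

```python
def igp_reachable(asbr_has_session):
    """Abstract next hops present in the IGP, given per-ASBR session state."""
    reachable = {
        f"AP-ANH({br}&AS2)" for br, up in asbr_has_session.items() if up
    }
    if reachable:  # assumed: SP-ANH is present while any AP-ANH is
        reachable.add("SP-ANH(SITE1&AS2)")
    return reachable


def valid_paths(paths, reachable):
    """Keep only the paths whose BGP-NH still resolves in the IGP."""
    return [nh for nh in paths if nh in reachable]


# Paths for one prefix: ADD-PATH inside site 1, one SP-ANH path elsewhere.
local_paths = ["AP-ANH(BR_1.1&AS2)", "AP-ANH(BR_1.2&AS2)"]
remote_paths = ["SP-ANH(SITE1&AS2)"]

# BR_1.1 loses all its EBGP sessions with AS2; BR_1.2 is unaffected.
reachable = igp_reachable({"BR_1.1": False, "BR_1.2": True})
local_after = valid_paths(local_paths, reachable)    # one path invalidated
remote_after = valid_paths(remote_paths, reachable)  # unchanged
```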

4.4.  Failure of all EBGP sessions with a given peer AS on all ASBRs of
      a given site

   o  Each ASBR stops advertising its AP-ANH into the IGP, as it has
      lost all sessions with the given peer AS.

   o  The SP-ANH is no longer reachable in the IGP, as none of the
      AP-ANHs is reachable.

   o  All other BGP speakers across the network invalidate all paths
      with a BGP-NH equal to the removed AP-ANH or SP-ANH.  This may
      trigger prefix-independent FIB data-structure patching/temporary
      fixing for sub-second traffic restoration.

   o  Each impacted ASBR sends WITHDRAWs to its RRs.

   o  The RRs send WITHDRAWs to their clients at the local site (CRs,
      BRs) and RRs in other sites for paths from the impacted ASBRs.  As
      these sessions support ADD-PATH, paths from ASBRs at other sites
      will remain.  The BGP speakers across the network may need to
      modify their FIBs.
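
   The prefix-independent repair mentioned above relies on many prefixes
   sharing one forwarding object per set of BGP next hops.  The
   following simplified sketch is not taken from
   [I-D.ietf-rtgwg-bgp-pic]; the class, the existence of a second site
   peering with AS2, and the prefix names are assumptions made for
   illustration.

```python
class NextHopGroup:
    """Shared forwarding object: BGP next hops in order of preference."""

    def __init__(self, next_hops):
        self.next_hops = list(next_hops)
        self.unreachable = set()

    def mark_unreachable(self, next_hop):
        """Called once when the IGP route to next_hop disappears."""
        self.unreachable.add(next_hop)

    def active(self):
        """First next hop that is still reachable, or None."""
        for nh in self.next_hops:
            if nh not in self.unreachable:
                return nh
        return None


# Assumed: AS2 is also reachable through a second site.
group = NextHopGroup(["SP-ANH(SITE1&AS2)", "SP-ANH(SITE2&AS2)"])
fib = {f"prefix-{i}": group for i in range(1000)}  # all share one object

before = fib["prefix-0"].active()
group.mark_unreachable("SP-ANH(SITE1&AS2)")  # one update for every prefix
after = {entry.active() for entry in fib.values()}
```

   A single update to the shared object redirects every prefix at once,
   before the per-prefix WITHDRAWs have been processed.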

5.  Acknowledgements

   Valuable comments and suggestions on the solution covered by this
   document were provided by Mannan Venkatesan, John Scudder and Ron
   Bonica.  Special thanks to John Scudder, who also helped with
   editorial changes.

6.  IANA Considerations

   This memo includes no request to IANA.

7.  Security Considerations

   Since this is a deployment architecture and not a protocol
   modification, it doesn't introduce any new issues to the BGP protocol
   itself.  General BGP security considerations are discussed in
   [RFC4271] and [RFC4272]; BGP deployment best practices are documented
   in [RFC7454], and nothing in this proposal impedes their use.  Many
   of the practices recommended in that document are self-evidently
   still applicable, for example the use of cryptographic session
   protection methods such as TCP MD5 [RFC2385] or the TCP
   Authentication Option [RFC5925], and the Generalized TTL Security
   Mechanism [RFC5082].  Since we propose a novel use of IP addresses to
   assign ANHs, it's worth considering whether anything new is required
   to protect them.  We conclude that nothing is: they fall into the
   existing category of "Prefixes Belonging to the Local AS" discussed
   in Section 6.1.4 of [RFC7454].

8.  Informative References

   [I-D.ietf-rtgwg-bgp-pic]
              Bashandy, A., Filsfils, C., and P. Mohapatra, "BGP Prefix
              Independent Convergence", draft-ietf-rtgwg-bgp-pic-08
              (work in progress), September 2018.

   [RFC2385]  Heffernan, A., "Protection of BGP Sessions via the TCP MD5
              Signature Option", RFC 2385, DOI 10.17487/RFC2385, August
              1998, <https://www.rfc-editor.org/info/rfc2385>.







   [RFC4271]  Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A
              Border Gateway Protocol 4 (BGP-4)", RFC 4271,
              DOI 10.17487/RFC4271, January 2006,
              <https://www.rfc-editor.org/info/rfc4271>.

   [RFC4272]  Murphy, S., "BGP Security Vulnerabilities Analysis",
              RFC 4272, DOI 10.17487/RFC4272, January 2006,
              <https://www.rfc-editor.org/info/rfc4272>.

   [RFC4456]  Bates, T., Chen, E., and R. Chandra, "BGP Route
              Reflection: An Alternative to Full Mesh Internal BGP
              (IBGP)", RFC 4456, DOI 10.17487/RFC4456, April 2006,
              <https://www.rfc-editor.org/info/rfc4456>.

   [RFC4724]  Sangli, S., Chen, E., Fernando, R., Scudder, J., and Y.
              Rekhter, "Graceful Restart Mechanism for BGP", RFC 4724,
              DOI 10.17487/RFC4724, January 2007,
              <https://www.rfc-editor.org/info/rfc4724>.

   [RFC5065]  Traina, P., McPherson, D., and J. Scudder, "Autonomous
              System Confederations for BGP", RFC 5065,
              DOI 10.17487/RFC5065, August 2007,
              <https://www.rfc-editor.org/info/rfc5065>.

   [RFC5082]  Gill, V., Heasley, J., Meyer, D., Savola, P., Ed., and C.
              Pignataro, "The Generalized TTL Security Mechanism
              (GTSM)", RFC 5082, DOI 10.17487/RFC5082, October 2007,
              <https://www.rfc-editor.org/info/rfc5082>.

   [RFC5925]  Touch, J., Mankin, A., and R. Bonica, "The TCP
              Authentication Option", RFC 5925, DOI 10.17487/RFC5925,
              June 2010, <https://www.rfc-editor.org/info/rfc5925>.

   [RFC7454]  Durand, J., Pepelnjak, I., and G. Doering, "BGP Operations
              and Security", BCP 194, RFC 7454, DOI 10.17487/RFC7454,
              February 2015, <https://www.rfc-editor.org/info/rfc7454>.

   [RFC7911]  Walton, D., Retana, A., Chen, E., and J. Scudder,
              "Advertisement of Multiple Paths in BGP", RFC 7911,
              DOI 10.17487/RFC7911, July 2016,
              <https://www.rfc-editor.org/info/rfc7911>.

   [RFC8402]  Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L.,
              Decraene, B., Litkowski, S., and R. Shakir, "Segment
              Routing Architecture", RFC 8402, DOI 10.17487/RFC8402,
              July 2018, <https://www.rfc-editor.org/info/rfc8402>.





Authors' Addresses

   Rafal Jan Szarecki (editor)
   Juniper Networks Inc.
   1133 Innovation Way
   Sunnyvale, CA  94089
   US

   Phone: +1(408)680-9604
   Email: rafal@juniper.net


   Kaliraj Vairavakkalai
   Juniper Networks Inc.
   1133 Innovation Way
   Sunnyvale, CA  94089
   US

   Phone: +1(408)936-8872
   Email: kaliraj@juniper.net


   Natrajan Venkataraman
   Juniper Networks Inc.
   1133 Innovation Way
   Sunnyvale, CA  94089
   US

   Phone: +1(408)936-6597
   Email: natv@juniper.net