IDR J. Heitz
Internet-Draft D. Rao
Intended status: Standards Track Cisco
Expires: April 25, 2019 October 22, 2018
Aggregating BGP routes in Massive Scale Data Centers
draft-heitz-idr-msdc-bgp-aggregation-00
Abstract
A design for a fabric of switches to connect up to one million
servers in a data center is described. At that scale, it is
impractical for every switch to maintain knowledge about every other
switch and every other link in the fabric. Aggregation of routes is
an excellent way to scale such a fabric. However, aggregation
presents some problems under link failures or switch failures. This
design solves those problems.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 25, 2019.
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
Heitz & Rao Expires April 25, 2019 [Page 1]
Internet-Draft MSDC BGP Aggregation October 2018
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Solution Overview
3. Problems with negative routes
4. Use of a negative route in BGP
5. Implementation Notes to Reduce CPU Time Consumption
6. Smooth Startup and Avoidance of Too Many Negative Routes
7. Avoidance of Transients
8. Configuration
9. South Triggered Automatic Disaggregation (STAD)
10. Configuration for STAD
11. Security Considerations
12. IANA Considerations
13. Acknowledgements
14. References
14.1. Normative References
14.2. Informative References
Authors' Addresses
1. Introduction
[RFC7938] defines a massive scale data center as one that contains
over one hundred thousand servers. It describes the advantages of
using BGP as a routing protocol in a Clos switching fabric that
connects these servers. It laments the need to announce all routes
individually, because of the problems associated with route
aggregation. A fabric design that scales to one million servers is
considered enough for the foreseeable future and is the design goal of
this document. Of course, the design should also work for smaller
fabrics.
A switch fabric to connect one million servers will consist of
between 35000 and 130000 switches and 1.5 million to 8 million links,
depending on how redundantly the servers are connected to the fabric
and the level of oversubscription in the fabric. A switch that needs
to store, send and operate on hundreds of routes is clearly cheaper
than one that needs to store, send and operate on millions of links.
A switch running BGP and aggregating its routes needs to send only
one route. In the ideal case, each switch receives just one route
from each of its neighbors. For each link or neighbor that fails,
the switch should send just one extra route. No single link failure
needs to be known by every switch in the fabric and some switch
failures do not need to be known by every switch either. The routes
that advertise these failures should only propagate to those switches
that need to know about them. During normal operation, the number of
failures is small, so the number of advertisements is small as well.
A route that advertises a failure is called a negative route.
Negative routes are not a new idea, but they are unpopular, because
they cause a number of problems. This document solves the problems.
2. Solution Overview
In a Clos network all northbound links can reach all destinations and
there is typically only one or very few southbound links to reach any
specific destination. Therefore, traffic from source to destination
is spread to all available northbound links, reaches all the spines
and then concentrates southbound towards its destination. When a
link fails, then a spine will lose connectivity to some southbound
destinations. That means any northbound link to that spine also
loses connectivity to the same destinations.
When the fabric is fully connected with no failed links, then the
forwarding tables in the switches can simply contain multipath
aggregate routes to all the northbound links. Each of the multipath
routes is the same, so traffic is spread out smoothly among these
routes. As soon as a link fails, the forwarding tables must exclude
the resultant unreachable destinations from some of the northbound
links. The way to do that is to add specific routes for the failed
destinations to point at the remaining links that can reach those
destinations. Since traffic will always prefer specific routes to
aggregate routes, the traffic to the failed destinations will no
longer take the aggregate routes.
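The effect of adding a specific route next to an aggregate can be
illustrated with a small longest-prefix-match lookup. This is a
minimal sketch, not part of the design: the prefixes, the link names
(north1, north2, north3) and the lookup() helper are all hypothetical.

```python
import ipaddress

# Hypothetical forwarding table: prefix -> set of northbound links.
fib = {
    ipaddress.ip_network("10.0.0.0/8"): {"north1", "north2", "north3"},
}


def lookup(fib, address):
    """Return the links of the longest prefix that covers address."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in fib if addr in net]
    if not matches:
        return set()
    best = max(matches, key=lambda net: net.prefixlen)
    return fib[best]


# With no failures, every destination uses all northbound links.
before = lookup(fib, "10.1.2.3")

# Assume 10.1.2.0/24 can no longer be reached through north2. A
# specific route pointing at the remaining links is installed.
fib[ipaddress.ip_network("10.1.2.0/24")] = {"north1", "north3"}

after_failed = lookup(fib, "10.1.2.3")  # follows the specific route
after_other = lookup(fib, "10.9.9.9")   # still follows the aggregate
```

Only traffic to the affected destinations changes its set of links;
everything else continues to use the full multipath aggregate.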
Two methods to create these specific routes are described. One way
is to send a negative route from the point where the failure is
detected. Receivers use the negative route to punch holes out of the
aggregate routes and create the specific routes by subtracting the
negative route from the aggregates. This method is described
starting at section 4. The other method creates the specific routes
at the point of the failure and announces them in BGP. This method
is described starting at section 9.
3. Problems with negative routes
- Massive failures can cause lots of negative routes and overwhelm
the switches.
- In order for a switch to know what has failed, it must know what
is supposed to be up. Knowing this requires either an
error-prone algorithm or an error-prone configuration.
- During certain network events that cause multiple routes to be
sent and/or withdrawn, the messages may race each other and cause
transient loss of connectivity to paths that were otherwise
unaffected by the event. This occurs in link state routing
protocols as well.
- Computation of forwarding table entries may consume a lot of CPU
time in pathological cases. However, even in pathological cases,
this is still much less CPU time than it takes to compute an SPF
over a million links.
4. Use of a negative route in BGP
Three new BGP well known communities are defined:
- Hole-Punch: A route with this community can punch a hole out of
another route with a shorter netmask that covers the address
space of this route.
- Punch-Accept: A route with this community can have holes punched
out of it by hole punch routes.
- Do-not-Aggregate: Do not aggregate this route.
A fabric switch will aggregate routes learnt from neighbors to its
south. It must know all the routes that are expected to complete the
aggregate. It will announce the aggregate with the Punch-Accept
community. If any of the routes that are expected to complete the
aggregate are missing, then it will announce those missing routes
with the Hole-Punch and Do-not-Aggregate communities along with the
aggregate route.
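The announcing side of this procedure might be sketched as follows.
The function name build_announcements() and its arguments are
illustrative assumptions, communities are modelled as plain strings,
and the sketch always announces the aggregate; it does not model the
case where the aggregate itself is withdrawn.

```python
import ipaddress

HOLE_PUNCH = "Hole-Punch"
PUNCH_ACCEPT = "Punch-Accept"
DO_NOT_AGGREGATE = "Do-not-Aggregate"


def build_announcements(aggregate, expected, learnt):
    """Return (prefix, communities) pairs for one configured aggregate.

    aggregate: the aggregate prefix
    expected:  prefixes that are expected to complete the aggregate
    learnt:    prefixes actually learnt from southbound neighbors
    """
    announcements = [(aggregate, {PUNCH_ACCEPT})]
    # Every expected component that is missing becomes a negative route.
    for prefix in sorted(set(expected) - set(learnt)):
        announcements.append((prefix, {HOLE_PUNCH, DO_NOT_AGGREGATE}))
    return announcements


net = ipaddress.ip_network
aggregate = net("10.1.0.0/22")
expected = [net("10.1.0.0/24"), net("10.1.1.0/24"),
            net("10.1.2.0/24"), net("10.1.3.0/24")]
learnt = [net("10.1.0.0/24"), net("10.1.1.0/24"), net("10.1.3.0/24")]

result = build_announcements(aggregate, expected, learnt)
```

In this example one component, 10.1.2.0/24, is missing, so a single
negative route accompanies the aggregate.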
A receiver of a route with the Hole-Punch community will give it a
lower than normal local preference and will search the BGP table for
other routes with the following properties:
- a shorter netmask than this route,
- covers the address space of this route,
- has the Punch-Accept community,
- is installed in the Routing Table.
This is the candidate set. Then, it will remove any routes that have
a shorter netmask than the route with the longest netmask in the set.
The final candidate set of routes will all have the same prefix. For
each route in the candidate set, BGP will create a new route with the
same prefix as the Hole-punch route and the same attributes as the
Punch-Accept route. This new route is called a chad route. If a
route has an MPLS label, then the label is considered part of the
attributes, not part of the prefix.
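The selection of the candidate set and the creation of chad routes
can be sketched as below. The Route class, its fields and the
make_chads() function are assumptions made for illustration; a real
BGP table carries many more attributes, and only IPv4 prefixes are
handled here.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv4Network
    nexthop: str
    communities: frozenset = frozenset()
    installed: bool = True   # installed in the Routing Table
    chad: bool = False


def make_chads(hole_punch, bgp_table):
    """Create the chad routes that one Hole-Punch route gives rise to."""
    candidates = [
        r for r in bgp_table
        if "Punch-Accept" in r.communities
        and r.installed
        and r.prefix.prefixlen < hole_punch.prefix.prefixlen
        and hole_punch.prefix.subnet_of(r.prefix)
    ]
    if not candidates:
        return []
    # Keep only the candidates with the longest netmask.
    longest = max(r.prefix.prefixlen for r in candidates)
    candidates = [r for r in candidates if r.prefix.prefixlen == longest]
    # Prefix from the Hole-Punch route, attributes from Punch-Accept.
    return [
        Route(prefix=hole_punch.prefix, nexthop=r.nexthop,
              communities=r.communities, installed=False, chad=True)
        for r in candidates
    ]


net = ipaddress.ip_network
pa = frozenset({"Punch-Accept"})
table = [
    Route(net("10.0.0.0/8"), "north1", pa),
    Route(net("10.1.0.0/16"), "north2", pa),
    Route(net("10.1.0.0/16"), "north3", pa),
    Route(net("10.1.0.0/16"), "north4"),   # no Punch-Accept community
]
hole = Route(net("10.1.2.0/24"), "north2", frozenset({"Hole-Punch"}))
chads = make_chads(hole, table)
```

Here the /8 route is dropped because a longer covering Punch-Accept
prefix exists, and the route through north4 is never a candidate.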
Chad routes will take part in bestpath and multipath selection. If a
chad route becomes a bestpath or a multipath, it will be installed in
the Routing Table. However, chad routes are not advertised by
default. That means if a chad route is bestpath and other routes
exist for the same prefix, then no route is advertised for that
prefix.
If a chad route has the same nexthop (and MPLS label, if labels are
used) as a hole-punch route of the same prefix, then the chad route
becomes hidden. Hidden means that it cannot take part in route
selection.
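The hiding rule can be expressed as a simple filter. In this sketch
the Path tuple and the visible_chads() function are hypothetical
names, and the prefix is kept as a string for brevity.

```python
from typing import NamedTuple, Optional


class Path(NamedTuple):
    prefix: str
    nexthop: str
    label: Optional[int] = None   # MPLS label, if labels are used


def visible_chads(chads, hole_punches):
    """Return the chad routes that are not hidden by a hole-punch route."""
    blocked = {(hp.prefix, hp.nexthop, hp.label) for hp in hole_punches}
    return [c for c in chads
            if (c.prefix, c.nexthop, c.label) not in blocked]


chads = [
    Path("10.1.2.0/24", "north1"),
    Path("10.1.2.0/24", "north2"),
    Path("10.1.2.0/24", "north3"),
]
hole_punches = [Path("10.1.2.0/24", "north2")]
usable = visible_chads(chads, hole_punches)
```

The neighbor that sent the hole-punch route (north2 in this example)
is thereby excluded from the multipath for the punched prefix, while
the other neighbors remain eligible.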
5. Implementation Notes to Reduce CPU Time Consumption
This section is not normative.
When a Punch-Accept route is received, BGP needs to scan a subtree of
the BGP prefix table rooted at the prefix of the Punch-Accept route
to look for Hole-Punch routes that might create chad routes from it.
That subtree could be large. To reduce the number of routes to scan,
a separate prefix table is created to store copies of the Hole-Punch
routes. The number of Hole-Punch routes is expected to be much
smaller than the total number of routes. That makes the scan much
quicker. The Hole-Punch routes must additionally be stored in the
regular BGP route table.
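One possible shape for such a side table is sketched below. The class
HolePunchIndex and its methods are illustrative assumptions. For
brevity, a linear scan over a dictionary stands in for the prefix
tree; the point is only that the scan visits Hole-Punch routes and
nothing else. The sketch handles a single address family.

```python
import ipaddress
from collections import defaultdict


class HolePunchIndex:
    """Side table that holds copies of the Hole-Punch routes only."""

    def __init__(self):
        self._by_prefix = defaultdict(list)

    def add(self, prefix, nexthop):
        self._by_prefix[ipaddress.ip_network(prefix)].append(nexthop)

    def covered_by(self, punch_accept_prefix):
        """Return Hole-Punch prefixes strictly inside the given prefix."""
        parent = ipaddress.ip_network(punch_accept_prefix)
        return sorted(
            p for p in self._by_prefix
            if p.prefixlen > parent.prefixlen and p.subnet_of(parent)
        )


index = HolePunchIndex()
index.add("10.1.2.0/24", "north2")
index.add("10.1.3.0/24", "north1")
index.add("192.168.0.0/24", "north1")

inside = index.covered_by("10.1.0.0/16")
```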
6. Smooth Startup and Avoidance of Too Many Negative Routes
When several switches of a data center fabric start up at the same
time, many negative routes can be transiently created before the
whole system is up.
When the BGP process starts, it will typically start in receive-only
mode for some time, then perform route selection and send out its
own updates. To ensure a smooth startup of the data center when many
nodes start at the same time, the startup sequence is modified as
follows.
- All BGP speakers SHOULD send an End-of-RIB (EOR) marker once they
have sent all of their routes after the BGP session is established.
- When all southbound configured BGP neighbors have sent their EOR,
the BGP speaker will perform route selection and send all updates
to the northbound neighbors and then send EOR. If some
southbound neighbors cannot establish, a timer will be used to
prevent waiting forever.
- After the previous step completes, when all northbound configured
BGP neighbors have sent their EOR, the BGP speaker will perform
route selection and send all updates to the southbound neighbors
and then send EOR. If some northbound neighbors cannot
establish, a timer will be used to prevent waiting forever.
- If the number of received negative routes causes too many
forwarding entries, then BGP can look for aggregate routes that
are accompanied by many hole-punch routes and invalidate some of
the aggregate routes and their accompanying hole-punch routes.
If the number of received negative routes is too many to hold in
the BGP table, then BGP can shut down neighbor sessions that are
sending the most negative routes.
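The ordering of the startup steps above can be sketched as a small
state machine. The class and method names are hypothetical, actions
are recorded as strings rather than performed, and the sketch assumes
that each waiting phase has its own timer. The handling of excessive
negative routes is not modelled.

```python
from enum import Enum


class Phase(Enum):
    WAIT_SOUTH = 1   # waiting for EOR from southbound neighbors
    WAIT_NORTH = 2   # waiting for EOR from northbound neighbors
    DONE = 3         # updates have been sent in both directions


class Startup:
    def __init__(self, south, north):
        self.south = set(south)
        self.north = set(north)
        self.eor_seen = set()
        self.phase = Phase.WAIT_SOUTH
        self.log = []   # actions taken, in order

    def on_eor(self, neighbor):
        self.eor_seen.add(neighbor)
        self._advance(timer_expired=False)

    def on_timer(self):
        self._advance(timer_expired=True)

    def _advance(self, timer_expired):
        if self.phase is Phase.WAIT_SOUTH and (
                timer_expired or self.south <= self.eor_seen):
            self.log.append("select routes; send updates and EOR north")
            self.phase = Phase.WAIT_NORTH
            timer_expired = False   # assumption: one timer per phase
        if self.phase is Phase.WAIT_NORTH and (
                timer_expired or self.north <= self.eor_seen):
            self.log.append("select routes; send updates and EOR south")
            self.phase = Phase.DONE


a = Startup(south={"s1", "s2"}, north={"n1"})
a.on_eor("n1")      # northbound EOR arrives early; nothing is sent yet
a.on_eor("s1")
a.on_eor("s2")      # both steps can now run, in order

b = Startup(south={"s1"}, north={"n1"})
b.on_timer()        # s1 never establishes; the timer ends the wait
```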
7. Avoidance of Transients
If one event were to cause both an aggregate and a hole-punch route
to be announced at the same time, but the hole-punch route were to
arrive late, a transient could result. The following rules prevent
that.
- It is common practice for aggregate routes to be withdrawn when
no components of the aggregate exist. Hole-Punch routes need to
always be announced, even if the aggregate is not.
- After a BGP session establishes, no routes that are received from
it should be installed in the RIB until the EOR is received from
that session.
- If overlapping hole-punch routes need to be updated and
withdrawn, then the updates must be sent before the withdrawals.
- If overlapping hole-punch and punch-accept routes need to be
updated, then the hole-punch routes must be updated first.
- If overlapping hole-punch and punch-accept routes need to be
withdrawn, then the punch-accept routes must be withdrawn first.
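The three ordering rules for overlapping routes can be satisfied by a
single send order. The order used below is one choice that is
consistent with the rules, not the only possible one; the rank table
and the order_messages() function are assumptions for illustration.

```python
# Lower rank is sent first. This total order is an assumption chosen
# so that all three ordering rules for overlapping routes hold.
SEND_RANK = {
    ("hole-punch", "update"): 0,
    ("punch-accept", "update"): 1,
    ("punch-accept", "withdraw"): 2,
    ("hole-punch", "withdraw"): 3,
}


def order_messages(messages):
    """Sort (kind, action, prefix) tuples into the send order above."""
    return sorted(messages, key=lambda m: SEND_RANK[(m[0], m[1])])


pending = [
    ("hole-punch", "withdraw", "10.1.2.0/24"),
    ("punch-accept", "update", "10.1.0.0/16"),
    ("punch-accept", "withdraw", "10.2.0.0/16"),
    ("hole-punch", "update", "10.1.3.0/24"),
]
ordered = order_messages(pending)
```

Because sorted() is stable, messages of the same kind and action keep
their original relative order.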
8. Configuration
All the BGP sessions need to be configured on each switch. The BGP
sessions need to be configured as northbound or southbound. The
routes that are expected to complete an aggregate route must be
configured.
A companion document describes a protocol that can discover and
configure the entire fabric. If that companion document is used,
then no IP addresses or tier designations or any other location
dependent configuration is required on the switches.
9. South Triggered Automatic Disaggregation (STAD)
In this method, a node that is south of a failed link or node
announces its prefix(es) along alternative links with a hint to
trigger automatic disaggregation or inhibit their suppression on
upstream tier nodes. These disaggregated or unsuppressed routes
traverse along redundant paths and disjoint planes to switches in
other clusters in the topology where they are used in forwarding.
The hint is in the form of a well known BGP community. A few new
well known communities are used in this scheme.
- Do-not-Aggregate : Do not aggregate this route
- Tier : An Extended Community identifying the tier of the
originated route
- Dis-Aggregate : Triggers announcement of more specific routes at
receiving node
The techniques in this draft assume a Clos topology of the form
described in Figure 3 of [RFC7938], where an access switch such as a
TOR forms the lowest tier and is connected to multiple northbound
upper tier switches, which in turn are connected to multiple upper
tier switches, forming disjoint planes across the topology with
fan-outs.
A figure illustrating the topology will be added in a subsequent
version.
Upon a link failure, the node south of the failure announces its
prefix to its other northbound BGP sessions with the Do-not-Aggregate
community.
A higher tier node that receives a route with a Do-not-Aggregate
community will not suppress this route when there is a local covering
aggregate, but will propagate it further as is.
This procedure enables the more specific route to reach the
appropriate tier switches in other clusters where the topology fans
out on multiple northbound links. The received paths for the more
specific prefix form a multipath excluding the links which would lead
to failed paths in the topology.
A route that is advertised with the Do-Not-Aggregate community as per
this section will also add a Tier extended community. If this
extended community is present, then the Do-Not-Aggregate community is
only applicable at tiers that are more north than the tier indicated
in the extended community.
The Tier Extended Community ensures that the unsuppressed specific
routes do not propagate further beyond the corresponding fan-out
points in the other clusters.
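The interaction between the Do-not-Aggregate community and the Tier
extended community might be sketched as follows. The function name
is hypothetical, and the sketch assumes that tiers are numbered with
integers that increase toward the north; this document does not
define the encoding of the Tier value.

```python
def keep_unsuppressed(communities, route_tier, local_tier):
    """Decide whether a more specific route escapes local aggregation.

    communities: community names carried by the received route
    route_tier:  value of the Tier extended community, or None
    local_tier:  tier of the receiving switch
    Assumes integer tiers that increase toward the north.
    """
    if "Do-not-Aggregate" not in communities:
        return False
    if route_tier is None:
        return True
    # Only tiers north of the indicated tier honor Do-not-Aggregate.
    return local_tier > route_tier


dna = {"Do-not-Aggregate"}
north_of_origin = keep_unsuppressed(dna, route_tier=1, local_tier=2)
same_tier = keep_unsuppressed(dna, route_tier=2, local_tier=2)
no_tier = keep_unsuppressed(dna, route_tier=None, local_tier=1)
plain_route = keep_unsuppressed(set(), route_tier=1, local_tier=3)
```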
If all the northbound links or BGP sessions at a node have failed,
then the node will announce its southbound route with the
Dis-Aggregate community. This signals all its south-side nodes to
advertise their north-bound routes with the Do-Not-Aggregate
community along the other north-bound links.
These techniques are applicable to any tier in the topology.
At the lowest tier, if there are servers that are attached to more
than one fabric switch (e.g., TOR), then the host routes (or
configured more-specific routes) for the server are not aggregated by
the TOR to its connected upper tier switches. In this case, these
routes are aggregated by the upper-tier switches towards the rest of
the topology.
10. Configuration for STAD
Each switch has the notion of northbound and southbound sessions or
links. In addition, it is assigned to a tier in the hierarchy. The
switch uses this configuration to drive the procedures described in
the section above. A switch at the lowest tier (e.g., a TOR) will have
server subnet prefixes configured. Switches at higher tiers have
aggregates configured.
11. Security Considerations
TBD
12. IANA Considerations
TBD
13. Acknowledgements
14. References
14.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
14.2. Informative References
[RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
BGP for Routing in Large-Scale Data Centers", RFC 7938,
DOI 10.17487/RFC7938, August 2016,
<https://www.rfc-editor.org/info/rfc7938>.
Authors' Addresses
Jakob Heitz
Cisco
170 West Tasman Drive
San Jose, CA 95134
USA
Email: jheitz@cisco.com
Dhananjaya Rao
Cisco
170 West Tasman Drive
San Jose, CA 95134
USA
Email: dhrao@cisco.com