Network Working Group                                           R. White
Internet-Draft                                                  S. Zandi
Intended status: Informational
Expires: September 4, 2017                                 March 3, 2017
OpenFabric
draft-white-openfabric-00
Spine and leaf topologies are widely used in hyperscale and cloud scale networks. In most of these networks, configuration is automated, but difficult, and topology information is extracted through broad based connections. Policy is often integrated into the control plane as well, making configuration, management, and troubleshooting difficult. OpenFabric is an adaptation of an existing, widely deployed link state protocol, Intermediate System to Intermediate System (IS-IS), that is designed to:
This document begins with an overview of OpenFabric, including a description of what may be removed from IS-IS to enable scaling. The document then describes an optimized adjacency formation process; an optimized flooding scheme; some thoughts on the operation of OpenFabric, metrics, and aggregation; and finally a description of the changes to the IS-IS protocol required for OpenFabric.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 4, 2017.
Copyright (c) 2017 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Spine and leaf fabrics are often used in large scale data centers; in this application, they are commonly called a fabric because of their regular structure and predictable forwarding and convergence properties. This document describes modifications to the IS-IS protocol to enable it to run efficiently on a large scale spine and leaf fabric, OpenFabric. The goals of this control plane are:
In building any scalable system, it is often best to begin by removing what is not needed. In this spirit, OpenFabric implementations MAY remove the following from IS-IS:
To create a scalable link state fabric, OpenFabric includes the following:
OpenFabric implementations:
OpenFabric implementations MUST NOT be mixed with standard IS-IS implementations in operational deployments. OpenFabric and standard IS-IS implementations SHOULD be treated as two separate protocols.
The following spine and leaf fabric will be used to describe these modifications.
+----+ +----+ +----+ +----+ +----+ +----+ | 1A | | 1B | | 1C | | 1D | | 1E | | 1F | (T0) +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 2A | | 2B | | 2C | | 2D | | 2E | | 2F | (T1) +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 3A | | 3B | | 3C | | 3D | | 3E | | 3F | (T2) +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 4A | | 4B | | 4C | | 4D | | 4E | | 4F | (T1) +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ | 5A | | 5B | | 5C | | 5D | | 5E | | 5F | (T0) +----+ +----+ +----+ +----+ +----+ +----+
Figure 1
To reduce confusion (spine and leaf fabrics are difficult to draw in plain text art), this diagram does not contain the connections between devices. The reader should assume that each device in a given layer is connected to every device in the layer above it. For instance:
The tiers or stages of the fabric are also marked for easier reference. T0 devices are assumed to be connected to application servers; that is, they are Top of Rack (ToR) routers. The remaining tiers, T1 and T2, are connected only to the fabric itself. Note there are no "cross links," or "east west" links, in the illustrated fabric. The fabric locality detection mechanism described here will not work if there are cross links running east/west through the fabric. Locality detection may be possible in such a fabric; this is an area for further study.
The authors would like to thank Nick Russo, Nikos Triantafillis, Rodny Molina, and Ivan Pepelnjak for their comments and review of the concepts and text of this document.
While adjacency formation is not considered particularly burdensome in IS-IS, it is still useful to reduce the amount of state transferred across the network when connecting a new router to the fabric. Any such optimization is bound to present a tradeoff between several factors; the mechanism described here increases the amount of time required to form adjacencies slightly in order to reduce the total state carried across the network. The process is:
This process allows each IS newly added to the fabric to exchange a full table once; a very minimal amount of information will be transferred with the remaining neighbors to reach full synchronization.
The tier to which a router is connected is useful to enable autoconfiguration of routers connected to the fabric, and to reduce flooding. This section describes mechanisms for determining the tier at which a router is connected in the fabric in several steps. The first step is to find the Farthest Distance (FD) and the Total Distance (TD), which are useful in this process. To find the FD and TD:
If FD == TD == 2, this is a three stage fabric; it is not possible to determine the tier at which the local node is located based on any calculation, because the topology is perfectly symmetric. In this case:
If FD == TD, and TD >= 4, this is a greater than three stage fabric; the local device SHOULD advertise 0x00 in its IS reachability tier sub-TLV.
For instance, in the diagram above, 1A would:
If TD != FD, this is a greater than three stage fabric; the local device SHOULD advertise (TD - FD) in its IS reachability tier sub-TLV.
For example, in the above five stage fabric, 3B would:
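The tier selection rules in this section can be sketched in a few lines of Python. This is a minimal illustration, not a normative procedure. It assumes that FD is the largest hop count from the local node to any other node and that TD is the largest hop count between any two nodes in the fabric; both are interpretations, since the steps for finding FD and TD are not reproduced here. All function names are hypothetical, and the three stage case returns None rather than guessing at a tier.

```python
from collections import deque
from typing import Dict, List, Optional


def build_fabric(rows: int = 5, width: int = 6) -> Dict[str, List[str]]:
    """Build the example fabric: each device links to every device in adjacent rows."""
    names = [[f"{r + 1}{chr(ord('A') + c)}" for c in range(width)] for r in range(rows)]
    adj: Dict[str, List[str]] = {name: [] for row in names for name in row}
    for upper, lower in zip(names, names[1:]):
        for u in upper:
            for v in lower:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def hop_counts(adj: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Breadth-first search, equivalent to SPF with every link at the same metric."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def farthest_distance(adj: Dict[str, List[str]], node: str) -> int:
    """FD (assumed meaning): hops from `node` to the most distant node."""
    return max(hop_counts(adj, node).values())


def total_distance(adj: Dict[str, List[str]]) -> int:
    """TD (assumed meaning): the largest FD seen anywhere in the fabric."""
    return max(farthest_distance(adj, node) for node in adj)


def tier_from_distances(fd: int, td: int) -> Optional[int]:
    """Apply the rules from the text; None means the tier cannot be calculated."""
    if fd == td == 2:
        return None          # three stage fabric: perfectly symmetric
    if fd == td and td >= 4:
        return 0x00          # node sits at the edge of the fabric
    if td != fd:
        return td - fd
    return None              # combination not covered by the text
```

Under these assumptions, the five stage example fabric gives TD = 4; 1A has FD = 4 and so advertises tier 0, while 3B has FD = 2 and advertises tier 2.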
Flooding is perhaps the most challenging scaling issue for a link state protocol running on a dense, large scale fabric. To reduce flooding, OpenFabric takes advantage of information already available in the link state protocol: the list of the local intermediate system's neighbors' neighbors, and the fabric locality computed above. The following tables are required to compute a set of reflooders:
NL is set to contain all neighbors, sorted deterministically (for instance, from the highest router ID to the lowest). All intermediate systems within a single fabric SHOULD use the same mechanism for sorting the NL list. NN is set to contain all neighbors' neighbors, or all intermediate systems that are two hops away, as determined by performing a truncated SPF. The DNR and RF tables are initially empty. To begin:
Then, for every IS in NL:
When flooding, LSPs transmitted to adjacent neighbors on the RF list will be transmitted normally. Adjacent intermediate systems on this list will reflood received LSPs into the next stage of the topology, ensuring database synchronization. LSPs transmitted to adjacent neighbors on the DNR list, however, will have the DNR bit set in the optional flooding sub-TLV (see the packet format modifications and TLVs below).
Any IS receiving an LSP with the DNR bit set will not set the Send Routing Message (SRM) flag on any interface for this LSP; hence the LSP will not be reflooded by this IS to any adjacent neighbor. This reduces flooding to the minimum possible while retaining full Link State Database (LSDB) synchronization.
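One way to picture how the NL, NN, RF, and DNR tables interact is the Python sketch below. It assumes a simple greedy covering rule: a neighbor is placed on RF if it reaches at least one two-hop intermediate system not yet covered, and on DNR otherwise. That rule is an assumption made for illustration and is not taken from the steps of this section; the function name and the `reaches` mapping are likewise hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def select_reflooders(
    neighbors: Iterable[str],
    two_hop: Iterable[str],
    reaches: Dict[str, Set[str]],
) -> Tuple[List[str], List[str]]:
    """Split adjacent neighbors into (RF, DNR) lists.

    neighbors -- adjacent intermediate systems (the NL table)
    two_hop   -- intermediate systems two hops away (the NN table)
    reaches   -- for each neighbor, the set of systems adjacent to it
    """
    # Sort deterministically, highest ID first, so every IS in the fabric
    # walks the list in the same order.
    nl = sorted(neighbors, reverse=True)
    nn = set(two_hop)
    rf: List[str] = []
    dnr: List[str] = []
    for nbr in nl:
        newly_covered = reaches.get(nbr, set()) & nn
        if newly_covered:
            # ASSUMED RULE: this neighbor reaches uncovered two-hop systems,
            # so it refloods; those systems are now covered.
            rf.append(nbr)
            nn -= newly_covered
        else:
            # Everything this neighbor reaches is already covered.
            dnr.append(nbr)
    return rf, dnr
```

For 1A in the example fabric, every T1 neighbor reaches the same set of two-hop systems, so under this greedy rule only the first neighbor in sorted order lands on RF and the remaining five land on DNR. A deployment that wants redundancy in flooding would need a less aggressive rule.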
In data center fabrics, ToR routers SHOULD NOT be used to transit between two T1 (or above) spine routers. The simplest way to prevent this is to set the overload bit [RFC3277] for all the LSPs originated from T0 routers. However, this solution would have the unfortunate side effect of causing all reachability beyond any T0 router to have the same metric, and many implementations treat a set overload bit as a metric of 0xFFFF in calculating the Shortest Path Tree (SPT). This document proposes an alternate solution which preserves the leaf node metric, while still avoiding transiting T0 routers.
Specifically, all T0 routers SHOULD advertise their metric to reach any T1 adjacent neighbor with a cost of 0xFFE. T1 routers, on the other hand, will advertise T0 routers with the actual interface cost used to reach the T0 router. Hence, links connecting T0 and T1 routers will be advertised with an asymmetric cost that discourages transiting T0 routers, while leaving reachability to the destinations attached to T0 devices the same.
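The asymmetric metric rule can be expressed as a small Python function. This is a sketch only; the function and constant names are hypothetical, and the interface cost of 10 used in the comparison is an arbitrary example value.

```python
T0_TO_T1_METRIC = 0xFFE  # cost a T0 router advertises toward a T1 neighbor


def advertised_metric(local_tier: int, neighbor_tier: int, interface_cost: int) -> int:
    """Return the metric the local router advertises for the link to a neighbor."""
    if local_tier == 0 and neighbor_tier == 1:
        # T0 -> T1 direction: advertise the high cost to discourage transit.
        return T0_TO_T1_METRIC
    # Every other direction, including T1 -> T0, uses the real interface cost.
    return interface_cost


# Path cost from 2A to 2B through a T0 router versus through a T2 router,
# with an example interface cost of 10 on every link.
via_t0 = advertised_metric(1, 0, 10) + advertised_metric(0, 1, 10)
via_t2 = advertised_metric(1, 2, 10) + advertised_metric(2, 1, 10)
```

With these values the path through the T0 router costs 10 + 0xFFE, while the path through the T2 router costs 20, so the shortest path tree avoids transiting T0. The cost from any T1 router down to a T0 router remains the real interface cost.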
While aggregation is not recommended in OpenFabric deployments, aggregation MAY take place when routing information is being transmitted from higher level tiers to lower level tiers. For instance, in the example network, 2A through 2F could advertise a single default route to 1A through 1F. 2A through 2F would simply advertise the default as if it were attached to each router locally, using either a type 135 or 236 TLV, and then block TLVs that contain reachability information (such as types 135 and 236). Type 22 TLVs, however, MUST be flooded through this boundary, so that every router in the network shares a common view of the topology.
Note that aggregation in a DC fabric can result in routing black holes in some cases, and also possibly reduce the efficiency of traffic engineering in the network.
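The filtering at an aggregation boundary might look like the following Python sketch. TLVs are modeled as (type, value) pairs with opaque values; that representation, the function name, and the placeholder byte strings in the usage example are all assumptions made for illustration.

```python
from typing import List, Tuple

Tlv = Tuple[int, bytes]  # (TLV type, opaque value); a simplified model

# TLV types that carry reachability information and are blocked at the boundary.
REACHABILITY_TYPES = frozenset({135, 236})


def tlvs_toward_lower_tier(tlvs: List[Tlv], default_route: Tlv) -> List[Tlv]:
    """Return the TLVs sent toward the lower tier.

    Reachability TLVs (types 135 and 236) are replaced by a single default
    route; everything else, including type 22 topology TLVs, passes through.
    """
    if default_route[0] not in REACHABILITY_TYPES:
        raise ValueError("default route must be carried in a type 135 or 236 TLV")
    passed = [tlv for tlv in tlvs if tlv[0] not in REACHABILITY_TYPES]
    return [default_route] + passed
```

For example, given the placeholder input `[(22, b"topology"), (135, b"prefix-a"), (236, b"prefix-b")]` and a default route `(135, b"default")`, the function returns the default route followed by the untouched type 22 TLV.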
A new sub-TLV is added to the type 22 TLV to indicate tier level, as follows:
The tier identifier field contains the tier number of the local router as calculated using the process above. If the tier number is unknown, the sub-TLV MUST be included with a tier ID of 0xFF, which indicates the advertising router does not have enough information to calculate its tier number, or there is some error in calculating a tier number.
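A possible encoding of the tier sub-TLV is sketched below in Python. The sketch assumes a conventional one octet type, one octet length, and one octet tier identifier. The type code 250 is purely hypothetical, since no code point is given here, and the function names are illustrative.

```python
import struct
from typing import Optional

TIER_SUBTLV_TYPE = 250  # HYPOTHETICAL code point, for illustration only
TIER_UNKNOWN = 0xFF     # tier ID meaning "tier could not be calculated"


def encode_tier_subtlv(tier: Optional[int]) -> bytes:
    """Encode the tier sub-TLV; None is sent as the unknown tier ID 0xFF."""
    if tier is None:
        value = TIER_UNKNOWN
    elif 0 <= tier < TIER_UNKNOWN:
        value = tier
    else:
        raise ValueError("tier must be in the range 0x00..0xFE")
    # Assumed layout: type (1 octet), length (1 octet), tier ID (1 octet).
    return struct.pack("!BBB", TIER_SUBTLV_TYPE, 1, value)


def decode_tier_subtlv(data: bytes) -> Optional[int]:
    """Decode the tier sub-TLV; returns None when the tier is unknown."""
    sub_type, length, value = struct.unpack("!BBB", data[:3])
    if sub_type != TIER_SUBTLV_TYPE or length != 1:
        raise ValueError("not a tier sub-TLV")
    return None if value == TIER_UNKNOWN else value
```

Because the sub-TLV is always present, a receiver can distinguish a router that has calculated tier 0 from one that has not been able to calculate a tier at all.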
For OpenFabric implementations, the Partition Repair bit in the LSP PDU header SHALL be treated as the Do Not Reflood (DNR) bit. Any IS receiving an LSP with the DNR bit set SHOULD NOT set the SRM flag for the LSP, so the LSP will not be flooded to adjacent routers.
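The receive-side handling of the DNR bit can be sketched in Python as follows. The sketch assumes the standard IS-IS LSP header layout, in which the Partition Repair bit is the high-order bit of the octet that also carries the attached bits, the overload bit, and the IS type. The function names and interface names are hypothetical.

```python
from typing import Iterable, List

DNR_BIT = 0x80  # Partition Repair (P) bit position, reused as Do Not Reflood


def dnr_is_set(lsp_flags_octet: int) -> bool:
    """True if the DNR bit is set in the LSP header flags octet."""
    return bool(lsp_flags_octet & DNR_BIT)


def interfaces_to_flag_srm(
    lsp_flags_octet: int, received_on: str, interfaces: Iterable[str]
) -> List[str]:
    """Return the interfaces on which the SRM flag is set for a received LSP."""
    if dnr_is_set(lsp_flags_octet):
        # DNR set: the LSP is installed in the LSDB but not reflooded.
        return []
    # Normal flooding: every interface except the one the LSP arrived on.
    return [ifname for ifname in interfaces if ifname != received_on]
```

An LSP received with a flags octet of 0x83 (DNR set) yields an empty list, while the same LSP with 0x03 is flagged for flooding on every interface other than the one it arrived on.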
This document outlines modifications to the IS-IS protocol for operation on large scale data center fabrics. While it does add new TLVs, and some local processing changes, it does not add any new security vulnerabilities to the operation of IS-IS. However, OpenFabric implementations SHOULD implement IS-IS cryptographic authentication, as described in [RFC5304], and should enable other security measures in accordance with best common practices for the IS-IS protocol.
[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC2629]  Rose, M., "Writing I-Ds and RFCs using XML", RFC 2629, DOI 10.17487/RFC2629, June 1999.
[RFC5301]  McPherson, D. and N. Shen, "Dynamic Hostname Exchange Mechanism for IS-IS", RFC 5301, DOI 10.17487/RFC5301, October 2008.
[RFC5303]  Katz, D., Saluja, R., and D. Eastlake 3rd, "Three-Way Handshake for IS-IS Point-to-Point Adjacencies", RFC 5303, DOI 10.17487/RFC5303, October 2008.
[RFC5305]  Li, T. and H. Smit, "IS-IS Extensions for Traffic Engineering", RFC 5305, DOI 10.17487/RFC5305, October 2008.
[RFC5308]  Hopps, C., "Routing IPv6 with IS-IS", RFC 5308, DOI 10.17487/RFC5308, October 2008.
[RFC5311]  McPherson, D., Ginsberg, L., Previdi, S., and M. Shand, "Simplified Extension of Link State PDU (LSP) Space for IS-IS", RFC 5311, DOI 10.17487/RFC5311, February 2009.