Inter-Domain Routing H. Gredler
Internet-Draft Juniper Networks, Inc.
Intended status: Standards Track March 9, 2015
Expires: September 10, 2015

Prefix-SID extensions for BGP-LU
draft-gredler-idr-bgplu-prefix-sid-00

Abstract

The MPLS source routing paradigm provides path control for both intra- and inter-Autonomous System (AS) traffic. In most MPLS deployments the ingress of an MPLS tunnel is an IP router. The availability of MPLS forwarding stacks for host operating systems is extending the MPLS perimeter to Hypervisors and Servers. Recent Data Center designs use an IGP-less routing paradigm based on massive ECMP multipath using external BGP. This document outlines how Hypervisors and Servers may interact with the MPLS control and data planes using extensions to the BGP labeled unicast protocol (BGP-LU).

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 10, 2015.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.


1. Introduction

Recent Data Center routing designs are modeled as shown in Figure 1. Rather than using an IGP plus internal BGP (iBGP), an IGP-less design is favored for disseminating routing information. See [I-D.ietf-rtgwg-bgp-routing-large-dc] for the rationale and for detailed information on why and how to do so. Today BGP-LU [RFC3107] is used both as an intra-AS [I-D.ietf-mpls-seamless-mpls] and as an inter-AS routing protocol. Because of the IGP-less routing paradigm, topology information gets lost. Of particular interest is the ability to direct traffic to a specific node, and hence the ability to construct explicit paths, denominated by a set of nodes, for traffic engineering. BGP-LU today may advertise an MPLS transport path between Autonomous Systems. This document describes extensions to the BGP-LU protocol such that, in addition to the advertised MPLS label-switched paths (LSPs), all potential MPLS label-switched paths of any given node in the Data Center are exposed to ingress nodes.

The protocol extensions in this document are in full compliance with the MPLS Architecture documented in [RFC3031].

             +------+  +------+
             |      |  |      |
             |      |--|      |           Tier-1 / AS 651xx
             |      |  |      |
             +------+  +------+
               |  |      |  |
     +---------+  |      |  +----------+
     | +-------+--+------+--+-------+  |
     | |       |  |      |  |       |  |
   +----+     +----+    +----+     +----+
   |    |     |    |    |    |     |    |
   |    |-----|    |    |    |-----|    | Tier-2 / AS 652xx
   |    |     |    |    |    |     |    |
   +----+     +----+    +----+     +----+
      |         |          |         |
      |         |          |         |
      | +-----+ |          | +-----+ |
      +-|     |-+          +-|     |-+    Tier-3 / AS 653xx
        +-----+              +-----+
         | | |                | | |
     <- Servers ->        <- Servers ->  Servers / AS 65534
	

Figure 1: eBGP-centric Data Center routing

2. Motivation, Rationale and Applicability

The specifications for Segment Routing ([I-D.ietf-isis-segment-routing-extensions] and [I-D.ietf-ospf-segment-routing-extensions]) provide extensions for setting up hop-by-hop shortest path routed MPLS LSPs. The protocol semantics used are:

advertised by any router in an IGP domain. This not only sets up MPLS sink-trees to each egress router in a domain, but also allows traffic to be steered using stacks of node labels. The chosen protocol semantics are essentially a compression scheme to advertise all MPLS SPT paths in a domain.

The ability to do explicit path routing based on stacked labels constructed at the Hypervisors / Servers, without running conventional TE protocols such as RSVP-TE, is a lightweight way to scale the Data Center Fabric.
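
To illustrate what a stack of node labels looks like on the wire, the following minimal sketch packs a list of labels into MPLS label stack entries. It assumes the 32-bit entry layout of RFC 3032 (20-bit label, 3-bit traffic class, bottom-of-stack bit, 8-bit TTL). The function name, the label values and the default TTL are hypothetical and are not defined by this document.

```python
import struct


def encode_label_stack(labels, ttl=64):
    """Pack labels (outermost first) into MPLS label stack entries.

    Assumption: RFC 3032 layout, traffic class left at zero.
    """
    if not labels:
        raise ValueError("label stack must not be empty")
    out = b""
    for position, label in enumerate(labels):
        if not 0 <= label <= 0xFFFFF:
            raise ValueError("label does not fit into 20 bits")
        # Only the last (innermost) entry carries the bottom-of-stack bit.
        bottom = 1 if position == len(labels) - 1 else 0
        entry = (label << 12) | (bottom << 8) | (ttl & 0xFF)
        out += struct.pack("!I", entry)
    return out


# Hypothetical explicit path via two node labels.
stack = encode_label_stack([800001, 800002])
```

A Hypervisor / Server would prepend such a stack to the payload; each transit node pops or swaps the top entry according to its own forwarding state.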

In order to support deployments of Segment Routing across routing protocol boundaries it is required to keep a common set of semantics across all routing protocols. This document specifies BGP-LU extensions to be able to address Node-SIDs across routing-protocol boundaries.

3. Deployment Considerations

Depending on the sophistication of the MPLS stack at the Hypervisor / Server, there are various levels of considerations for deployment.

3.1. Control plane restart

In case a restart of the first-hop router needs to be performed, there may be some forwarding state churn at the Hypervisor / Server. It would be desirable that upon control-plane restart the network node uses the same label allocations as in its previous incarnation. Unfortunately none of the BGP graceful restart extensions allows a node to re-acquire the label-mapping state of its previous incarnation from the network. Therefore a restarting node will allocate labels to FECs in the temporal order in which they arrive. This degrades to pseudo-random, non-predictable label allocations. It is desirable that a BGP-LU implementation allocates the labels in a deterministic way, such that a temporary control-plane loss does not impact forwarding between the Hypervisor / Server and the network.

A networking node speaking BGP-LU Prefix SID MUST therefore implement an MPLS label-allocation strategy which produces a deterministic, locally allocated label block for all of its Prefix SIDs.

For example, an implementation MAY statically allocate a Label Base of 800000 and a block size of 16000 labels and delegate that label block exclusively to BGP-LU Prefix SID allocations, such that the same Label Base is used across control-plane restarts.
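
A minimal sketch of such a deterministic allocation, using the example values from the paragraph above, might look as follows. It assumes that a label is derived as Label Base plus an index into the block, which mirrors the Segment Routing IGP semantics but is not spelled out by this document; the function name is hypothetical.

```python
LABEL_BASE = 800000  # example Label Base from the text
BLOCK_SIZE = 16000   # example block size from the text


def label_for_index(index, base=LABEL_BASE, size=BLOCK_SIZE):
    """Map an index to a label inside the statically reserved block.

    Assumption: label = base + index. Because base and size are
    static configuration, the result is identical before and after
    a control-plane restart.
    """
    if not 0 <= index < size:
        raise ValueError("index outside of the reserved label block")
    label = base + index
    if label > 0xFFFFF:
        raise ValueError("label does not fit into the 20-bit label space")
    return label


first_label = label_for_index(0)
last_label = label_for_index(BLOCK_SIZE - 1)
```

Since the mapping depends only on static configuration and the index, no label-mapping state needs to be recovered from the network after a restart.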

3.2. BGP-LU as Server Control Plane

In this case the Hypervisor / Server has a "client-only" BGP-LU stack in order to interface to the network. This is the most distributed way of building label-switched paths across the network. As soon as there is a reachability change, all of the Hypervisors / Servers are notified. There is almost no time lag for updating servers due to the inherent push model of the BGP protocol.

Most of the implementation complexity of a BGP implementation comes from the BGP Update generation subsystem. For a client-only BGP implementation this complexity is fortunately negligible, as typically only one or two (for redundancy reasons) BGP sessions are required, so the BGP Update generation complexity stays limited.

3.3. Labeled-ARP as Server Control Plane

The Labeled ARP Protocol [I-D.kompella-mpls-larp] may be used as a lightweight alternative to the BGP-LU protocol. Labeled ARP is a soft-state protocol, and therefore special consideration needs to be given to, e.g., refresh timers and labels in the network. Yet it is a distributed variant of LSP state propagation and hence reacts immediately to network topology changes and label-to-FEC changes.

3.4. Static Labels and Controller as Server Control Plane

Static labels do not need control-plane sessions between Hypervisors / Servers and the network. The assumption is that an external controller transfers the routing/label information into the Hypervisor / Server. The main disadvantage of that model is that the update process is not distributed, and hence a controller needs excellent horizontal scaling abilities in order to update on the order of 100K routes/labels on the order of 100K servers.

4. BGP Prefix-SID Attribute

In order to facilitate dense packing of network nodes and node labels into a deterministic label range as described in Section 3.1, a new protocol extension called the "BGP Prefix SID Attribute" is proposed.

The BGP Prefix SID is a new optional, transitive BGP path attribute. The attribute type code for BGP Prefix SID attribute is to be assigned by IANA.

The value field of the BGP Prefix SID attribute is defined here to be a set of elements encoded as "Type/Length/Value" (i.e., a set of TLVs). Each such TLV is encoded as shown in Figure 2.

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |       Type    |               Length          |               |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
  ~                                                               ~
  |                         Value (variable)                      |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            

Figure 2: TLV format

  • Type: A single octet encoding the TLV Type. Unrecognized Types are preserved and propagated. In order to compare NLRIs with unknown TLVs, all TLVs MUST be ordered in ascending order by TLV Type. If there are multiple TLVs of the same type, then these TLVs MUST be ordered in ascending order of the TLV value. All TLVs that are not specified as mandatory are considered optional.
  • Length: Two octets encoding the length of the value portion in octets (thus a TLV with no value portion would have a length of zero). The TLV is not padded to four-octet alignment.
  • Value: A field containing zero or more octets.
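
The TLV layout described above (one-octet Type, two-octet Length, unpadded Value) can be sketched as follows. This is an illustration only: the function names are hypothetical, and network byte order for the Length field is an assumption based on common BGP practice.

```python
import struct


def encode_tlv(tlv_type, value):
    """Encode one TLV: 1-octet Type, 2-octet Length, then the Value."""
    if not 0 <= tlv_type <= 0xFF:
        raise ValueError("Type must fit into one octet")
    if len(value) > 0xFFFF:
        raise ValueError("Value too long for a two-octet Length")
    return struct.pack("!BH", tlv_type, len(value)) + value


def decode_tlvs(data):
    """Split an attribute value into a list of (type, value) tuples."""
    tlvs = []
    offset = 0
    while offset < len(data):
        if len(data) - offset < 3:
            raise ValueError("truncated TLV header")
        tlv_type, length = struct.unpack_from("!BH", data, offset)
        offset += 3
        if len(data) - offset < length:
            raise ValueError("truncated TLV value")
        # Unrecognized types are kept so that they can be propagated.
        tlvs.append((tlv_type, data[offset:offset + length]))
        offset += length
    return tlvs


def sort_tlvs(tlvs):
    """Order TLVs ascending by Type, then by Value within a Type."""
    return sorted(tlvs)


encoded = encode_tlv(1, b"\x00\x00\x00\x05")
```

Keeping unknown TLVs as opaque (type, value) tuples lets an implementation re-encode and propagate them unchanged.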

The following TLV types are defined in this document:

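   +------+-------------+
   | Type | Name        |
   +------+-------------+
   | 1    | Label Index |
   | 2    | Label Base  |
   | 3    | Label Range |
   +------+-------------+

Table 1: Prefix SID TLVs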

Use of other TLV types is outside the scope of this document.

4.1. Label Index TLV

  • Type: 1
  • Length: 4
  • Value: Label Index

Only one Label Index TLV per Prefix SID Attribute is allowed.
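
As an illustration, a Label Index TLV could be built and validated as sketched below. The interpretation of the four-octet value as an unsigned integer in network byte order is an assumption, as this document does not define the value encoding; the function names are hypothetical.

```python
import struct

LABEL_INDEX_TLV = 1  # Type value from this document


def encode_label_index(index):
    """Encode a Label Index TLV (Type 1, Length 4).

    Assumption: the value is an unsigned 32-bit integer in
    network byte order.
    """
    return struct.pack("!BHI", LABEL_INDEX_TLV, 4, index)


def extract_label_index(tlvs):
    """Return the Label Index from a list of (type, value) tuples.

    Returns None if no Label Index TLV is present and rejects
    attributes carrying more than one such TLV.
    """
    values = [value for tlv_type, value in tlvs
              if tlv_type == LABEL_INDEX_TLV]
    if not values:
        return None
    if len(values) > 1:
        raise ValueError("more than one Label Index TLV")
    if len(values[0]) != 4:
        raise ValueError("Label Index TLV must have Length 4")
    return struct.unpack("!I", values[0])[0]


wire = encode_label_index(5)
```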

4.2. Label Base TLV

  • Type: 2
  • Length: 3
  • Value: Label Base

One or more occurrences of the Label Base TLV are allowed. A Label Base TLV MUST be followed by a Label Range TLV.

4.3. Label Range TLV

  • Type: 3
  • Length: 3
  • Value: Label Range

One or more occurrences of the Label Range TLV are allowed. A Label Range TLV MUST be preceded by a Label Base TLV.
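
The pairing of Label Base and Label Range TLVs can be sketched as follows. Two assumptions are made that go beyond the text: the three-octet values are read as unsigned 24-bit integers in network byte order, and "followed by" is taken to mean "immediately followed by". The function names are hypothetical.

```python
LABEL_BASE_TLV = 2   # Type values from this document
LABEL_RANGE_TLV = 3


def encode_u24_tlv(tlv_type, value):
    """Encode a TLV with a 3-octet value (assumed unsigned, big-endian)."""
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError("value does not fit into three octets")
    return bytes([tlv_type]) + (3).to_bytes(2, "big") + value.to_bytes(3, "big")


def pair_label_blocks(tlvs):
    """Return (base, range) pairs from (type, value) tuples in wire order."""
    blocks = []
    i = 0
    while i < len(tlvs):
        tlv_type, value = tlvs[i]
        if tlv_type == LABEL_BASE_TLV:
            if i + 1 >= len(tlvs) or tlvs[i + 1][0] != LABEL_RANGE_TLV:
                raise ValueError("Label Base TLV without Label Range TLV")
            range_value = tlvs[i + 1][1]
            if len(value) != 3 or len(range_value) != 3:
                raise ValueError("Label Base/Range TLVs must have Length 3")
            blocks.append((int.from_bytes(value, "big"),
                           int.from_bytes(range_value, "big")))
            i += 2
        elif tlv_type == LABEL_RANGE_TLV:
            raise ValueError("Label Range TLV without Label Base TLV")
        else:
            i += 1  # other TLV types are ignored by this sketch
    return blocks


base_tlv = encode_u24_tlv(LABEL_BASE_TLV, 800000)
range_tlv = encode_u24_tlv(LABEL_RANGE_TLV, 16000)
```

With the example values from Section 3.1, the resulting pair (800000, 16000) describes the label block out of which Prefix SID labels are taken.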

5. Acknowledgements

Many thanks to TBD for their detailed review and insightful comments.

6. IANA Considerations

This document requests a code point from the BGP Path Attributes registry named 'Prefix SID'.

This document requests creation of a new registry for BGP Prefix SID TLVs. Value 0 is reserved. The maximum value is 255. The registry will be initialized as shown in Table 1. Allocations within the registry will require documentation of the proposed use of the allocated value (=Specification required) and approval by the Designated Expert assigned by the IESG (see [RFC5226]).

7. Security Considerations

This document does not introduce any change in terms of BGP security.

8. References

8.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3031] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.
[RFC3107] Rekhter, Y. and E. Rosen, "Carrying Label Information in BGP-4", RFC 3107, May 2001.
[RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.

8.2. Informative References

[I-D.ietf-isis-segment-routing-extensions] Previdi, S., Filsfils, C., Bashandy, A., Gredler, H., Litkowski, S., Decraene, B. and J. Tantsura, "IS-IS Extensions for Segment Routing", Internet-Draft draft-ietf-isis-segment-routing-extensions-03, October 2014.
[I-D.ietf-mpls-seamless-mpls] Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz, M. and D. Steinberg, "Seamless MPLS Architecture", Internet-Draft draft-ietf-mpls-seamless-mpls-07, June 2014.
[I-D.ietf-ospf-segment-routing-extensions] Psenak, P., Previdi, S., Filsfils, C., Gredler, H., Shakir, R., Henderickx, W. and J. Tantsura, "OSPF Extensions for Segment Routing", Internet-Draft draft-ietf-ospf-segment-routing-extensions-04, February 2015.
[I-D.ietf-rtgwg-bgp-routing-large-dc] Lapukhov, P., Premji, A. and J. Mitchell, "Use of BGP for routing in large-scale data centers", Internet-Draft draft-ietf-rtgwg-bgp-routing-large-dc-01, February 2015.
[I-D.kompella-mpls-larp] Kompella, K., Rajagopalan, B. and G. Swallow, "Label Distribution Using ARP", Internet-Draft draft-kompella-mpls-larp-02, October 2014.

Author's Address

Hannes Gredler
Juniper Networks, Inc.
1194 N. Mathilda Ave.
Sunnyvale, CA 94089
US
EMail: hannes@juniper.net