IDR Luyuan Fang
Internet Draft Deepak Bansal
Intended status: Standards Track Microsoft
Expires: July 26, 2016 Chandra Ramachandran
Juniper Networks
Fabio Chiussi
Nabil Bitar
Verizon
Yakov Rekhter
January 23, 2016
BGP-LU for HSDN Label Distribution
draft-fang-idr-bgplu-for-hsdn-03
Abstract
This document describes modifications of BGP Labeled Unicast (BGP-LU)
procedures for label distribution in a partitioned network.
Specifically, these procedures are suitable for building the
Hierarchical SDN (HSDN) control plane for the hyper-scale Data Center
(DC) and cloud networks.
Status of this Memo
This Internet-Draft is submitted to IETF in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Copyright Notice
Fang et al. Expires <July 26, 2016> [Page 1]
Internet-Draft BGP-LU for HSDN January 23, 2016
Copyright (c) 2016 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Terminology
3. Description of BGP-LU Procedures
3.1. Partition-Unique Label Info Extended Community
3.2 Partition-Unique Label Info Extended Community Procedures
3.3 BGP Policies on UPBNs and LMS
3.4 BGP-LU Procedures for UP0 Destinations
3.5 Advertising labels without partition label extended community
4. Route Resolution in HSDN Architecture
5. Security Considerations
6. IANA Considerations
7. Acknowledgments
8. Normative References
9. Informative References
Authors' Addresses
1. Introduction
This document describes modifications to BGP Labeled Unicast (BGP-
LU)-based procedures for label distribution [RFC3107] in a
partitioned network where a label stack is used for forwarding.
Current BGP-LU procedures do not provide mechanisms for distributing
and installing operator-assigned partition-scope labels.
Specifically, the modifications described in this document are
suitable for label distribution in the control plane of an MPLS-based
Hierarchical SDN (HSDN) Data Center (DC) and cloud network.
Hierarchical SDN (HSDN) [I-D.fang-mpls-hsdn-for-hsdc] is an
architectural solution to scale a hyper-scale cloud consisting of DCs
interconnected by a Data Center Interconnect (DCI) to tens of
millions of physical underlay endpoints, while efficiently handling
both Equal Cost Multi Path (ECMP) load-balanced traffic and any-to-
any end-to-end Traffic Engineered (TE) traffic. The HSDN reference
model, operation, and requirements are described in
[I-D.fang-mpls-hsdn-for-hsdc].
HSDN is designed to allow the physical decoupling of control and
forwarding, and have the LFIBs configured by a controller according
to a full SDN approach. Such a controller-centric approach is
described in [I-D.fang-mpls-hsdn-for-hsdc].
However, the HSDN control plane can also be built in a hybrid
approach, using a routing or label distribution protocol to
distribute the labels, together with a controller. This hybrid
approach may be particularly useful during technology migration. This
document specifies the use of BGP-LU for label distribution and LFIB
configuration in the HSDN control plane.
In the HSDN architecture, the DC/DCI network is partitioned into
hierarchical underlay partitions (UPs) such that the number of
destinations in each UP does not increase beyond the limit imposed by
capabilities of network nodes. Once the DC cloud has been partitioned
to the desired configuration, the traffic from a source endpoint to a
destination endpoint uses a stack of labels, one label per level in
the hierarchy, whose semantics indicate to the forwarding network
nodes at each level which destination in their local UP the packet
should be forwarded to. The label semantics can also identify a
specific path
(or group of paths) in the UP, rather than simply a destination.
In other words, the label stack indirectly represents the UPs that
the packet should traverse to reach the destination end device. More
precisely, the outer label specifies the destination in the partition
at the highest level that the packet should traverse, while the other
labels specify the destination in each partition that the packet
traverses thereafter.
UP0
\ +---------+ +---------+ +---------+ +---------+ /
\|UPBN1-1-1|~~~|UPBN1-1-2|-----------|UPBN1-2-1|~~~|UPBN1-2-2|/
+---------+ +---------+ +---------+ +---------+
( ) ( )
( UP1-1 ) ( UP1-2 )
( ) ( )
+---------+ +---------+ +---------+ +---------+
|UPBN2-1-1|~~~|UPBN2-1-2| |UPBN2-2-1|~~~|UPBN2-2-2|
+---------+ +---------+ +---------+ +---------+
( ) ( )
( UP2-1 ) ( UP2-2 )
( ) ( )
+---------+ +---------+ +---------+ +---------+
| Server1 |~~~| Server2 | | Server3 |~~~| Server4 |
+---------+ +---------+ +---------+ +---------+
Figure 1 - Example topology with 3 levels of partitioning
In the example of Figure 1, there are 3 levels in the hierarchical
partitioning. The UPs are connected by a number of Underlay Partition
Border Nodes (UPBNs), grouped in Underlay Partition Border Groups
(UPBGs). The UPBGs are the destinations for ECMP-forwarded traffic in
each partition.
Packets from Server3 to Server1 use a label stack consisting of 3
Path Labels (PLs) for forwarding.
- Top label (PL0) forwards the packet to one of the UPBN1-1 nodes,
which are grouped as UPBG1-1, connecting to UP1-1, which contains
Server1 (note that, by definition of HSDN forwarding, PL0 points to
UPBG1-1, i.e., the destination in UP0, rather than UPBG2-1).
- Next label (PL1) forwards the packet to one of the UPBN2-1 nodes,
which are grouped as UPBG2-1, connecting to UP2-1, which contains
Server1 (UPBG2-1 is a destination in UP1-1).
- Next label (PL2) forwards the packet to Server1 (which is a
destination in UP2-1).
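The label walk described above can be sketched as a short simulation.
The following Python is only an illustration under stated assumptions,
not part of the specification: the table layout, the label values
(1001, 2001, 3001) and the function name `forward` are all invented
for the example.

```python
# Hypothetical per-partition label tables for the left-hand side of Figure 1.
# Label values are invented; real values are operator assigned.
PARTITION_TABLES = {
    "UP0":   {1001: "UPBG1-1"},   # PL0 names a destination in UP0
    "UP1-1": {2001: "UPBG2-1"},   # PL1 names a destination in UP1-1
    "UP2-1": {3001: "Server1"},   # PL2 names a destination in UP2-1
}

# Partition that a packet enters after reaching a given border group.
NEXT_PARTITION = {"UPBG1-1": "UP1-1", "UPBG2-1": "UP2-1"}


def forward(label_stack, start_partition="UP0"):
    """Consume one label per hierarchy level and return the hops visited."""
    partition = start_partition
    hops = []
    for label in label_stack:              # outermost label first
        destination = PARTITION_TABLES[partition][label]
        hops.append((partition, destination))
        partition = NEXT_PARTITION.get(destination)
    return hops


# Server3 -> Server1 uses the stack [PL0, PL1, PL2].
print(forward([1001, 2001, 3001]))
```

Each label is interpreted only within the partition in which it is
consumed, which is why the same numeric value could in principle be
reused in different partitions.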
This document proposes modified BGP-LU based procedures for:
- How each UPBN learns the destinations in its UP and the operator
assigned partition unique labels that should be installed in its
LFIB to forward traffic to these destinations;
- How a UPBN learns the context labels used by other UPBN destinations
in the partition if the DC operator implements a policy of using
separate LFIBs for installing partition unique labels on UPBNs.
We also introduce an associated new extended community [RFC4360] that
serves the following purposes:
1. Enables a UPBN to trigger the modified BGP-LU behavior to allow
distribution of partition-unique labels to UPBNs from the Label
Mapping Server (LMS), and
2. Identifies which LFIB partition unique labels should be installed
into (if there is ambiguity due to overlapping label name spaces).
This extended community also allows persistent labels to be
advertised; such labels can survive BGP session restarts.
Strictly speaking, the labels advertised with the new mechanisms
described in this document are not typical downstream-advertised
labels, but they are more similar to upstream-advertised labels
installed in context LFIBs corresponding to upstream.
It should be noted that the BGP-LU procedures specified in this
document may be implemented through operator configured policy using
any existing BGP community types if some conditions are met. The
minor changes to the procedures and the conditions under which policy
based application of an existing BGP community can be used are
described in Section 3.5.
The procedures specified in the document are applicable to ECMP
traffic in MPLS-based HSDN DC cloud architectures.
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
This document inherits the terminology defined in
[I-D.fang-mpls-hsdn-for-hsdc] and additionally introduces the
following terms that apply when a BGP-LU based control plane is used
to realize the HSDN architecture.
o Border Node (BN): A border node is a node that is present in a UP.
In the HSDN architecture, UPBNi is a special BN that connects UPi
with UPi-1.
o Partition Label Space: Label space that is shared by all border
nodes of a UP to reach a destination in the UP. For a border node,
UP destinations comprise other border nodes and end devices that
are present in the UP.
o Partition Labels: Operator assigned labels that belong to
partition label space corresponding to a UP. The labels need not
be allocated from the platform label space on the BNs but may be
directly installed in the context table corresponding to the UP.
o Label Mapping Server (LMS): A BGP speaker present in each UP that
allocates labels for destinations in the partition and distributes
the labels to border nodes through BGP-LU.
o BGP Peer Group: Collection of BGP peers for which a set of
policies are applied on a BGP speaker.
o Partition-Unique Label Info Community: A new type of BGP extended
community that contains the operator assigned partition unique
label for the BGP destination, origin partition and border group
identifier.
o Border-group Community: This community identifies a group of
border nodes that interconnect two partitions and is configured as
policy on the border nodes as well as the LMS. It acts as the UPBG
identifier.
o Route Resolver: A single entity or a collection of entities that
provides the MPLS label stack to reach a destination underlay end
device.
Term Definition
----------- --------------------------------------------------
BGP Border Gateway Protocol
BGP-LU Border Gateway Protocol Labeled Unicast
BN Border Node
DC Data Center
DCI Data Center Interconnect
ECMP Equal Cost MultiPathing
FIB Forwarding Information Base
HSDN Hierarchical SDN
LFIB Label Forwarding Information Base
LMS Label Mapping Server
MPLS Multi-Protocol Label Switching
SDN Software Defined Network
UP Underlay Partition
UPBG Underlay Partition Border Group
UPBN Underlay Partition Border Node
TE Traffic Engineering
3. Description of BGP-LU Procedures
This section provides an overview of how operator assigned partition
label space is used to achieve end-to-end forwarding of label stacked
packets. Consider the DC network that is present in the right hand
side DC in Figure 1. The diagram in Figure 2 is a part of the DCI
network in Figure 1 (the partitions are arranged horizontally rather
than vertically as in Figure 1). UP1 in Figure 2 denotes a level 1 UP
and UP2 denotes a level 2 UP. BN1 and BN2 are UPBNs of UP1, BN3 and
BN4 are UPBNs of UP2. The nodes BN5 and BN6 may be some ToR switches
or Servers. The nodes BN3, BN4, BN2, and BN1 are internal to the
DC/DCI network (leafs and spines).
~~~~~~~~~ ~~~~~~~~~
+-----+ ( ) +-----+ ( ) +-----+
| BN1 |-( )-| BN3 |-( )-| BN5 |
+-----+ ( ) +-----+ ( ) +-----+
( UP1 ) ( UP2 )
+-----+ ( ) +-----+ ( ) +-----+
| BN2 |-( )-| BN4 |-( )-| BN6 |
+-----+ ( ) +-----+ ( ) +-----+
~~~~~~~~~ ~~~~~~~~~
Figure 2 - Example to illustrate partition labels
If the DC network in Figure 2 ran a conventional flat distributed
BGP-LU control plane using router-allocated labels, then when BN5
advertises itself as a destination to BN3, BN3 allocates a new label
(say L35) from its platform label space. If BN3 finds BN5 reachable
(through say LSP35), it advertises L35 (for destination BN5) to BN1.
Similarly, BN1 finds BN3 reachable (through say LSP13) and pushes two
labels - bottom label is L35 and top label is the LSP13 label. In
this model, BN3 stitches L35 to LSP35 that takes the packet to BN5.
The same procedure runs on BN4, which allocates a label (say L45, in
general different from L35) from its own platform label space for BN5
and advertises the label to BN1. This model is not suitable when end-
to-end traffic from a Server behind BN1 or BN2 (not shown in the
figure) to a Server behind BN5 or BN6 (not shown in the figure) needs
to be forwarded using a label stack imposed by the SDN Controller
with the condition that the label stack does not depend on the BN
traversed to reach UP2 from UP1.
This document specifies a mechanism to implement the forwarding model
using label stacks imposed by the SDN Controller without the
limitation described in the previous paragraph. The new procedures
introduced in this document are explained using the above example.
1. BN5 and BN6 advertise their own loopback addresses in UP2.
Assuming BN5 and BN6 do not belong to any border group, the BGP-LU
advertisements from BN5 and BN6 contain a NULL label. The routes
will be:
{Nlri: BN5, Label: NULL, Nh: BN5}
{Nlri: BN6, Label: NULL, Nh: BN6}
2. BN3 and BN4 do not allocate labels for BN5 and BN6 from their own
platform label space when they receive the BGP-LU advertisements.
This is because BN3 and BN4 are configured to be part of a border
group for UP2 destinations. Both BN3 and BN4 are configured with
border group community "Border-group-2".
3. BN3 and BN4 re-advertise BN5 and BN6 as IP NLRI destinations (with
BGP next-hop self) to the LMS assigned for UP2 and append the
"Partition-Unique Label Info" extended community. The
Partition-Unique Label Info extended community and the procedures
relating to it are newly introduced in this document. Refer to
Section 3.1 for the extended community format and Section 3.2 for
LMS procedures. The R-bit in the extended community is set to
indicate that the originator requests the receiver to assign a label
and reflect the partition label info community with the label
assigned by the LMS. The routes for the BN5 destination will be:
{Nlri: BN5, Nh: BN3, Com: Border-group-2, Label-Ext-Comm: R}
{Nlri: BN5, Nh: BN4, Com: Border-group-2, Label-Ext-Comm: R}
If the operator has set aside a BGP community value that
unambiguously indicates that the next-hop (BN3 or BN4) in the BGP
route requests a label to be allocated for the destination (BN5)
in the UP2 partition, then the newly specified Partition label info
extended community need not be added to the route. Refer to Section
3.5 for details.
4. UP2 LMS processes the IP routes for BN5 and BN6, assigns labels
for them (or simply reads the labels from label mapping database
configured by operator) and originates a BGP-LU route containing
the label assigned for the UP2 destinations. LMS may set the P-bit
to indicate that the label can be persistent and can be retained
for a specified time period. For the two IP routes for BN5
originated by BN3 and BN4, the BGP-LU routes originated by LMS
will be:
{Nlri: BN5, Label: L5, Nh: BN3, Com: Border-group-2, Label-Ext-
Comm: P:UP2-context}
{Nlri: BN5, Label: L5, Nh: BN4, Com: Border-group-2, Label-Ext-
Comm: P:UP2-context}
The procedures that apply when the newly specified partition label
info extended community is not used are described in Section 3.5.
5. BN3 and BN4 install the label route in the context table
corresponding to UP2-context only when they learn the BGP-LU route
for BN5 advertised by the LMS of UP2. Note that the operator may
configure BN3 and BN4 to install the operator assigned label for
BN5 in the main LFIB itself (instead of UP2-context). The operator
may choose this option if non-overlapping labels are assigned for
different UPs.
6. BN3 and BN4 do not advertise BN5 and BN6 in UP1 but only advertise
their own loopback addresses. As BN3 and BN4 are configured to be
part of a border group, the border group identifier advertised as
community is the same in BGP-LU advertisements from BN3 and BN4.
If the partitions may have overlapping label spaces, then BN3 and
BN4 advertise non-NULL labels in their BGP-LU advertisements. BN3
and BN4 install the label (that gets advertised) in default LFIB
and point the label entry to the context table for UP2. In such a
case, the routes from BN3 and BN4 will be:
{Nlri: BN3, Label: CL3, Nh: BN3, Com: Border-group-1}
{Nlri: BN4, Label: CL4, Nh: BN4, Com: Border-group-1}
7. BN1 and BN2 do not allocate labels for BN3 and BN4 from their
platform label space when they receive BGP-LU advertisement. BN1
and BN2 only use the BGP-LU advertisement from BN3 and BN4 for
determining the labels to be pushed during forwarding. Note that
if there are intermediate routers between BN1/BN2 and BN3/BN4,
then the labels CL3 and CL4 advertised by BN3 and BN4 will be used
by those intermediate routers for determining the labels to be
pushed.
8. BN1 and BN2 re-advertise BN3 and BN4 as IP destinations (with BGP
next-hop self) to the LMS assigned for UP1 and append the
"Partition-Unique Label Info" extended community. The R-bit is set
to indicate that the originator requests the receiver to assign a
label and reflect the partition label info community with the label
assigned by the LMS. The routes for the BN3 destination will be:
{Nlri: BN3, Nh: BN1, Com: Border-group-1, Label-Ext-Comm: R}
{Nlri: BN3, Nh: BN2, Com: Border-group-1, Label-Ext-Comm: R}
The procedures that apply when the newly specified partition label
info extended community is not used are described in Section 3.5.
9. UP1 LMS processes the IP routes for BN3 and BN4, assigns labels
for them (or simply reads the labels from label mapping database
configured by operator) and originates a BGP-LU route containing
the label assigned for the UP1 destinations. As the group label
advertisements will differ only in BGP next-hop, BGP add-path
should be enabled on the peer group between LMS and BNs. LMS may
set the P-bit to indicate that the advertised label can be persistent
and can be retained for a specified time. For the two IP routes for
BN3 originated by BN1 and BN2, the BGP-LU routes originated by LMS
will be:
{Nlri: BN3, Label: L3, Nh: BN1, Com: Border-group-1, Label-Ext-
Comm: P:UP1-context}
{Nlri: BN3, Label: LG2, Nh: BN1, Com: Border-group-1, Label-Ext-
Comm: PG:UP1-context}
{Nlri: BN3, Label: L3, Nh: BN2, Com: Border-group-1, Label-Ext-
Comm: P:UP1-context}
{Nlri: BN3, Label: LG2, Nh: BN2, Com: Border-group-1, Label-Ext-
Comm: PG:UP1-context}
Note that there are two BGP-LU routes with same NLRI for
advertising group label and so BGP add-path
[I-D.ietf-idr-add-paths] should be enabled between LMS and BNs.
10. BN1 and BN2 install the label route in the context table that
has been configured on BN1 and BN2 to contain all UP1 destinations
only when they learn the BGP-LU route for BN3 advertised by the
LMS of UP1. Note that the operator may configure BN1 and BN2 to
install the operator assigned label for BN3 in the main LFIB itself
(instead of UP1-context). The operator may choose this option if
non-overlapping labels are assigned for different UPs.
Apart from advertising partition labels to BNs, the LMSs also
advertise the routes (IP routes received from BNs as well as the
BGP-LU routes originated back to BNs) to the Route Resolver. The
Resolver is a logically centralized component that constructs label
stacks for end-to-end traffic, and it uses the routes advertised
from LMSs as inputs for constructing label stacks.
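As a rough illustration of the Resolver's role, the sketch below
assembles a label stack from per-partition label mappings of the kind
the LMSs advertise. It is a minimal sketch under stated assumptions:
the dictionary layout, the numeric label values standing in for L3,
LG2 and L5, and the function name `resolve_stack` are hypothetical and
are not defined by this document.

```python
# Label mappings a Resolver might hold after the LMSs of Figure 2 have
# advertised to it. Keys are (partition, destination); values are invented.
LMS_LABELS = {
    ("UP1", "BN3"): 300,              # stands in for L3 (individual label)
    ("UP1", "Border-group-1"): 320,   # stands in for LG2 (group label)
    ("UP2", "BN5"): 500,              # stands in for L5
}


def resolve_stack(path):
    """Return the label stack, outermost label first, for a list of
    (partition, destination) pairs ordered from the highest level down."""
    stack = []
    for partition, destination in path:
        key = (partition, destination)
        if key not in LMS_LABELS:
            raise LookupError(f"no label for {destination} in {partition}")
        stack.append(LMS_LABELS[key])
    return stack


# ECMP towards the border group of UP2, then on to BN5.
print(resolve_stack([("UP1", "Border-group-1"), ("UP2", "BN5")]))
```

Choosing the group label at a level lets any member of the border
group carry the traffic, while choosing an individual BN label pins
the traffic to that node.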
The description of the procedures using the example DC network in
Figure 2 provides an overview of how the LFIB states are set up so
that traffic entering BN1 or BN2 is forwarded to BN5 or BN6
("downward traffic"). It should be noted that this overview has not
explained how packets from a source in a remote DC can reach BN5 or
BN6. In other words, the overview has not yet explained how packets
are exchanged between servers in one DC and servers in the other DC
in Figure 1. The description of how the LFIB states are set up for
"upward traffic" is presented in Section 3.4.
3.1. Partition-Unique Label Info Extended Community
This document introduces a new extended community that enables the
originator of a BGP-LU route to convey the information specified
below.
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type=TBD | Sub-Type=TBD | Flags(1 octet)| Reserved=0 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Partition context identifier |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Partition Label Retention Period |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Flags
R-bit: Set to 1 if the originator requests label
G-bit: Set to 1 if the label is a group label
P-bit: Set to 1 if the receiver can retain label for specified
time even if BGP peering between LMS and BN is lost
Partition context identifier: Context table identifier to which label
will be installed
Partition label retention period: Timer period in seconds that the
label can be retained after the BGP peering between LMS and BN is
lost. This value must be zero if P-bit is not set.
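The field layout above can be exercised with a small encoder and
decoder. This is a sketch under explicit assumptions: the Type and
Sub-Type code points are still TBD, so zero is used as a placeholder,
and the positions of the R, G and P bits within the Flags octet are
not stated in this document, so the values below are hypothetical.

```python
import struct
from typing import NamedTuple

# Placeholder code points: Type and Sub-Type are TBD in this document.
EC_TYPE = 0x00
EC_SUBTYPE = 0x00

# Assumed bit positions; the document names the flags but not their positions.
FLAG_R = 0x80
FLAG_G = 0x40
FLAG_P = 0x20

# type, sub-type, flags, reserved, context identifier, retention period
_LAYOUT = struct.Struct("!BBBBII")


class LabelInfo(NamedTuple):
    flags: int
    context_id: int
    retention_s: int


def encode(info: LabelInfo) -> bytes:
    """Pack the fields in network byte order, following the figure."""
    if not (info.flags & FLAG_P) and info.retention_s != 0:
        raise ValueError("retention period must be zero when the P-bit is clear")
    return _LAYOUT.pack(EC_TYPE, EC_SUBTYPE, info.flags, 0,
                        info.context_id, info.retention_s)


def decode(data: bytes) -> LabelInfo:
    """Unpack the fields; Type, Sub-Type and Reserved are ignored here."""
    _type, _subtype, flags, _reserved, context_id, retention_s = _LAYOUT.unpack(data)
    return LabelInfo(flags, context_id, retention_s)


wire = encode(LabelInfo(FLAG_P, context_id=2, retention_s=7200))
print(len(wire), decode(wire))
```

The encoder rejects a non-zero retention period when the P-bit is
clear, mirroring the rule stated for that field.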
3.2 Partition-Unique Label Info Extended Community Procedures
LMS is a BGP speaker that implements the following new procedures
when it receives an IP route BGP advertisement containing "Partition-
Unique Label Info" extended community.
- If IGP is the routing protocol within a UP, then LMS may be
implemented as a modified Route Reflector (RR) [RFC4456] assigned
for the UP.
- If eBGP runs within a UP, then the BGP peering between LMS and
each border node should be configured by the operator, and on the
BNs the eBGP peering with LMS should be configured in a peer group
separate from eBGP peering with other routers in the partition.
Note that even if eBGP is in use, the LMS may be considered to act
as a "modified reflector" because the primary goal of LMS is to
return the partition label to the BN.
- LMS is configured with all the border groups that are connected to
the UP where each border group is identified by a unique value of
Border-group community.
When LMS receives an IP route advertisement whose NLRI and BGP next-
hop are the same, then it executes the following procedure.
1. If the operator has already assigned a label (DstLabel) for the UP
destination in the NLRI, then no action is performed.
2. If the operator has not assigned a label for the UP destination,
then LMS allocates a label (DstLabel) and stores the mapping
between the UP destination and the label.
3. If the IP route advertisement also contains a known Border-group
community and if the operator has not assigned a label for the
border group, then LMS allocates a label and stores the mapping
between the Border-group and the allocated label. Let the label
assigned or allocated be BGLabel. LMS also adds the NLRI to the
list of nodes belonging to the Border-group community contained in
the route.
4. After executing the preceding procedures, LMS advertises the IP
route to the Route Resolver.
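The handling of a route whose NLRI and BGP next-hop are the same can
be sketched as follows. This is an illustration only: the class name
`LabelMappingServer`, its method names, and the starting value for
dynamically allocated labels are assumptions, and the sketch does not
guard against a dynamically allocated label colliding with an operator
assigned one.

```python
from itertools import count


class LabelMappingServer:
    def __init__(self, known_groups, operator_labels=None,
                 first_dynamic_label=100000):
        self.known_groups = set(known_groups)
        # Maps a UP destination or a Border-group to its label.
        self.labels = dict(operator_labels or {})
        self.members = {group: set() for group in self.known_groups}
        self._next = count(first_dynamic_label)

    def _label_for(self, key):
        # Steps 1 and 2: reuse an existing label, otherwise allocate one.
        if key not in self.labels:
            self.labels[key] = next(self._next)
        return self.labels[key]

    def on_self_originated_route(self, nlri, next_hop, communities=()):
        if nlri != next_hop:
            raise ValueError("this handler covers only NLRI == next-hop")
        dst_label = self._label_for(nlri)            # DstLabel
        bg_label = None
        for group in communities:
            if group in self.known_groups:           # step 3
                bg_label = self._label_for(group)    # BGLabel
                self.members[group].add(nlri)
        # Step 4: the IP route would now be advertised to the Route Resolver.
        return dst_label, bg_label


lms = LabelMappingServer({"Border-group-1"}, operator_labels={"BN3": 300})
print(lms.on_self_originated_route("BN3", "BN3", ["Border-group-1"]))
```

The membership set recorded in step 3 is what later allows the LMS to
decide whether a requested destination also has a group label.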
When LMS receives an IP route advertisement whose NLRI and BGP next-
hop are different, then it executes the following procedures.
1. If the IP route advertisement does not contain "Partition-Unique
Label Info" extended community, then no further action is taken.
Alternatively, if the LMS is configured with a policy to interpret
a BGP community configured on it as equivalent to "partition label
info" extended community, then the subsequent steps may be
executed (refer to Section 3.5 for details).
2. If the IP route advertisement contains "Partition-Unique Label
Info" extended community but the BGP next-hop does not belong to
any known Border-group community configured on the LMS, then no
further action is taken.
3. If none of the above conditions is true, then the LMS executes the
following procedures.
a. LMS retrieves the DstLabel label already assigned for the UP
destination. LMS originates BGP-LU route with DstLabel set in
the NLRI and clears the G-bit in "Partition-Unique Label Info"
extended community. If the partition labels are operator
assigned and are read from the label mapping database, then LMS sets
P-bit in the extended community flags and sets the "partition
label retention period" to the value configured on LMS (default
value is 7200 seconds).
b. If the NLRI of the IP route belongs to a known Border-group
community configured on the LMS, then the LMS also retrieves
the BGLabel assigned for the Border-group. LMS also originates
a BGP-LU route with BGLabel set in the NLRI and sets the G-bit
in the "Partition-Unique Label Info" extended community. If the
partition labels are operator assigned and are read from the
label mapping database, then LMS sets the P-bit in the extended
community flags and sets the "partition label retention period"
to the value configured on LMS (default value is 7200 seconds).
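The label request handling (NLRI and next-hop different) can be
sketched as a single function. The sketch makes several assumptions
that are not part of this document: flags are modelled as a set of
names rather than bits, passing `None` for the flags stands for a
route that lacks the extended community, and the argument names, the
`LuRoute` record and the group name used for BN1 are hypothetical.

```python
from typing import NamedTuple

R, G, P = "R", "G", "P"          # flag names only; bit positions not modelled
DEFAULT_RETENTION_S = 7200       # default value stated in step 3


class LuRoute(NamedTuple):
    nlri: str
    label: int
    next_hop: str
    flags: frozenset
    retention_s: int


def on_label_request(nlri, next_hop, ext_comm_flags, dst_labels, group_of,
                     group_labels, operator_assigned=True):
    """Return the BGP-LU routes an LMS would originate for one IP route
    whose NLRI and next-hop differ."""
    if ext_comm_flags is None:       # step 1: extended community absent
        return []
    if next_hop not in group_of:     # step 2: next-hop not in a known group
        return []
    base = {P} if operator_assigned else set()
    retention = DEFAULT_RETENTION_S if operator_assigned else 0
    # Step 3a: individual label, G-bit cleared.
    routes = [LuRoute(nlri, dst_labels[nlri], next_hop,
                      frozenset(base), retention)]
    # Step 3b: group label, G-bit set, when the destination is in a known group.
    if nlri in group_of:
        routes.append(LuRoute(nlri, group_labels[group_of[nlri]], next_hop,
                              frozenset(base | {G}), retention))
    return routes


routes = on_label_request(
    "BN3", "BN1", {R},
    dst_labels={"BN3": 300},
    group_of={"BN1": "Border-group-0", "BN3": "Border-group-1"},
    group_labels={"Border-group-1": 320},
)
for route in routes:
    print(route)
```

Because both returned routes carry the same NLRI, a deployment along
these lines would need add-path between the LMS and the BNs, as noted
in the overview.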
When the BN that originated the IP route receives the BGP-LU route
"reflected" back by the LMS, it executes the following procedures.
1. BN first checks whether R-bit is cleared in "Partition-Unique
Label Info" extended community. If R-bit has been reset, the label
in the NLRI is installed in the context table corresponding to the
"partition context identifier" present in the extended community.
If "partition context identifier" is zero, then BN installs the
label entry in default LFIB.
2. If P-bit is set, then BN should retain the label entry in the
designated LFIB (context or default) for the time period specified
in "partition label retention period" if the BGP peering with
LMS is lost. After BGP peering with LMS is lost, the BN should
start the "label retention timer" for the labels learnt from the
LMS. When the BGP peering is restored, BN should reset the "label
retention timer" and re-advertise IP routes corresponding to all
UP destinations it had originated before. This procedure ensures
that both LMS and BNs exchange all requisite routes before
reaching steady state again.
3. If P-bit is not set, then BN should delete the label entry
immediately when BGP peering with LMS is lost.
4. BN should delete the label entry from the LFIB when LMS withdraws
the BGP-LU route containing the "Partition-Unique Label Info"
extended community.
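The BN-side behaviour in the four steps above can be sketched as a
small state machine. This is an illustration under assumptions: the
class and method names are hypothetical, time is passed in explicitly
as a number of seconds, context identifier 0 stands for the default
LFIB, and a zero retention period is treated the same as a cleared
P-bit.

```python
class BorderNode:
    def __init__(self):
        self.lfibs = {0: {}}     # context id 0 stands for the default LFIB
        self.retention = {}      # (context, label) -> seconds allowed after loss
        self.expiry = {}         # (context, label) -> absolute expiry time

    def on_reflected_route(self, label, nlri, r_bit, p_bit, context_id,
                           retention_s):
        if r_bit:                # step 1: install only when the R-bit is cleared
            return False
        self.lfibs.setdefault(context_id, {})[label] = nlri
        self.retention[(context_id, label)] = retention_s if p_bit else 0
        return True

    def on_peering_lost(self, now):
        for key, seconds in list(self.retention.items()):
            if seconds == 0:     # step 3: not persistent, delete immediately
                self._delete(key)
            else:                # step 2: start the label retention timer
                self.expiry[key] = now + seconds

    def on_peering_restored(self):
        # Step 2: timers are reset; re-advertising IP routes is not modelled.
        self.expiry.clear()

    def on_tick(self, now):
        for key, deadline in list(self.expiry.items()):
            if now >= deadline:
                self._delete(key)

    def on_withdraw(self, context_id, label):    # step 4
        self._delete((context_id, label))

    def _delete(self, key):
        context_id, label = key
        self.lfibs.get(context_id, {}).pop(label, None)
        self.retention.pop(key, None)
        self.expiry.pop(key, None)


bn = BorderNode()
bn.on_reflected_route(500, "BN5", r_bit=False, p_bit=True,
                      context_id=2, retention_s=7200)
bn.on_peering_lost(now=0)
print(bn.lfibs[2])
```

Keeping the retention period per entry, rather than per session,
matches the fact that the period is carried on each advertised route.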
3.3 BGP Policies on UPBNs and LMS
The BGP-LU based control plane mechanism specified in this document
assumes that the following policies are applied on the various
network nodes in the HSDN architecture. The required policy
configurations are listed below.
- Each UPBN that connects two UPs is configured with a unique
Border-group community to advertise its membership in a "border
group" or UPBG. For example, in Figure 1 UPBN1-1-1 and UPBN1-1-2 are
configured with the same Border-group community, which uniquely
represents the connectivity of the two BNs to UP1-1.
- Depending on the routing protocol used within a UP, each UPBN
should have either iBGP or eBGP peering sessions such that all
lower level UPBNs or end-devices that are connected to the UP learn
of each other. For example, the BNs present in UP1-1 in Figure 1 are
UPBN1-1-1, UPBN1-1-2, UPBN2-1-1 and UPBN2-1-2, and each of them
should learn the loopback address of the other BNs.
- Each UP should have a Label Mapping Server (LMS) that advertises to
all the UPBNs the operator assigned partition labels corresponding
to each UP destination. Destinations of UPi consist of all
individual UPBNi+1 connected to UPi and lower level UPBGs connected
to UPi. For example, destinations of UP1-1 (Figure 1) are
UPBN2-1-1, UPBN2-1-2 and UPBG2-1, and LMS-1-1 will assign and
advertise three labels for UP1-1.
- Each BN in a UP should also have an iBGP or eBGP peering session
with the LMS of the UP. For example, all BNs in UP1-1 should have an
eBGP peering session with LMS-1-1 if UP1-1 runs eBGP as its routing
protocol.
- UPBNj has a policy to automatically export destinations learnt from
UPBNi peer group to UPj peer group (where i=j-1). But UPBNj does
not export destinations learnt from UPj peer group to UPBNi peer
group. This export policy on UPBNj limits the number of BGP
advertisements that any network node in UPi has to process apart
from limiting the number of LFIB entries in network nodes.
3.4 BGP-LU Procedures for UP0 Destinations
It should be noted that in the example topology in Figure 2, the BNs
attached to UP1 and UP2 have been specified as UP destinations for
illustration purposes only. Even a remote destination can be
considered as a UP destination as long as the route is leaked into
the UP. In the HSDN architecture, even though the BNs connected to
UP0 are remote for the UPBNs from level 2 down to the leaf level, as
long as the normal BGP-LU route leaking policy (specified in Section
3.3) is followed, the LMS of the level 2 (or lower level) UPs will
have to assign labels for BNs in UP0 (or UP0 destinations). For
example,
UPBN2-1-1 and UPBN2-1-2 (Figure 1) will learn UPBN1-1-1, UPBN1-1-2,
UPBN1-2-1 and UPBN1-2-2 because UPBN1-2-1 and UPBN1-2-2 are leaked
into UP1-1.
In the DC cloud network specified in Figure 1, the following
procedures are executed to enable packets with top label PL0 to reach
one of the UPBNs connecting to UP0. To obtain end-to-end forwarding
using a three label stack in an HSDN network with two levels (i.e.
Servers located in UP2-x), the LMSs of all UP2-x and UP1-x are set up
such that they reflect the same label (i.e. PL0 label) for every UP0
destination (BNs as well as border groups).
1. UPBN1-2-1 and UPBN1-2-2 advertise their own loopback addresses in
UP0. As the UPBNs are configured to be part of a border group, the
border group community is the same in BGP-LU advertisements. If
the partitions may have overlapping label spaces, then UPBN1-2-1
and UPBN1-2-2 advertise non-NULL labels in their BGP-LU
advertisements. UPBN1-2-1 and UPBN1-2-2 install the label (that
gets advertised) in the default LFIB and point the label entry to
the context table for UP1-2. In such a case, the routes from
UPBN1-2-1 and UPBN1-2-2 will be:
{Nlri: UPBN1-2-1, Label: CL11, Nh: UPBN1-2-1, Com: Border-group-0}
{Nlri: UPBN1-2-2, Label: CL12, Nh: UPBN1-2-2, Com: Border-group-0}
2. For UPBN1-1-1 and UPBN1-1-2, the routes to UPBN1-2-1 and UPBN1-2-2
are in the same partition (i.e., UP0). The labels assigned for
UPBN1-2-1, UPBN1-2-2 and UPBG1-2 are the same on LMS-0, LMS-1-1 and
LMS-2-1, so all BNs in the left-hand side DC in Figure 1 install the
same label for UPBN1-2-1, UPBN1-2-2 and UPBG1-2.
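The requirement that LMS-0, LMS-1-1 and LMS-2-1 assign the same label
to each UP0 destination can be checked mechanically. The following is
a minimal sketch only: the per-LMS label tables, the numeric label
values, and the function name are hypothetical and are not defined by
this document.

```python
# Hypothetical per-LMS label assignments for UP0 destinations.
# The numeric label values are invented for illustration.
lms_tables = {
    "LMS-0":   {"UPBN1-2-1": 1001, "UPBN1-2-2": 1002, "UPBG1-2": 1000},
    "LMS-1-1": {"UPBN1-2-1": 1001, "UPBN1-2-2": 1002, "UPBG1-2": 1000},
    "LMS-2-1": {"UPBN1-2-1": 1001, "UPBN1-2-2": 1002, "UPBG1-2": 1000},
}


def mismatched_up0_labels(tables):
    """Return the UP0 destinations whose label differs between LMSs."""
    first_seen = {}     # destination -> first label observed
    mismatched = set()
    for table in tables.values():
        for destination, label in table.items():
            if first_seen.setdefault(destination, label) != label:
                mismatched.add(destination)
    return mismatched


# An empty result means every LMS reflects the same label for every
# UP0 destination, as the procedure requires.
print(mismatched_up0_labels(lms_tables))
```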
Note that, as all BNs in the DC cloud install the same label for a UP0
destination, the implementations of all BNs should have a common label
range (shared among the different platform label spaces of all BNs)
that can be set aside for the UP0 destinations. If this is not
possible, then all BNs should be configured with a separate context
table for the UP0 partition. The BGP-LU procedures involving the
"Partition-unique label info" community support both forms of
forwarding.
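The choice between the two forms of forwarding can be illustrated by
intersecting the label ranges of the BNs. This is a rough sketch under
a simplifying assumption that is not made by this document: each BN
exposes a single contiguous, inclusive platform label range. The
function names, the mode strings, and the example ranges are
hypothetical.

```python
def common_label_range(platform_ranges):
    """Intersect inclusive (low, high) label ranges; None if empty."""
    if not platform_ranges:
        return None
    low = max(r[0] for r in platform_ranges.values())
    high = min(r[1] for r in platform_ranges.values())
    return (low, high) if low <= high else None


def up0_forwarding_mode(platform_ranges, labels_needed):
    """Pick the default LFIB if a large enough common range exists,
    otherwise fall back to a separate context table for UP0."""
    common = common_label_range(platform_ranges)
    if common is not None and common[1] - common[0] + 1 >= labels_needed:
        return "default-lfib", common
    return "context-table", None


# Hypothetical platform label ranges for three BNs.
ranges = {"BN1": (16, 1000), "BN2": (500, 2000), "BN3": (300, 900)}
print(up0_forwarding_mode(ranges, labels_needed=100))
```

When no sufficiently large common range exists, the sketch selects the
context table, matching the fallback described above.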
3.5 Advertising labels without partition label extended community
The procedures specified in Section 3.2 may be executed on the LMS and
border nodes without using the newly defined partition label info
extended community, using an existing BGP community instead, if all of
the following conditions are true.
- Each partition has a separate LMS, such that a border node connecting
two partitions must have a separate BGP peering with the LMS of each
of the two partitions.
- Both LMS and BNs are configured with a BGP community and both LMS
and BNs interpret that community as an indication from the BGP peer
that the procedures specified in Section 3 of this document should be
applied. If the LMS receives an IP route advertisement whose NLRI and
next-hop attribute are different and which contains the pre-configured
BGP community, then the LMS should interpret the update as a label
request from the BGP peer for the IP destination corresponding to the
NLRI. Similarly, when a BN receives a BGP-LU advertisement for a
destination for which the BN has originated an IP route, and the
BGP-LU advertisement contains the pre-configured BGP community, then
the BN should interpret the update as a partition label advertisement
from the LMS for the IP destination corresponding to the NLRI.
- BNs are configured with the LFIB into which the label advertised by
the LMS should be installed. In this model, the LMS cannot advertise
the LFIB into which the label forwarding entry should be installed.
- Both LMS and BNs are configured with a label retention policy that
applies if the BGP peering between the LMS and BNs fails. For example,
both LMS and BNs may be configured with a label retention period of
7200 seconds so that BNs can retain the LFIB entry for 7200 seconds
even if BGP peering with the LMS fails.
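The interpretation rules and the retention policy listed above can be
sketched as follows. This is an illustrative model only, not a BGP
implementation: the Update type, the community value "65000:100", the
example addresses, and all function names are hypothetical, and an
update without a label stands in for a plain IP route advertisement.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

HSDN_COMMUNITY = "65000:100"  # hypothetical pre-configured community
RETENTION_SECONDS = 7200      # example retention period from the text


@dataclass(frozen=True)
class Update:
    nlri: str
    next_hop: str
    communities: FrozenSet[str]
    label: Optional[int] = None  # None models a plain IP route


def lms_sees_label_request(update):
    """LMS side: IP route, NLRI differs from next hop, community set."""
    return (update.label is None
            and update.nlri != update.next_hop
            and HSDN_COMMUNITY in update.communities)


def bn_sees_partition_label(update, originated):
    """BN side: BGP-LU route for a destination this BN originated."""
    return (update.label is not None
            and update.nlri in originated
            and HSDN_COMMUNITY in update.communities)


def retain_lfib_entry(peering_lost_at, now):
    """Keep the LFIB entry until the retention period has elapsed."""
    return now - peering_lost_at < RETENTION_SECONDS


request = Update("10.0.0.5", "10.0.0.1", frozenset({HSDN_COMMUNITY}))
reply = Update("10.0.0.5", "10.0.0.1", frozenset({HSDN_COMMUNITY}), 3001)
print(lms_sees_label_request(request),
      bn_sees_partition_label(reply, originated={"10.0.0.5"}))
```

Because the LMS cannot signal the target LFIB in this model, the BN
would install the received label into its locally configured LFIB.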
4. Route Resolution in HSDN Architecture
As a consequence of the procedures described in Section 3, the Route
Resolver of the network will have knowledge of the destinations in all
UPs and of the UPBNs that have advertised those UP destinations. The
Route Resolver uses this information to construct the MPLS label stack
needed to forward a packet to the desired destination End-device.
Note that the procedure specified in this Section is for illustration
purposes only, and hence an implementation of the Resolver is free to
choose a more optimal mechanism to obtain the same result. The
resolution for a given DstServer or End-device IP address works as
follows.
1. The Resolver should have received all BGP-LU routes of all
End-devices from the LMSs of all "leaf" UPs, with the BGP next-hop
specifying the UPBN that serves the UP. The Resolver looks up the
given DstServer IP address in the resolution database. If the IP
address is not present, then the Resolver considers the resolution
as having failed.
2. If the DstServer has been advertised by the LMS of a UP, then the
Resolver obtains the BGP next-hop from the BGP-LU route
advertisement. The BGP next-hop is the UPBN of the leaf UP. Note
that there may be multiple BGP-LU routes advertising the same
DstServer. Assuming the policy is to use ECMP for the traffic, the
Resolver picks the BGP-LU advertisement having the G-bit set in the
"Partition-Unique Label Info" extended community and adds the
BGLabel to the resulting stack. Assuming the DstServer is located
in a second level UP and LG2 is the group label, the stack will be
{LG2}.
3. The Resolver then looks up the UPBN in the resolution database. If
the UPBN IP address is not present, then the Resolver considers the
resolution as having failed. If there are one or more BGP-LU routes
with the UPBN as the destination, then the Resolver obtains the
BGP next-hop(s). Assuming the policy is to use ECMP for the
traffic, the Resolver picks the BGP-LU advertisement having the
G-bit set in the "Partition-Unique Label Info" extended community
and adds the BGLabel to the resulting stack. Assuming LG1 is the
group label of the level 1 UPBG, the stack will be {LG1, LG2}.
4. As the resolution has reached the level 1 UPBN (which is a BN in
UP0), the Resolver looks up the level 1 UPBN in the resolution
database. There should be multiple BGP-LU routes with the level 1
UPBN as destination. Assuming the policy is to use ECMP for the
traffic, the Resolver picks the BGP-LU advertisement having the
G-bit set in the "Partition-Unique Label Info" extended community
and adds the BGLabel to the resulting stack. Assuming LG0 is the
group label of the level 0 BG, the stack will be {LG0, LG1, LG2}.
At this point the resolution is considered successful (refer to
Section 3.4) and the Resolver returns the resultant label stack to
the querying system.
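The four resolution steps above can be sketched as a short loop. This
is a minimal illustration, not a normative algorithm. The Route type,
the database layout, the symbolic label names, and the max_levels
guard are hypothetical. The sketch also assumes that a route whose
next-hop equals its own NLRI marks a UP0 destination (as in the
example routes of Section 3.4), and that resolution fails when no
advertisement has the G-bit set.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Route:
    nlri: str
    next_hop: str
    label: str    # symbolic label name, e.g. "LG2"
    g_bit: bool   # G-bit of the "Partition-Unique Label Info" community


def resolve(db: Dict[str, List[Route]], dst: str,
            max_levels: int = 8) -> Optional[List[str]]:
    """Build the label stack for dst, outermost label first."""
    stack: List[str] = []
    target = dst
    for _ in range(max_levels):      # guard against lookup loops
        routes = db.get(target)
        if not routes:
            return None              # unknown address: resolution failed
        group = next((r for r in routes if r.g_bit), None)
        if group is None:
            return None              # assumption: a group label is required
        stack.insert(0, group.label) # outer levels go on top of the stack
        if group.next_hop == target:
            return stack             # reached a BN in UP0: success
        target = group.next_hop
    return None


# Hypothetical resolution database for a two-level HSDN network.
example_db = {
    "DstServer": [Route("DstServer", "UPBN2-1-1", "L-node", False),
                  Route("DstServer", "UPBN2-1-1", "LG2", True)],
    "UPBN2-1-1": [Route("UPBN2-1-1", "UPBN1-1-1", "LG1", True)],
    "UPBN1-1-1": [Route("UPBN1-1-1", "UPBN1-1-1", "LG0", True)],
}
print(resolve(example_db, "DstServer"))
```

For the example database, the loop visits DstServer, UPBN2-1-1 and
UPBN1-1-1 in turn, mirroring steps 2 through 4.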
5. Security Considerations
The procedures defined in this document do not necessitate any
security considerations.
6. IANA Considerations
This document defines a new extended community type (see Section
3.1).
7. Acknowledgments
We would like to thank Kaliraj Vairavakkalai and Balaji Rajagopalan
for their valuable input and feedback.
8. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3107] Rekhter, Y. and E. Rosen, "Carrying Label Information in
BGP-4", RFC 3107, May 2001.
[RFC4456] Bates, T., Chen, E., and R. Chandra, "BGP Route
Reflection: An Alternative to Full Mesh Internal BGP
(IBGP)", RFC 4456, April 2006.
[RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
Communities Attribute", RFC 4360, February 2006.
9. Informative References
[I-D.fang-mpls-hsdn-for-hsdc] Fang, L., et al., "MPLS-Based
Hierarchical SDN for Hyper-Scale DC/Cloud", draft-fang-
mpls-hsdn-for-hsdc-04 (work in progress), July 2015.
[I-D.ietf-idr-add-paths] Walton, D., et al., "Advertisement of
Multiple Paths in BGP", draft-ietf-idr-add-paths-13 (work
in progress), December 2015.
Authors' Addresses
Luyuan Fang
Microsoft
15590 NE 31st St.
Redmond, WA 98052
Email: lufang@microsoft.com
Deepak Bansal
Microsoft
15590 NE 31st St.
Redmond, WA 98052
Email: dbansal@microsoft.com
Chandra Ramachandran
Juniper Networks
Bangalore, India
Email: csekar@juniper.net
Fabio Chiussi
Seattle, Washington 98116
Email: fabiochiussi@gmail.com
Nabil Bitar
Verizon
40 Sylvan Road
Waltham, MA 02145
Email: nabil.bitar@verizon.com
Yakov Rekhter