Network Working Group                                    T. Morin, Ed.
Internet-Draft                                                  Orange
Intended status: Standards Track                        R. Kebler, Ed.
Expires: August 4, 2020                               Juniper Networks
                                                        G. Mirsky, Ed.
                                                             ZTE Corp.
                                                      February 1, 2020
Multicast VPN fast upstream failover
draft-ietf-bess-mvpn-fast-failover-09
This document defines multicast VPN extensions and procedures that allow fast failover for upstream failures, by allowing downstream PEs to take into account the status of Provider-Tunnels (P-tunnels) when selecting the upstream PE for a VPN multicast flow, and extending BGP MVPN routing so that a C-multicast route can be advertised toward a standby upstream PE.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 4, 2020.
Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
In the context of multicast in BGP/MPLS VPNs, it is desirable to provide mechanisms allowing fast recovery of connectivity for different types of failures. This document addresses failures of elements in the provider network that are upstream of PEs connected to VPN sites with receivers.
Section 3 describes local procedures that allow an egress PE (a PE connected to a receiver site) to take into account the status of P-tunnels when determining the Upstream Multicast Hop (UMH) for a given (C-S, C-G). Used alone, this method does not provide a "fast failover" solution, but it can be combined with the mechanisms of the following sections to provide one.
Section 4 describes protocol extensions that can speed up failover by not requiring any multicast VPN routing message exchange at recovery time.
Section 5 describes a "hot leaf standby" mechanism that combines these two mechanisms. This approach has similarities with the solution described in [RFC7431] for improving failover times when PIM routing is used in a network, given some topology and metric constraints.
The terminology used in this document is the terminology defined in [RFC6513] and [RFC6514].
x-PMSI: I-PMSI or S-PMSI
Section 5.1 of [RFC6513] describes the procedures used by a multicast VPN downstream PE to determine the Upstream Multicast Hop (UMH) for a given (C-S, C-G).
The procedure described here is an OPTIONAL procedure in which a downstream PE takes into account the status of the P-tunnels rooted at each possible upstream PE. Because different PEs could arrive at different conclusions regarding the state of a tunnel, the procedures described in Section 9.1.1 of [RFC6513] MUST be used when inclusive tunnels are used.
For a given downstream PE and a given VRF, the P-tunnel corresponding to a given upstream PE for a given (C-S, C-G) state is the S-PMSI tunnel advertised by that upstream PE for this (C-S, C-G) and imported into that VRF, or if there isn't any such S-PMSI, the I-PMSI tunnel advertised by that PE and imported into that VRF.
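The lookup rule above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory table of x-PMSI A-D routes imported into one VRF, keyed by upstream PE; the names `ImportedPmsi` and `p_tunnel_for`, and the use of plain strings as tunnel identifiers, are illustrative and not defined by this document.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

CSG = Tuple[str, str]  # (C-S, C-G), e.g. ("10.0.0.1", "232.1.1.1")


@dataclass
class ImportedPmsi:
    """x-PMSI A-D routes advertised by one upstream PE and imported into one VRF (hypothetical)."""
    i_pmsi_tunnel: Optional[str] = None                            # tunnel from the I-PMSI A-D route
    s_pmsi_tunnels: Dict[CSG, str] = field(default_factory=dict)  # per-(C-S, C-G) S-PMSI tunnels


def p_tunnel_for(vrf_routes: Dict[str, ImportedPmsi], upstream_pe: str, csg: CSG) -> Optional[str]:
    """Return the P-tunnel of `upstream_pe` for `csg`, as seen from this VRF."""
    routes = vrf_routes.get(upstream_pe)
    if routes is None:
        return None  # nothing imported from this upstream PE
    # An S-PMSI advertised for this exact (C-S, C-G) takes precedence ...
    if csg in routes.s_pmsi_tunnels:
        return routes.s_pmsi_tunnels[csg]
    # ... otherwise fall back to the I-PMSI tunnel, which may itself be absent.
    return routes.i_pmsi_tunnel
```

For example, with `{"PE1": ImportedPmsi("PE1-inclusive", {("10.0.0.1", "232.1.1.1"): "PE1-selective-1"})}`, the flow `("10.0.0.1", "232.1.1.1")` maps to `"PE1-selective-1"`, and any other flow from PE1 maps to `"PE1-inclusive"`.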
There are three options specified in Section 5.1 of [RFC6513] for a downstream PE to select an Upstream PE.
If the resulting candidate set is empty, then the procedure is repeated without considering the P-tunnel status.
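The fallback rule can be sketched as follows. This assumes each candidate's P-tunnel status has already been reduced to a boolean by one of the methods in the following sub-sections, and it uses "highest IP address wins" purely as an illustrative stand-in for the selection options of Section 5.1 of [RFC6513]. The function name `select_umh` is hypothetical.

```python
import ipaddress
from typing import Dict, Optional


def select_umh(candidates: Dict[str, bool]) -> Optional[str]:
    """Pick an Upstream PE from {pe_address: p_tunnel_is_up}.

    Candidates whose P-tunnel is down are filtered out first; if that leaves
    nothing, the selection is repeated over all candidates, ignoring P-tunnel status.
    """
    if not candidates:
        return None
    usable = [pe for pe, tunnel_up in candidates.items() if tunnel_up]
    if not usable:
        usable = list(candidates)  # empty set: retry without P-tunnel status
    # Illustrative tie-breaker only: the highest IP address.
    return max(usable, key=ipaddress.ip_address)
```

The fallback keeps a downstream PE from ending up with no UMH at all when every candidate's P-tunnel looks down, for instance because of a local detection problem.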
Different factors can be considered to determine the "status" of a P-tunnel; they are described in the following sub-sections. The optional procedures proposed in this section also allow different downstream PEs to apply different rules when defining the status of a P-tunnel (see Section 6), and some of these rules produce results that may differ between downstream PEs. Thus, what this section calls the "status" of a P-tunnel is not a characteristic of the tunnel itself, but its status as seen from a particular downstream PE. Additionally, some of the following methods determine the ability of a downstream PE to receive traffic on the P-tunnel rather than the status of the P-tunnel itself. This could be referred to as "P-tunnel reception status", but, for simplicity, the term P-tunnel "status" is used for all of these methods.
Depending on the criteria used to determine the status of a P-tunnel, there may be an interaction with another resiliency mechanism used for the P-tunnel itself, and the UMH update may happen immediately or may need to be delayed. Each case is covered in the corresponding sub-section below.
One condition for considering a P-tunnel up is that the root of the tunnel, as determined from the PMSI Tunnel attribute, is reachable through the unicast routing tables. In this case, the downstream PE can immediately update its UMH when the reachability condition changes.
That is similar to BGP next-hop tracking for VPN routes, except that the address considered is not the BGP next-hop address, but the root address in the PMSI tunnel attribute.
If BGP next-hop tracking is done for VPN routes and the root address of a given tunnel happens to be the same as the next-hop address in the BGP auto-discovery route advertising the tunnel, then using this mechanism for the tunnel will not bring any specific benefit.
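Root-address tracking can be sketched with Python's standard `ipaddress` module. This assumes the unicast routing table is available as a plain list of prefixes, which is a simplification; the function names are hypothetical, and whether a default route should count as "reachable" is a local policy question this sketch does not address.

```python
import ipaddress
from typing import Iterable, Optional


def root_is_reachable(root: str, unicast_prefixes: Iterable[str]) -> bool:
    """True if some unicast route covers the P-tunnel root address."""
    addr = ipaddress.ip_address(root)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so a mixed-family table is handled correctly.
    return any(addr in ipaddress.ip_network(p) for p in unicast_prefixes)


def root_tracking_adds_value(root: str, tracked_bgp_next_hop: Optional[str]) -> bool:
    """Tracking the root brings nothing extra if it equals an already-tracked BGP next hop."""
    if tracked_bgp_next_hop is None:
        return True
    return ipaddress.ip_address(root) != ipaddress.ip_address(tracked_bgp_next_hop)
```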
Another condition for considering a P-tunnel up can be that the last-hop link of the P-tunnel is up.
Using this method when a fast restoration mechanism (such as MPLS FRR [RFC4090]) is in place for the link requires careful consideration and coordination of defect detection intervals for the link and the tunnel. In many cases, it is not practical to use both methods at the same time.
For P-tunnels of type P2MP MPLS-TE, the status of the P-tunnel is considered up if the sub-LSP to this downstream PE is in Up state. The determination of whether a P2MP RSVP-TE LSP is in Up state requires Path and Resv state for the LSP and is based on procedures specified in [RFC4875]. As a result, the downstream PE can immediately update its UMH when the reachability condition changes.
When signaling state for a P2MP TE LSP is removed (e.g., if the ingress of the P2MP TE LSP sends a PathTear message) or the P2MP TE LSP changes state from Up to Down as determined by procedures in [RFC4875], the status of the corresponding P-tunnel SHOULD be re-evaluated. If the P-tunnel transitions from Up to Down state, the upstream PE that is the ingress of the P-tunnel SHOULD NOT be considered a valid UMH.
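A minimal sketch of this re-evaluation, assuming the [RFC4875] sub-LSP state is simplified to two booleans (Path state present, Resv state present). The class and function names are hypothetical, and a real RSVP-TE implementation tracks considerably more state.

```python
from dataclasses import dataclass


@dataclass
class SubLspState:
    """Simplified signaling state held by a downstream PE for its P2MP RSVP-TE sub-LSP."""
    ingress_pe: str
    has_path_state: bool = False
    has_resv_state: bool = False

    @property
    def is_up(self) -> bool:
        # Simplification: the sub-LSP is Up only while both Path and Resv state exist.
        return self.has_path_state and self.has_resv_state

    def on_path_tear(self) -> None:
        # The ingress removed the signaling state for this sub-LSP.
        self.has_path_state = False
        self.has_resv_state = False


def ingress_is_valid_umh(lsp: SubLspState) -> bool:
    """Re-evaluated on every signaling change; a Down sub-LSP disqualifies its ingress."""
    return lsp.is_up
```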
An upstream PE SHOULD be removed from the UMH candidate list for a given (C-S, C-G) if the P-tunnel (I-PMSI or S-PMSI) for this (C-S, C-G) is leaf-triggered (PIM, mLDP) but, for some reason internal to the protocol, the upstream one-hop branch of the tunnel from P to PE cannot be built. In this case, the downstream PE can immediately update its UMH when the reachability condition changes.
In cases where the downstream node can be configured so that the maximum inter-packet time is known for all the multicast flows mapped onto a P-tunnel, the local per-(C-S, C-G) traffic counters for traffic received on this P-tunnel can be used to determine the status of the P-tunnel.
When such a procedure is used in a context where fast restoration mechanisms are used for the P-tunnels, a configurable timer MUST be set on the downstream PE to delay updating the UMH, giving the P-tunnel restoration mechanism time to take effect. It is RECOMMENDED to provide a reasonable default value for this timer. An implementation SHOULD use three seconds as the default value for this timer.
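The counter-based method and its restoration timer can be sketched as below. This assumes the downstream PE samples a per-(C-S, C-G) packet counter at known times and that the configured maximum inter-packet time is accurate; `CounterMonitor` and its methods are hypothetical names, and only the three-second default comes from this section.

```python
from dataclasses import dataclass, field
from typing import Optional

DEFAULT_RESTORATION_WAIT_S = 3.0  # three-second default suggested in this section


@dataclass
class CounterMonitor:
    """Infers P-tunnel status from a per-(C-S, C-G) packet counter (hypothetical helper)."""

    max_inter_packet_s: float  # configured maximum gap between packets of the flow
    restoration_wait_s: float = DEFAULT_RESTORATION_WAIT_S
    fast_restoration_in_use: bool = True
    _last_count: Optional[int] = field(default=None, init=False)
    _last_change_at: Optional[float] = field(default=None, init=False)

    def sample(self, now: float, count: int) -> None:
        """Record a counter reading taken at time `now` (in seconds)."""
        if self._last_count is None or count != self._last_count:
            self._last_count = count
            self._last_change_at = now

    def tunnel_down(self, now: float) -> bool:
        """True once the counter has not moved for longer than the maximum inter-packet time."""
        if self._last_change_at is None:
            return False  # no reading yet: nothing to judge
        return now - self._last_change_at > self.max_inter_packet_s

    def should_update_umh(self, now: float) -> bool:
        """Act only after the extra wait, so that P-tunnel restoration can complete first."""
        if not self.tunnel_down(now):
            return False
        wait = self.restoration_wait_s if self.fast_restoration_in_use else 0.0
        return now - self._last_change_at > self.max_inter_packet_s + wait
```

With a 0.5 s maximum inter-packet time and the default wait, a flow last seen at t = 0.4 s is reported down at t = 1.0 s, but the UMH is only updated after t = 3.9 s, unless the counter moves again in between.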
This method can be applicable, for instance, when a (C-S, C-G) flow is mapped on an S-PMSI.
In cases where this mechanism is used in conjunction with the method described in Section 5, no prior knowledge of the rate of the multicast streams is required; downstream PEs can compare reception on the two P-tunnels to determine when one of them is down.
P-tunnel status MAY be derived from the status of a multipoint BFD session [RFC8562] whose discriminator is advertised along with an x-PMSI A-D route.
This document defines the format and use of a new BGP attribute called the "BFD Discriminator". It is an optional transitive BGP attribute. The format of this attribute is defined as follows:
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   BFD Mode    |                   Reserved                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       BFD Discriminator                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~                         Optional TLVs                         ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Format of the BFD Discriminator Attribute
Where:
The length of a TLV MUST be aligned on a four-octet boundary.
The BFD Discriminator attribute SHALL be considered malformed if its length is not a non-zero multiple of four. If malformed, the UPDATE message SHALL be handled using the approach of "treat-as-withdraw" per [RFC7606].
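A minimal parsing sketch, assuming the input is the attribute value (the octets after the BGP path attribute header) and the fixed layout shown in the figure: a 1-octet BFD Mode, 3 reserved octets, and a 4-octet BFD Discriminator in network byte order. The extra "shorter than 8 octets" check is an assumption drawn from the figure, not a rule stated in this section. Optional TLVs are returned unparsed because their internal layout is not described here, and the BFD Mode value used in any test data is a placeholder, since TBA3 is not yet allocated.

```python
import struct
from dataclasses import dataclass


class MalformedAttribute(ValueError):
    """The UPDATE should then be handled as treat-as-withdraw ([RFC7606])."""


@dataclass
class BfdDiscriminatorAttr:
    bfd_mode: int
    discriminator: int
    optional_tlvs: bytes  # left unparsed in this sketch


def parse_bfd_discriminator(value: bytes) -> BfdDiscriminatorAttr:
    """Parse the BFD Discriminator attribute value (hypothetical helper)."""
    length = len(value)
    if length == 0 or length % 4 != 0:
        raise MalformedAttribute(f"length {length} is not a non-zero multiple of four")
    if length < 8:
        # Assumption: fewer than 8 octets cannot hold the fixed fields of the figure.
        raise MalformedAttribute("too short for BFD Mode, Reserved and BFD Discriminator")
    mode = value[0]                                   # octet 0: BFD Mode
    (disc,) = struct.unpack("!I", value[4:8])         # octets 4-7; octets 1-3 are Reserved
    return BfdDiscriminatorAttr(mode, disc, value[8:])
```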
When it is desired to track the P-tunnel status using a p2mp BFD session, the Upstream PE:
If tracking of the P-tunnel using a p2mp BFD session is enabled after the x-PMSI A-D route has already been advertised, the x-PMSI A-D route MUST be re-sent with exactly the same attributes as before, plus the BFD Discriminator attribute.
If the x-PMSI A-D route is advertised with P-tunnel status tracked using the p2mp BFD session and it is desired to stop tracking P-tunnel status using BFD, then:
Upon receiving the BFD Discriminator attribute in the x-PMSI A-D Route, the Downstream PE:
Once the p2mp BFD session is up, i.e., bfd.SessionState == Up, its state is used to track the health of the P-tunnel.
According to [RFC8562], if the Downstream PE receives Down or AdminDown in the State field of a BFD control packet, or if the Detection Timer associated with the BFD session expires, the BFD session is down, i.e., bfd.SessionState == Down. When the BFD session state is Down, the P-tunnel associated with the BFD session MUST be declared down. As a result, the Downstream PE MAY initiate a switchover of the traffic from the Primary Upstream PE to the Standby Upstream PE, but only if the Standby Upstream PE is deemed available. A separate p2mp BFD session MAY be used to monitor the state of the P-tunnel from the Standby Upstream PE.
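The tail-side rules above can be sketched as follows. The enum values are those of the BFD State field in [RFC5880]; the two functions are hypothetical and model only the transitions named in this section, not the full [RFC8562] state machine.

```python
from enum import Enum
from typing import Optional


class BfdState(Enum):
    """Values of the BFD State field, as defined in [RFC5880]."""
    ADMIN_DOWN = 0
    DOWN = 1
    INIT = 2
    UP = 3


def tail_session_state(current: BfdState, received: Optional[BfdState],
                       detection_timer_expired: bool) -> BfdState:
    """Simplified bfd.SessionState update on the Downstream PE of a p2mp session."""
    if detection_timer_expired or received in (BfdState.DOWN, BfdState.ADMIN_DOWN):
        return BfdState.DOWN
    if received is BfdState.UP:
        return BfdState.UP
    return current  # no packet, or a state this sketch does not act on


def may_switch_to_standby(session_state: BfdState, standby_available: bool) -> bool:
    """A switchover is permitted only when the primary P-tunnel is declared down
    and the Standby Upstream PE is deemed available."""
    return session_state is BfdState.DOWN and standby_available
```

`may_switch_to_standby` returns permission rather than forcing a switch, mirroring the MAY in the text.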
If the Downstream PE's P-tunnel is already up when the Downstream PE receives the new x-PMSI A-D Route with BFD Discriminator attribute, the Downstream PE MUST accept the x-PMSI A-D Route and associate the value of BFD Discriminator field with the P-tunnel. The Upstream PE MUST follow procedures listed above in this section to bring the p2mp BFD session up and use it to monitor the state of the associated P-tunnel.
If the Downstream PE's P-tunnel is already up, its state being monitored by the p2mp BFD session, and the Downstream PE receives the new x-PMSI A-D Route without the BFD Discriminator attribute, the Downstream PE:
The following approach is defined in response to the detection by the upstream PE of a PE-CE link failure. Even though the provider tunnel is still up, it is desirable for the downstream PEs to switch to a backup upstream PE. To achieve that, if the upstream PE detects that its PE-CE link has failed, it SHOULD set the bfd.LocalDiag of the p2mp BFD session to Concatenated Path Down and/or Reverse Concatenated Path Down (per Section 6.8.17 of [RFC5880]), unless it switches to a new PE-CE link within the time of bfd.DesiredMinTxInterval for the p2mp BFD session (in which case the upstream PE starts tracking the status of the new PE-CE link). When a downstream PE receives that bfd.LocalDiag code, it treats it as if the tunnel itself had failed and tries to switch to a backup PE.
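Both sides of this behavior can be sketched with the Diag codes of [RFC5880] (6 = Concatenated Path Down, 8 = Reverse Concatenated Path Down). The function names and event timestamps are hypothetical; the sketch picks code 6 on the upstream side, and it uses seconds for the interval, whereas bfd.DesiredMinTxInterval is expressed in microseconds in [RFC5880].

```python
from typing import Optional

# Diagnostic (Diag) codes defined in [RFC5880]
DIAG_NO_DIAGNOSTIC = 0
DIAG_CONCATENATED_PATH_DOWN = 6
DIAG_REVERSE_CONCATENATED_PATH_DOWN = 8


def upstream_local_diag(now: float, pe_ce_failed_at: Optional[float],
                        new_link_at: Optional[float], desired_min_tx_s: float) -> int:
    """bfd.LocalDiag the upstream PE places in its p2mp BFD control packets."""
    if pe_ce_failed_at is None:
        return DIAG_NO_DIAGNOSTIC
    if new_link_at is not None and new_link_at - pe_ce_failed_at <= desired_min_tx_s:
        return DIAG_NO_DIAGNOSTIC  # switched in time; now tracking the new PE-CE link
    if now - pe_ce_failed_at <= desired_min_tx_s:
        return DIAG_NO_DIAGNOSTIC  # still within the window allowed for a switch
    return DIAG_CONCATENATED_PATH_DOWN


def downstream_sees_tunnel_failure(remote_diag: int) -> bool:
    """Either code is handled as if the P-tunnel itself had failed."""
    return remote_diag in (DIAG_CONCATENATED_PATH_DOWN, DIAG_REVERSE_CONCATENATED_PATH_DOWN)
```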
The procedures described below are limited to the case where the site that contains C-S is connected to two or more PEs; to simplify the description, the dual-homing case is described. The procedures require all the PEs of that MVPN to follow the UMH selection, as specified in [RFC6513], whether the PE is selected based on its IP address, the hashing algorithm described in Section 5.1.3 of [RFC6513], or the Installed UMH Route. The procedures assume that if a site of a given MVPN that contains C-S is dual-homed to two PEs, then all the other sites of that MVPN have two unicast VPN routes (VPN-IPv4 or VPN-IPv6) to C-S, each with its own RD.
As long as C-S is reachable via both PEs, a given downstream PE will select one of the PEs connected to C-S as its Upstream PE for C-S (the "Primary Upstream PE"). We will refer to the other PE connected to C-S as the "Standby Upstream PE". Note that if connectivity to C-S through the Primary Upstream PE becomes unavailable, the PE will select the Standby Upstream PE as its Upstream PE for C-S. When the Primary Upstream PE later becomes available again, the PE will select it again as its Upstream PE. Such behavior is referred to as "revertive" behavior and MUST be supported. Non-revertive behavior refers to continuing to select the Standby Upstream PE as the UMH even after the Primary Upstream PE has come back up. Non-revertive behavior MAY also be supported by an implementation and would be enabled through configuration.
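The difference between revertive and non-revertive behavior can be captured in one selection step. This is a sketch assuming reachability of C-S through each PE is already known as a boolean; `next_upstream` and its parameters are hypothetical names.

```python
from typing import Optional


def next_upstream(current: Optional[str], primary: str, standby: str,
                  primary_reachable: bool, standby_reachable: bool,
                  revertive: bool = True) -> Optional[str]:
    """Upstream PE for C-S after a reachability change (revertive by default)."""
    if not primary_reachable and not standby_reachable:
        return None
    if not primary_reachable:
        return standby
    if not standby_reachable:
        return primary
    # Both reachable: non-revertive mode stays on the standby once it was selected.
    if not revertive and current == standby:
        return standby
    return primary
```

Revertive selection is the default because it is the behavior that MUST be supported; the non-revertive path is only taken when explicitly configured.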
For readability, the following sub-sections describe the procedures for BGP C-multicast Source Tree Join routes, but they apply equally to the failover of BGP C-multicast Shared Tree Join routes when the customer RP is dual-homed (substitute "C-RP" for "C-S").
When a (downstream) PE connected to some site of an MVPN needs to send a C-multicast route (C-S, C-G), then, following the procedures specified in Section "Originating C-multicast routes by a PE" of [RFC6514], the PE sends the C-multicast route with an RT that identifies the Upstream PE selected by the PE originating the route. As long as C-S is reachable via the Primary Upstream PE, the Upstream PE is the Primary Upstream PE. If C-S is reachable only via the Standby Upstream PE, then the Upstream PE is the Standby Upstream PE.
If C-S is reachable via both the Primary and the Standby Upstream PE, then, in addition to sending the C-multicast route with an RT that identifies the Primary Upstream PE, the PE also originates and sends a C-multicast route with an RT that identifies the Standby Upstream PE. This route, which has the semantics of a 'standby' C-multicast route, is hereafter called a "Standby BGP C-multicast route" and is constructed as follows:
The normal and the standby C-multicast routes MUST have their Local Preference attribute adjusted so that, if a BGP peer receives two C-multicast routes with the same NLRI, one carrying the "Standby PE" community and the other not, preference is given to the one not carrying the "Standby PE" community. Such a situation can happen when, for instance, due to transient unicast routing inconsistencies or lack of support for the Standby PE community, two different downstream PEs consider different upstream PEs to be the primary one; in that case, without any precaution, both upstream PEs would process a standby C-multicast route and possibly stop forwarding at the same time. For this purpose, routes that carry the "Standby PE" BGP Community MUST have the LOCAL_PREF attribute set to zero.
Note that when a PE advertises such a Standby C-multicast join for a (C-S, C-G), it MUST join the corresponding P-tunnel.
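Origination of the normal and Standby C-multicast routes, and the LOCAL_PREF-based tie-break, can be sketched as follows. Several parts are assumptions: `STANDBY_PE` is a placeholder string because the community value (TBA1) is not yet allocated, the NLRI is treated as an opaque tuple built from the unicast VPN route to C-S (so the primary and standby routes may differ), and an unset LOCAL_PREF is taken to mean 100, a common implementation default rather than something this document specifies. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

STANDBY_PE = "STANDBY_PE"         # placeholder for the not-yet-allocated community (TBA1)
ASSUMED_DEFAULT_LOCAL_PREF = 100  # assumption: value used when LOCAL_PREF is left unset

Nlri = Tuple[str, ...]  # opaque, e.g. (RD, Source AS, C-S, C-G)


@dataclass(frozen=True)
class CMcastRoute:
    nlri: Nlri
    route_target: str  # identifies the targeted upstream PE
    communities: FrozenSet[str] = frozenset()
    local_pref: Optional[int] = None

    @property
    def is_standby(self) -> bool:
        return STANDBY_PE in self.communities


def originate(primary_nlri: Nlri, standby_nlri: Nlri, primary_rt: str, standby_rt: str,
              via_primary: bool, via_standby: bool) -> List[CMcastRoute]:
    """C-multicast routes a downstream PE sends for one (C-S, C-G)."""
    if via_primary and via_standby:
        return [
            CMcastRoute(primary_nlri, primary_rt),
            # Standby route: carries the Standby PE community and LOCAL_PREF zero.
            CMcastRoute(standby_nlri, standby_rt, frozenset({STANDBY_PE}), local_pref=0),
        ]
    if via_primary:
        return [CMcastRoute(primary_nlri, primary_rt)]
    if via_standby:
        return [CMcastRoute(standby_nlri, standby_rt)]
    return []


def _effective_local_pref(route: CMcastRoute) -> int:
    return ASSUMED_DEFAULT_LOCAL_PREF if route.local_pref is None else route.local_pref


def preferred(a: CMcastRoute, b: CMcastRoute) -> CMcastRoute:
    """Between two routes with the same NLRI, the higher LOCAL_PREF wins (ties return `a`)."""
    if a.nlri != b.nlri:
        raise ValueError("only routes with the same NLRI are compared")
    return a if _effective_local_pref(a) >= _effective_local_pref(b) else b
```

Because a standby route always carries LOCAL_PREF zero, any route for the same NLRI without the community wins, which is the property the text requires.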
If at some later point the local PE determines that C-S is no longer reachable through the Primary Upstream PE, the Standby Upstream PE becomes the Upstream PE, and the local PE re-sends the C-multicast route with an RT that identifies the Standby Upstream PE, except that now the route does not carry the Standby PE BGP Community (which results in replacing the old route with a new route, the only difference between the two being the presence or absence of the Standby PE BGP Community). Also, the LOCAL_PREF attribute MUST be set to zero.
When a PE receives a C-multicast route for a particular (C-S, C-G), and the RT carried in the route results in importing the route into a particular VRF on the PE, if the route carries the Standby PE BGP Community, then the PE performs as follows:
Furthermore, irrespective of whether C-S carried in that route is reachable through some other PE:
Doing neither (a) nor (b) for a given (C-S, C-G) is called "cold root standby".
Doing (a) but not (b) for a given (C-S, C-G) is called "warm root standby".
Doing (b) (which implies also doing (a)) for a given (C-S, C-G) is called "hot root standby".
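The three names can be captured as a small classification. Here, `does_a` and `does_b` stand for the actions (a) and (b) that the text refers to; the sketch encodes only the naming and the stated rule that (b) implies (a). The function name is hypothetical.

```python
def root_standby_mode(does_a: bool, does_b: bool) -> str:
    """Name of the protection level a Standby Upstream PE applies for one (C-S, C-G)."""
    if does_b and not does_a:
        raise ValueError("doing (b) implies also doing (a)")
    if does_b:
        return "hot root standby"
    if does_a:
        return "warm root standby"
    return "cold root standby"
```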
Note that if an upstream PE uses an S-PMSI-only policy, it shall advertise an S-PMSI for a (C-S, C-G) as soon as it receives a C-multicast route for (C-S, C-G), normal or Standby; i.e., it shall not wait for a non-Standby C-multicast route before advertising the corresponding S-PMSI.
Section 9.3.2 of [RFC6514] describes the procedures for sending a Source Active A-D route as a result of receiving a C-multicast route. These procedures should be followed for both the normal and the Standby C-multicast routes.
The standby PE can use the following information to determine that C-S can or cannot be reached through the primary PE:
If the non-segmented inter-AS approach is used, the procedures in Section 4 can be applied.
When multicast VPNs are used in an inter-AS context with the segmented inter-AS approach described in Section 8.2 of [RFC6514], the procedures in this section can be applied.
A pre-requisite for the procedures described below to be applied for a source of a given MVPN is:
For example, these conditions are satisfied when the source is dual-homed to an AS that connects to the receiver AS through two ASBRs using auto-configured RDs.
The following procedure is applied by downstream PEs of an AS, for a source S in a remote AS.
In addition to choosing an Inter-AS I-PMSI auto-discovery route advertised from the AS of the source to construct a C-multicast route, as described in Section 11.1.3, a downstream PE will choose a second Inter-AS I-PMSI auto-discovery route advertised from the AS of the source and use this route to construct and advertise a Standby C-multicast route (a C-multicast route carrying the Standby PE BGP Community) as described in Section 4.1.
When an upstream ASBR receives a C-multicast route, and at least one of the RTs of the route matches one of the ASBR's Import RTs, an ASBR that supports this specification MUST locate an Inter-AS I-PMSI A-D route whose RD and Source AS respectively match the RD and Source AS carried in the C-multicast route. If a match is found and the C-multicast route carries the Standby PE BGP Community, then the ASBR MUST perform as follows:
Other ASBR procedures are applied without modification.
The mechanisms defined in Section 4 and Section 3 can be used together as follows.
The principle is that, for a given VRF (or possibly only for a given C-S,C-G):
Other combinations of the mechanisms proposed in Section 4 and Section 3 are for further study.
Note that the same level of protection would be achievable with a simple C-multicast Source Tree Join route advertised to both the primary and secondary upstream PEs (carrying, as Route Target extended communities, the values of the VRF Route Import attribute of each VPN route from each upstream PE). The advantage of using the Standby semantics is that, provided downstream PEs always advertise a Standby C-multicast route to the secondary upstream PE, the protection level can be chosen through a configuration change on the secondary upstream PE alone, without requiring reconfiguration of any of the downstream PEs.
Multicast VPN specifications require that a PE forward to CEs only the packets coming from the expected upstream PE (Section 9.1 of [RFC6513]).
We draw the reader's attention to the fact that complying with this part of the multicast VPN specifications is especially important when two distinct upstream PEs may forward the same traffic on P-tunnels at the same time in the steady state. That will be the case when "hot root standby" mode is used (Section 4), and it can also be the case if the procedures of Section 3 are used and (a) the rules determining the status of a tree are not the same on two distinct downstream PEs, or (b) the rule determining the status of a tree depends on conditions local to a PE (e.g., the PE-P upstream link being up).
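The forwarding check reduces to comparing the PE a packet arrived from with the Upstream PE selected for its (C-S, C-G). This sketch assumes the downstream PE can already attribute each packet to an upstream PE; how that is done depends on the P-tunnel type and is outside its scope. `forward_to_ce` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

CSG = Tuple[str, str]  # (C-S, C-G)


def forward_to_ce(csg: CSG, arrived_from_pe: str, selected_umh: Dict[CSG, str]) -> bool:
    """Forward only packets that come from the Upstream PE selected for this (C-S, C-G)."""
    expected: Optional[str] = selected_umh.get(csg)
    return expected is not None and arrived_from_pe == expected
```

In "hot root standby" mode, packets for the same flow arrive from both upstream PEs, and this check is what discards the standby copy.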
IANA is requested to allocate the BGP "Standby PE" community value (TBA1) from the Border Gateway Protocol (BGP) Well-known Communities registry.
This document defines a new BGP optional transitive attribute, called "BFD Discriminator". IANA is requested to allocate a codepoint (TBA2) in the "BGP Path Attributes" registry to the BFD Discriminator attribute.
IANA is requested to create a new BFD Mode sub-registry in the Border Gateway Protocol (BGP) Parameters registry as described in Table 1.
Range | Registration Procedures | Note |
---|---|---|
0-249 | Standards Action | |
250-253 | Specification Required | Experimental |
254 | Private Use | |
255 | Standards Action | |
IANA is requested to allocate the following values from the BFD Mode sub-registry as defined in Table 2.
Value | Description | Reference |
---|---|---|
0 | Reserved | This document |
TBA3 | P2MP BFD Session | This document |
255 | Reserved | This document |
IANA is requested to create a new BFD Discriminator Extension Type sub-registry in the Border Gateway Protocol (BGP) Parameters registry as described in Table 3.
Value | Description | Registration Procedures |
---|---|---|
0 | Reserved | |
1-191 | Unassigned | IETF Review |
192-251 | Unassigned | First Come First Served |
252-254 | Unassigned | Private Use |
255 | Reserved | |
This document describes procedures based on [RFC6513] and [RFC6514] and hence shares the security considerations described in those specifications.
This document makes use of BFD, as defined in [RFC8562], which, in turn, is based on [RFC5880]. Security considerations relevant to each protocol are discussed in the respective protocol specifications.
The authors want to thank Greg Reaume, Eric Rosen, Jeffrey Zhang, and Zheng (Sandy) Zhang for their reviews, useful comments, and helpful suggestions.
Below is a list of other contributing authors in alphabetical order:
Rahul Aggarwal, Arktan, Email: raggarwa_1@yahoo.com
Nehal Bhau, Cisco, Email: NBhau@cisco.com
Clayton Hassen, Bell Canada, 2955 Virtual Way, Vancouver, CANADA, Email: Clayton.Hassen@bell.ca
Wim Henderickx, Nokia, Copernicuslaan 50, Antwerp 2018, Belgium, Email: wim.henderickx@nokia.com
Pradeep Jain, Nokia, 701 E Middlefield Rd, Mountain View, CA 94043, USA, Email: pradeep.jain@nokia.com
Jayant Kotalwar, Nokia, 701 E Middlefield Rd, Mountain View, CA 94043, USA, Email: Jayant.Kotalwar@nokia.com
Praveen Muley, Nokia, 701 East Middlefield Rd, Mountain View, CA 94043, U.S.A., Email: praveen.muley@nokia.com
Ray (Lei) Qiu, Juniper Networks, 1194 North Mathilda Ave., Sunnyvale, CA 94089, U.S.A., Email: rqiu@juniper.net
Yakov Rekhter, Juniper Networks, 1194 North Mathilda Ave., Sunnyvale, CA 94089, U.S.A., Email: yakov@juniper.net
Kanwar Singh, Nokia, 701 E Middlefield Rd, Mountain View, CA 94043, USA, Email: kanwar.singh@nokia.com
[RFC4090] Pan, P., Swallow, G. and A. Atlas, "Fast Reroute Extensions to RSVP-TE for LSP Tunnels", RFC 4090, DOI 10.17487/RFC4090, May 2005.
[RFC7431] Karan, A., Filsfils, C., Wijnands, IJ. and B. Decraene, "Multicast-Only Fast Reroute", RFC 7431, DOI 10.17487/RFC7431, August 2015.