Network Working Group | M. Bhatia |
Internet-Draft | Alcatel-Lucent |
Intended status: Standards Track | M. Chen |
Expires: June 16, 2012 | Z. Wang |
| Huawei Technologies Co., Ltd |
| L. Guo |
| China Telecom |
| M. Binderberger |
| December 16, 2011 |
Bidirectional Forwarding Detection (BFD) on Link Aggregation Group (LAG) Interfaces
draft-mmm-bfd-on-lags-01
This document proposes a mechanism to run BFD on Link Aggregation Group (LAG) interfaces. It does so by running an independent BFD session on every LAG member link.
A dedicated well-known multicast IP address for both IPv4 and IPv6 is introduced as the destination IP address of the BFD packets when running BFD on the member links of the LAG.
There is currently also no standard that describes how BFD runs on a LAG interface as a whole. This draft proposes a definition for this problem too while taking into consideration existing implementations.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on June 16, 2012.
Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
The Bidirectional Forwarding Detection (BFD) protocol [RFC5880] provides a mechanism to detect faults in the bidirectional path between two forwarding engines, including interfaces, data link(s), and to the extent possible the forwarding engines themselves, with potentially very low latency.
BFD can be used for detecting failures of the path between two network devices. Typically the application clients, being layer 3 applications such as Open Shortest Path First (OSPF) [RFC2328] or Border Gateway Protocol (BGP) [RFC4271], are not aware of any inner structure of the underlying interface. While this works for interfaces like Ethernet and Packet Over SONET (POS), it causes problems for bundled interfaces like LAG.
A LAG is used to bind together several physical ports between two adjacent nodes so they appear to higher-layer protocols as a single, higher bandwidth "virtual" pipe. A LAG interface thereby allows aggregation of multiple network interfaces as one virtual interface for the purpose of providing fault-tolerance and higher bandwidth.
The problem with running BFD over a LAG is that with a single BFD session and without internal knowledge of the LAG structure it is impossible for BFD to guarantee detection of anything but a full LAG shutdown within the BFD timeout period. The LAG shutdown is typically initiated by some LAG module, which we will refer to as the LAG Management Module (LMM) in the rest of the document. LAG timers are typically several times slower than the BFD detection timers (hundreds of milliseconds for the LMM vs. tens of milliseconds for BFD). There is thus a need to bring determinism to how BFD runs over a LAG. There is also a need to detect member link failures much faster than the Link Aggregation Control Protocol (LACP) allows.
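To make the timer gap concrete, the following minimal sketch computes the asynchronous-mode Detection Time as defined in RFC 5880, Section 6.8.4: the Detect Mult received from the remote system multiplied by the greater of the local Required Min RX Interval and the remote Desired Min TX Interval. The function name and the 10 ms intervals are illustrative assumptions, not values taken from this document. The LACP figure is the 3-second short timeout defined by IEEE 802.1AX.

```python
def bfd_detection_time_ms(remote_detect_mult: int,
                          local_required_min_rx_ms: int,
                          remote_desired_min_tx_ms: int) -> int:
    """Asynchronous-mode Detection Time (RFC 5880, Section 6.8.4)."""
    agreed_tx_interval = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * agreed_tx_interval


# Hypothetical micro session timers: 10 ms intervals, Detect Mult of 3.
bfd_ms = bfd_detection_time_ms(3, 10, 10)

# LACP short timeout (IEEE 802.1AX Short_Timeout_Time), in milliseconds.
LACP_SHORT_TIMEOUT_MS = 3000

print(f"BFD detects a failure in {bfd_ms} ms, "
      f"LACP short timeout is {LACP_SHORT_TIMEOUT_MS} ms")
```

With these example values BFD declares the session down after 30 ms, two orders of magnitude sooner than the LACP short timeout.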
The document proposes establishing a BFD session over every member link the LAG is built upon. BFD can combine this information to provide fast detection for layer-3 applications.
While there are native Ethernet mechanisms to detect failures (IEEE 802.1AX, IEEE 802.3ah) that could be used for LAG, the solution proposed in this document enables operators who have already deployed BFD over different technologies (e.g. IP, MPLS) to use a common failure detection mechanism.
The simplest approach to run BFD on a LAG interface is to ignore the internal structure and treat the LAG as one "big pipe". To differentiate this mode of operation we call it "BFD over Big Pipe" or "BBP" for short. It corresponds to section 7.1 in RFC 5882 [RFC5882].
We need to standardize this approach. The following requirements define what it means to treat a LAG interface as a single interface with no additional structure:
The BFD session on the LAG interface then follows RFC 5880 and RFC 5881 in all details.
Because there is no standard, vendors have implemented their own proprietary mechanisms to run BFD over LAG interfaces. Two examples are shown here. Both satisfy the requirements in Section 2.1.
Some implementations send BFD packets over only one member link. Others spray BFD packets over all member links of the LAG. There are issues with each of these approaches.
In the first approach, BFD sends packets onto the LAG and the LAG load-balancing algorithm selects a member port, which may be the same port for all the packets of this BFD session. BFD remains up as long as this "primary" port is alive. It goes down when the primary port goes down and stays down until another port is selected as the primary. Problems arise with this design because BFD is oblivious to the presence of other member links in the LAG. If a non-primary link goes down, the BFD session remains unaffected as it can still send and receive BFD packets over the primary link. As a result, all traffic sent over the failed member link is dropped until the LMM removes the failed link from the LAG.
In the second approach, BFD packets are sprayed over all the member links of a LAG. This is done naively via round-robin, where each BFD packet is sent on the next member link in turn. It solves the problem of BFD going down because of the primary port going down, but it still does not solve the problem of traffic getting lost when one of the member links goes down. This is because, when a member link goes down, BFD remains up and traffic continues to go over the failed link until a higher-layer protocol detects this and removes the offending link from the LAG.
It is RECOMMENDED to implement this second approach due to its deterministic behaviour.
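The limitation shared by both existing approaches can be illustrated with a small simulation. This is a simplified sketch under stated assumptions: the member link names are hypothetical, a session is modelled as going down only after Detect Mult consecutive BFD packets are lost, and the function names are not taken from any implementation.

```python
from itertools import cycle, islice


def single_link_session_up(link_up: dict, primary: str) -> bool:
    """First approach: every BFD packet is sent over the same 'primary' link."""
    return link_up[primary]


def round_robin_session_up(link_up: dict, detect_mult: int, packets: int = 100) -> bool:
    """Second approach: packets are sent over the member links in turn.

    Simplified model: the session drops once detect_mult consecutive
    packets are lost.
    """
    missed = 0
    for link in islice(cycle(sorted(link_up)), packets):
        missed = 0 if link_up[link] else missed + 1
        if missed >= detect_mult:
            return False
    return True


# Hypothetical four-member LAG in which one non-primary member has failed.
lag = {"member0": True, "member1": False, "member2": True, "member3": True}

print(single_link_session_up(lag, primary="member0"))  # session stays up
print(round_robin_session_up(lag, detect_mult=3))       # session stays up
```

In both cases the BFD session stays up although member1 has failed, so traffic hashed onto member1 is lost until the LMM reacts.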
The mechanism proposed for a fast detection of LAG member link failure is running BFD sessions on every LAG member link. We name this mode of BFD operations "BFD on LAG members" or "BLM" for short. It corresponds to section 7.3 in RFC 5882 [RFC5882].
The overall BLM session consists of the LAG interface, i.e. the aggregated link, a set of BFD sessions running on the member links and a new BFD state for the LAG; this state is explained in more detail in Section 3.3. We name the member-link sessions also micro BFD sessions; their details are discussed in Section 3.2.
The set of micro sessions is such that we have one micro session per member link. This set can change over the lifetime of a BLM session. For example, BFD receives updates for the micro session set when links are physically added to or removed from the LAG and will accordingly create or delete micro BFD sessions.
The details of how the update happens are implementation specific and outside the document's scope. For example, the client requesting the BLM session could provide these updates.
Per BLM session request only one address family can be used. I.e. the set of micro sessions belonging to the BLM session MUST either all use IPv4 or all use IPv6.
Multiple BLM session requests for the same LAG interface result in a shared BLM session. The set of micro sessions finally used is the union of the individual micro session sets. If conflicting session parameters are requested then it is a local issue as to how to resolve the parameter conflicts, as explained in RFC 5882, Section 2.
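The bookkeeping for a shared BLM session described above can be sketched as follows. The class and method names are hypothetical. Rejecting a request that uses a different address family is an assumption of this sketch; the document only requires that all micro sessions of a BLM session use the same address family.

```python
from dataclasses import dataclass, field


@dataclass
class BlmSession:
    """Shared BLM session for one LAG interface (illustrative only)."""
    lag: str
    address_family: str                       # "ipv4" or "ipv6"
    member_links: set = field(default_factory=set)

    def merge_request(self, address_family: str, member_links) -> None:
        """Fold another client's request into the shared session."""
        if address_family != self.address_family:
            # Assumption: a mismatching request is refused outright.
            raise ValueError("all micro sessions must use one address family")
        # One micro session per member link; duplicates collapse in the set.
        self.member_links |= set(member_links)

    def remove_member(self, link: str) -> None:
        """Called when a link is physically removed from the LAG."""
        self.member_links.discard(link)


session = BlmSession(lag="lag1", address_family="ipv4")
session.merge_request("ipv4", {"member0", "member1"})
session.merge_request("ipv4", {"member1", "member2"})
print(sorted(session.member_links))
```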
A single micro BFD session runs on every member link of the LAG. These micro BFD sessions follow RFC 5880 [RFC5880].
Only asynchronous mode is considered in this document. The echo function is outside the document's scope. At least one system MUST take the Active role (possibly both). The micro BFD sessions on the member links are independent BFD sessions. They use their own unique, local discriminator values, maintain their own set of state variables and have their own independent state machine. Timer values MAY be different, even among the micro sessions belonging to the same LAG, although it is expected that micro sessions belonging to the same LAG use the same timer values.
The demultiplexing of a received packet is solely based on the Your Discriminator field, if this field is nonzero. For the initial Down packet of a micro session this value may be zero. In this case demultiplexing MUST be based on some combination of other fields which MUST include the interface information of the member link.
When a BFD packet for a micro session is received with a valid, non-zero Your Discriminator, the receiver MUST check that the packet was received on the correct member link interface. If the check fails then the packet MUST be discarded. This test needs to be done before state variables for the micro session are updated by the received packet.
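The demultiplexing rules above can be sketched as a lookup function. All names are hypothetical, and the fallback for a zero Your Discriminator uses only the receiving member link, which is the minimum the text requires; a real implementation may combine it with further fields.

```python
from dataclasses import dataclass


@dataclass
class MicroSession:
    member_link: str
    local_discr: int


class Demux:
    """Maps a received BFD packet to a micro session (illustrative only)."""

    def __init__(self, sessions):
        self.by_discr = {s.local_discr: s for s in sessions}
        self.by_link = {s.member_link: s for s in sessions}

    def lookup(self, rx_link: str, your_discr: int):
        """Return the matching session, or None if the packet must be discarded."""
        if your_discr != 0:
            session = self.by_discr.get(your_discr)
            # Discard unknown discriminators and packets that arrived on the
            # wrong member link, before any session state is touched.
            if session is None or session.member_link != rx_link:
                return None
            return session
        # Your Discriminator is zero: fall back to the receiving member link.
        return self.by_link.get(rx_link)


demux = Demux([MicroSession("member0", 101), MicroSession("member1", 102)])
print(demux.lookup("member0", 101))  # matches the session on member0
print(demux.lookup("member1", 101))  # wrong member link, discarded
```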
The BFD packets for the micro session are IP/UDP encapsulated as defined in RFC 5881 [RFC5881]. Control packets for each micro BFD session use a well-known link-local multicast IP address (224.0.0.X for IPv4, FF02::X for IPv6, to be assigned by IANA).
On Ethernet-based LAG member links the corresponding destination multicast MACs will be 01:00:5e:00:00:XX for IPv4 and 33:33:00:00:00:XX for IPv6. Each member link uses its own MAC address as the source MAC address.
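The destination MAC addresses follow from the standard mapping of IP multicast addresses onto Ethernet: 01:00:5e plus the low-order 23 bits of the IPv4 group address, and 33:33 plus the low-order 32 bits of the IPv6 group address. The sketch below shows that mapping. Because the BFD addresses are not yet assigned, the example uses the already-assigned AllSPFRouters addresses (224.0.0.5 and ff02::5) purely as stand-ins; the function name is hypothetical.

```python
import ipaddress


def multicast_mac(address: str) -> str:
    """Map an IPv4 or IPv6 multicast address to its Ethernet destination MAC."""
    ip = ipaddress.ip_address(address)
    if not ip.is_multicast:
        raise ValueError(f"{address} is not a multicast address")
    if ip.version == 4:
        low = int(ip) & 0x7FFFFF              # low-order 23 bits
        octets = [0x01, 0x00, 0x5E,
                  (low >> 16) & 0xFF, (low >> 8) & 0xFF, low & 0xFF]
    else:
        low = int(ip) & 0xFFFFFFFF            # low-order 32 bits
        octets = [0x33, 0x33,
                  (low >> 24) & 0xFF, (low >> 16) & 0xFF,
                  (low >> 8) & 0xFF, low & 0xFF]
    return ":".join(f"{octet:02x}" for octet in octets)


# Stand-in addresses; the BFD group addresses are still to be assigned by IANA.
print(multicast_mac("224.0.0.5"))
print(multicast_mac("ff02::5"))
```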
An additional state variable is introduced for BFD on LAG members: the concluded state. The state values are Down, Up and AdminDown. This state is not part of the micro session state machine. Instead it describes the overall state of the LAG. It is a local state and does not appear (directly) in any BFD packet on any link.
The concluded state may be set to AdminDown for administrative purposes, to keep the BLM and the micro sessions indefinitely down. When the concluded state enters AdminDown, all micro sessions belonging to the BLM MUST enter the AdminDown state as well.
A function must be defined that evaluates the states of all micro sessions belonging to the BLM. This function has two output values, Down and Up, and the concluded state is updated with the latest evaluation result unless it is already in the AdminDown state. The evaluation takes place whenever a micro session is added, removed, or changes state.
The details of the evaluation function are outside the scope of the document. The function could for example test for a minimum number of micro sessions in Up state. The function could even be "outsourced" and e.g. the decision logic of the LMM module could be used.
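One possible evaluation function, testing for a minimum number of micro sessions in Up state as suggested above, could look like the following sketch. The threshold parameter `min_up` and all names are assumptions of this example; the document leaves the function itself out of scope.

```python
from enum import Enum


class State(Enum):
    ADMIN_DOWN = "AdminDown"
    DOWN = "Down"
    INIT = "Init"
    UP = "Up"


def evaluate(micro_states, min_up: int = 1) -> State:
    """Example policy: Up if at least min_up micro sessions are Up."""
    up_count = sum(1 for state in micro_states if state is State.UP)
    return State.UP if up_count >= min_up else State.DOWN


def update_concluded(concluded: State, micro_states, min_up: int = 1) -> State:
    """Re-evaluate whenever a micro session is added, removed, or changes state."""
    if concluded is State.ADMIN_DOWN:
        return State.ADMIN_DOWN            # AdminDown is left only by the operator
    return evaluate(micro_states, min_up)


micro = [State.UP, State.DOWN, State.UP, State.INIT]
print(update_concluded(State.DOWN, micro, min_up=2).value)
print(update_concluded(State.UP, micro, min_up=3).value)
```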
The concluded state is important for layer-3 clients requesting BFD sessions over the LAG or over VLANs on the LAG. Details will be discussed in Section 4.
Layer 3 protocols such as OSPF may use BFD on LAG members in one of the following ways:
This state update allows the BBP session to run with more relaxed timer values, as the more intensive liveness detection is done by the micro BFD sessions.
An implementation MUST provide a configuration knob which lets the user select the mode if both modes are supported.
There are certainly many ways to use BLM. Here is one example envisioned by the authors.
The LAG Management Module (LMM) could be envisaged as a client of BFD, requesting micro BFD sessions for all member links of the LAG. That is, the LMM requests a BLM session from BFD and takes responsibility for updating the set of micro sessions when interfaces are administratively added or removed.
LMM uses BFD, instead of or in parallel with LACP, to monitor the health of the individual member links of the LAG. BFD takes precedence over LACP in deciding the fate of individual member links when both are run in parallel.
A member link of the LAG is not used anymore for data forwarding when the associated micro BFD session running over that link goes down. The member link MUST be removed from the LAG. The BFD session for the link remains, i.e. it is not deleted.
To add a member link to the LAG, LMM MAY wait for the BFD session on the link to come Up. This may lead to a deadlock: if the link interface is not active (e.g., layer 3 protocol down), this may prevent BFD packets, as well as other control protocol packets (e.g. ARP) that are tightly coupled with the status of the interface, from being transmitted between the pair of interfaces, so that the interfaces never come up.
To avoid the deadlock, BFD packets SHOULD NOT be blocked by the layer N protocol status of the interface when the application depends on the BFD status to enable layer N of the interface. If this cannot be achieved then the BFD status MUST be ignored by the application when bringing up an interface. The BFD status can then be used afterwards to bring the interface down.
If waiting for a BFD status of Up before adding a member link is supported, the behaviour of the LMM MUST be configurable, so that, for interoperability purposes, a member link can alternatively be added irrespective of the BFD state.
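The LMM behaviour in this example use case can be sketched as a small event handler. All names are hypothetical, states are plain strings for brevity, and one point is an assumption of this sketch rather than a statement of the document: a link removed after its micro session went down is re-added once that session returns to Up.

```python
class Lmm:
    """Illustrative LAG Management Module driven by micro BFD session state."""

    def __init__(self, wait_for_bfd_up: bool = True):
        self.wait_for_bfd_up = wait_for_bfd_up   # the configurable behaviour
        self.active = set()                      # links used for forwarding
        self.pending = set()                     # links waiting for BFD Up
        self.bfd_state = {}                      # last known micro session state

    def request_add(self, link: str) -> None:
        """Administrative request to add a member link to the LAG."""
        if self.wait_for_bfd_up and self.bfd_state.get(link) != "Up":
            self.pending.add(link)
        else:
            self.active.add(link)                # added irrespective of BFD state

    def on_bfd_state_change(self, link: str, new_state: str) -> None:
        old_state = self.bfd_state.get(link)
        self.bfd_state[link] = new_state
        if new_state == "Up" and link in self.pending:
            self.pending.remove(link)
            self.active.add(link)
        elif old_state == "Up" and new_state == "Down" and link in self.active:
            # The link leaves the LAG; its micro session is kept, not deleted.
            self.active.remove(link)
            self.pending.add(link)               # assumption: may rejoin later


lmm = Lmm(wait_for_bfd_up=True)
lmm.request_add("member0")
lmm.on_bfd_state_change("member0", "Up")
print(sorted(lmm.active))
lmm.on_bfd_state_change("member0", "Down")
print(sorted(lmm.active))
```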
This document does not introduce any additional security issues and the security mechanisms defined in [RFC5880] apply in this document.
Routers compliant with this standard will need to process packets addressed to a new multicast address. This, however, should not open any new attack vector, as it is a link-local multicast address and the attacker would have to be on the same link as the router to launch such packets.
The IANA is requested to assign a well-known link-local multicast IP address: 224.0.0.X for IPv4 and FF02::X for IPv6.
Most of the text for this document came originally from draft-chen-bfd-interface-00.
We would like to thank Dave Katz, Alexander Vainshtein, Greg Mirsky and Jeff Tantsura for their comments on this draft.
We would also like to thank the members of the BFD WG who expressed strong support for the need to run BFD on all the member links of a LAG.
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |
[RFC5880] | Katz, D. and D. Ward, "Bidirectional Forwarding Detection (BFD)", RFC 5880, June 2010. |
[RFC5881] | Katz, D. and D. Ward, "Bidirectional Forwarding Detection (BFD) for IPv4 and IPv6 (Single Hop)", RFC 5881, June 2010. |
[RFC5882] | Katz, D. and D. Ward, "Generic Application of Bidirectional Forwarding Detection (BFD)", RFC 5882, June 2010. |
[RFC1042] | Postel, J. and J. Reynolds, "Standard for the transmission of IP datagrams over IEEE 802 networks", STD 43, RFC 1042, February 1988. |
[RFC2328] | Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998. |
[RFC4271] | Rekhter, Y., Li, T. and S. Hares, "A Border Gateway Protocol 4 (BGP-4)", RFC 4271, January 2006. |
[this section will finally go away. It documents some of the discussions and decisions made recently on the BFD group list]
The destination IP address for the BFD control packets for the micro BFD sessions can be Unicast or Multicast. Each has its set of advantages and disadvantages.
Advantages with using a Unicast IP destination address:
Disadvantages with using a Unicast IP destination address:
Advantages with using a Multicast IP destination address:
Disadvantages with using a Multicast IP destination address:
Based on the above analysis, we decided to go with the multicast IP addressing scheme for the micro BFD sessions.
[Either this section or the corresponding parts of Section 3.2 should remain in the final document. There is no intention to support both encapsulations]
The BFD packet is directly encapsulated into the Ethernet frame. The frame has the following format: Ethernet 802.3 header, then either:
In both cases the Ethernet payload must be padded with zeros to reach 46 bytes if the size is not already larger.
When receiving an Ethernet frame the payload, without any potential LLC/SNAP header, is used for further BFD processing. Additional padding data MUST be ignored if it was required to reach the minimum payload length of 46 bytes.
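The padding rules can be sketched as follows. The sender side pads with zeros up to 46 bytes. For the receiver, this sketch assumes that any LLC/SNAP header has already been removed and uses the Length field of the BFD Control packet (the fourth byte, per RFC 5880, Section 4.1) to find where the packet ends; the document itself only states that padding must be ignored. Function and constant names are hypothetical.

```python
MIN_ETHERNET_PAYLOAD = 46   # minimum Ethernet payload size in bytes
BFD_MIN_LENGTH = 24         # mandatory section of a BFD Control packet


def pad_payload(bfd_packet: bytes) -> bytes:
    """Pad with zeros to the minimum payload size; longer packets are unchanged."""
    return bfd_packet.ljust(MIN_ETHERNET_PAYLOAD, b"\x00")


def strip_padding(payload: bytes) -> bytes:
    """Recover the BFD packet from a received payload, ignoring padding."""
    if len(payload) < BFD_MIN_LENGTH:
        raise ValueError("payload too short for a BFD Control packet")
    length = payload[3]                      # BFD Length field
    if length < BFD_MIN_LENGTH or length > len(payload):
        raise ValueError("invalid BFD Length field")
    return payload[:length]


# Example 24-byte packet: version 1, state Down, Detect Mult 3, Length 24,
# remaining fields zeroed for brevity.
packet = bytes([0x20, 0x40, 3, 24]) + bytes(20)
frame_payload = pad_payload(packet)
print(len(frame_payload))
print(strip_padding(frame_payload) == packet)
```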
IANA needs to assign an L2 multicast address 01-80-C2-XX-XX-XX that would be used as the destination MAC for all control packets in the micro BFD sessions.
A new Ethertype must be assigned by the IEEE Registration Authority to the BFD over Ethernet protocol that will be used for all micro BFD sessions.
[Needs more detailed work TBD.]