MPLS | C. Villamizar, Ed. |
Internet-Draft | OCCNC |
Intended status: Informational | K. Kompella |
Expires: August 16, 2014 | Juniper Networks |
| S. Amante |
| Apple Inc. |
| A.G. Malis |
| Huawei |
| C.M. Pignataro |
| Cisco |
| February 12, 2014 |
MPLS Forwarding Compliance and Performance Requirements
draft-ietf-mpls-forwarding-07
This document provides guidelines for implementers regarding MPLS forwarding and a basis for evaluations of forwarding implementations. Guidelines cover many aspects of MPLS forwarding. Topics are highlighted where implementers might otherwise overlook practical requirements that are unstated or underemphasized, or that are optional for conformance to RFCs but are often considered mandatory by providers.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 16, 2014.
Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
The initial purpose of this document was to address concerns raised on the MPLS WG mailing list about shortcomings in implementations of MPLS forwarding. Documenting existing misconceptions and potential pitfalls might help avoid repeating past mistakes. The document has grown to address a broad set of forwarding requirements.
The focus of this document is MPLS forwarding, base pseudowire forwarding, and MPLS Operations, Administration, and Maintenance (OAM). The use of the pseudowire control word and sequence number is discussed. Specific pseudowire Attachment Circuit (AC) and Native Service Processing (NSP) are out of scope. Specific pseudowire applications, such as various forms of Virtual Private Network (VPN), are out of scope.
MPLS support for multipath techniques is considered essential by many service providers and is useful for other high capacity networks. In order to obtain sufficient entropy from MPLS traffic, service providers and others find it essential for the MPLS implementation to interpret the MPLS payload as IPv4 or IPv6 based on the contents of the first nibble of payload. The use of IP addresses, the IP protocol field, and UDP and TCP port number fields in multipath load balancing is considered within scope. The use of any other IP protocol fields, such as tunneling protocols carried within IP, is out of scope.
Implementation details are a local matter and are out of scope. Most interfaces today operate at 1 Gb/s or greater. It is assumed that all forwarding operations are implemented in specialized forwarding hardware rather than on a general purpose processor. The distinction between the two is often referred to as "fast path" and "slow path" processing. Some recommendations are made regarding implementing control or management plane functionality in specialized hardware or with limited assistance from specialized hardware. This advice is based on expected control or management protocol loads and on the need for denial of service (DoS) protection.
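The first nibble check described above can be sketched as follows. This is a minimal illustration rather than a description of any particular implementation; the function name classify_payload and its return strings are hypothetical, and the payload is assumed to be the bytes that follow the bottom of the label stack.

```python
def classify_payload(payload: bytes) -> str:
    """Guess the MPLS payload type from its first nibble.

    Hypothetical helper: 'payload' is assumed to start immediately
    after the label stack entry that has the bottom-of-stack bit set.
    """
    if not payload:
        return "unknown"
    first_nibble = payload[0] >> 4  # upper four bits of the first byte
    if first_nibble == 4:
        return "ipv4"   # IP version field of an IPv4 header
    if first_nibble == 6:
        return "ipv6"   # IP version field of an IPv6 header
    return "other"      # not treated as IP; no IP fields are hashed
```

A payload classified as "other" would contribute only label stack information to a multipath hash in this sketch.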
The following abbreviations are used.
This document is informational. The upper case [RFC2119] key words are not used in this document, except in the following cases.
Advice provided by this document may be ignored by implementations. Similarly, implementations not claiming conformance to specific RFCs may ignore the requirements of those RFCs. In both cases, implementers should consider the risk of doing so.
In early generations of forwarding silicon (which might now be behind us), there apparently were some misconceptions about MPLS. The following statements provide clarifications.
See Section 2.3.
The following statements provide clarification regarding more recent requirements that are often missed.
This document is intended for multiple audiences: implementer (implementing MPLS forwarding in silicon or in software); systems designer (putting together an MPLS forwarding system); deployer (running an MPLS network). These guidelines are intended to serve the following purposes:
The implementer, systems designer, and deployer have a transitive supplier-customer relationship. It is in the best interest of the supplier to review their product against their customer's checklist and secondary customer's checklist if applicable.
This document identifies and explains many details and potential pitfalls of MPLS forwarding. It is likely that the identified set of potential pitfalls will later prove to be an incomplete set.
A brief review of forwarding issues is provided in the subsections that follow. This section provides some background on why some of these requirements exist. The questions to ask of suppliers are covered in Section 3. Some guidelines for testing are provided in Section 4.
Basic MPLS architecture and MPLS encapsulation, and therefore packet forwarding, are defined in [RFC3031] and [RFC3032]. RFC3031 and RFC3032 are somewhat LDP centric. RSVP-TE supports traffic engineering (TE) and fast reroute, features that LDP lacks. The base document for RSVP-TE based MPLS is [RFC3209].
A few RFCs update RFC3032. Those with impact on forwarding include the following.
Tunneling encapsulations carrying MPLS, such as MPLS in IP [RFC4023], MPLS in GRE [RFC4023], MPLS in L2TPv3 [RFC4817], or MPLS in UDP [I-D.ietf-mpls-in-udp], are out of scope.
Other RFCs have implications for MPLS forwarding and do not update RFC3032 or RFC3209, including:
A few RFCs update RFC3209. Those that are listed as updating RFC3209 generally impact only RSVP-TE signaling. Forwarding is modified by major extensions built upon RFC3209.
RFCs which impact forwarding are discussed in the following subsections.
[RFC3032] specifies that label values 0-15 are special purpose labels with special meanings. [I-D.ietf-mpls-special-purpose-labels] renamed these from the term "reserved labels" used in [RFC3032] to "special purpose labels". Three values of NULL label are defined (two of which are later updated by [RFC4182]) and a router-alert label is defined. The original intent was that special purpose labels, except the NULL labels, could be sent to the routing engine CPU rather than be processed in forwarding hardware. Hardware support is required by new RFCs such as those defining entropy label and OAM processed as a result of receiving a GAL. For new special purpose labels, some accommodation is needed for LSR that will send the labels to a general purpose CPU or other highly programmable hardware. For example, ELI will only be sent to LSR which have signaled support for [RFC6790] and high OAM packet rate must be negotiated among endpoints.
[RFC3429] reserves a label for ITU-T Y.1711. However, Y.1711 does not work with multipath and its use is strongly discouraged.
The current list of special purpose labels can be found on the "Multiprotocol Label Switching Architecture (MPLS) Label Values" registry reachable at IANA's pages at http://www.iana.org.
[I-D.ietf-mpls-special-purpose-labels] introduces an IANA "Extended Special Purpose MPLS Label Values" registry and makes use of the "extension" label, label 15, to indicate that the next label is an extended special purpose label and requires special handling. The range of only 16 values for special purpose labels allows a table to be used. The range of extended special purpose labels, with 20 bits available for use, may have to be handled in some other way in the unlikely event that in the future the range of currently reserved values 256-1048575 is used. If only the standards action range, 16-239, and the experimental range, 240-255, are used, then a table of 256 entries can be used.
Unknown special purpose labels and unknown extended special purpose labels are handled the same. When an unknown special purpose label is encountered or a special purpose label not directly handled in forwarding hardware is encountered, the packet should be sent to a general purpose CPU by default. If this capability is supported, there must be an option to either drop or rate limit such packets on a per special purpose label value basis.
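The table-based handling described above might be sketched as follows. This is an illustration under stated assumptions: the Action names, the choice of which label values are handled in hardware, and the helper names are all hypothetical. Only the 16-entry range and the default of sending unhandled labels to a general purpose CPU come from the text.

```python
from enum import Enum

class Action(Enum):
    HARDWARE = "handle in forwarding hardware"
    PUNT = "send to general purpose CPU"
    RATE_LIMIT = "rate limit, then send to CPU"
    DROP = "drop"

SPL_COUNT = 16  # special purpose label values 0-15

def default_table(hw_handled):
    """Build a 16-entry table; anything not handled in hardware is punted."""
    return [Action.HARDWARE if value in hw_handled else Action.PUNT
            for value in range(SPL_COUNT)]

def lookup(table, label):
    """Return the configured action for a special purpose label."""
    if not 0 <= label < SPL_COUNT:
        raise ValueError("not a special purpose label")
    return table[label]

# Assumed example: explicit NULL (0 and 2), ELI (7), and GAL (13)
# are handled in hardware on this hypothetical LSR.
table = default_table({0, 2, 7, 13})
# Per-label operator override, as the text requires to be possible.
table[9] = Action.DROP
```

The same structure extends to a 256-entry table for extended special purpose labels if only the standards action and experimental ranges are in use.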
[RFC2474] deprecates the IP Type of Service (TOS) and IP Precedence (Prec) fields and replaces them with the Differentiated Services Field, more commonly known as the Differentiated Services Code Point (DSCP) field. [RFC2475] defines the Differentiated Services architecture, which in other forums is often called a Quality of Service (QoS) architecture.
MPLS uses the Traffic Class (TC) field to support Differentiated Services [RFC5462]. There are two primary documents describing how DSCP is mapped into TC.
To meet Differentiated Services requirements specified in [RFC3270], the following forwarding requirements must be met. An ingress LER MUST be able to select an LSP and then apply a per LSP map of DSCP into TC. A midpoint LSR MUST be able to apply a per LSP map of TC to PHB. The number of mappings supported will be far less than the number of LSP supported.
To meet Differentiated Services requirements specified in [RFC4124], the following forwarding requirements must be met. An ingress LER MUST be able to select an LSP and then apply a per LSP map of DSCP into TC. A midpoint LSR MUST be able to apply a per LSP map of LSP to Class Type (CT) and then use the CT to select the map of TC to PHB. Since there are only eight allowed values of CT, only eight maps of TC to PHB need to be supported. The LSP label can be used directly to find the TC to PHB mapping, as is needed to support [RFC3270] L-LSP.
While support for [RFC4124] and not [RFC3270] would allow support for only eight mappings of TC to PHB, it is common to support both and simply state a limit on the number of unique TC to PHB mappings which can be supported.
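Because the number of distinct maps is far smaller than the number of LSP, one plausible arrangement at an ingress LER is for each LSP to hold an index into a small shared set of maps. The following sketch assumes that arrangement. All LSP identifiers, map contents, and names are hypothetical and chosen only for illustration.

```python
# Hypothetical shared DSCP-to-TC maps; the values are illustrative only.
DSCP_TO_TC_MAPS = [
    {46: 5, 0: 0},         # map 0: EF (DSCP 46) to TC 5, default to TC 0
    {46: 7, 10: 2, 0: 0},  # map 1: a different policy, AF11 (DSCP 10) to TC 2
]

# Many LSP can share one map, so only an index is stored per LSP.
LSP_TO_MAP_INDEX = {1001: 0, 1002: 0, 1003: 1}

DEFAULT_TC = 0  # assumed TC for DSCP values absent from a map

def tc_for_packet(lsp_id: int, dscp: int) -> int:
    """Select the per LSP map, then map the packet's DSCP to a TC value."""
    dscp_map = DSCP_TO_TC_MAPS[LSP_TO_MAP_INDEX[lsp_id]]
    return dscp_map.get(dscp, DEFAULT_TC)
```

A midpoint LSR could use the same indirection for its TC to PHB maps, with the index derived from the LSP label or from the Class Type.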
PTP or NTP may be carried over MPLS [I-D.ietf-tictoc-1588overmpls]. Generally NTP will be carried within IP with IP carried in MPLS [RFC5905]. Both PTP and NTP benefit from accurate time stamping of incoming packets and the ability to insert accurate time stamps in outgoing packets. PTP correction which occurs when forwarding requires updating a timestamp compensation field based on the difference between packet arrival at an LSR and packet transmit time at that same LSR.
Since the label stack depth may vary, hardware should allow a timestamp to be placed in an outgoing packet at any specified byte position. It may be necessary to modify layer-2 checksums or frame check sequences after insertion. PTP and NTP timestamp formats differ slightly. If NTP or PTP is carried over UDP/IP or UDP/IP/MPLS, the UDP checksum will also have to be updated.
Accurate time synchronization, in addition to being generally useful, is required for MPLS-TP delay measurement (DM) OAM. See Section 2.6.4.
MPLS deployments in the early part of the prior decade (circa 2000) tended to support either LDP or RSVP-TE. LDP was favored by some for its ability to scale to a very large number of PE devices at the edge of the network, without adding deployment complexity. RSVP-TE was favored, generally in the network core, where traffic engineering and/or fast reroute were considered important.
Both LDP and RSVP-TE are used simultaneously within major Service Provider networks using a technique known as "LDP over RSVP-TE Tunneling". This technique allows service providers to carry LDP tunnels inside RSVP-TE tunnels. This makes it possible to take advantage of Traffic Engineering and Fast Re-Route on more expensive Inter-City and Inter-Continental transport paths. The ingress RSVP-TE PE places many LDP tunnels on a single RSVP-TE LSP and carries them to the egress RSVP-TE PE. The LDP PEs are situated further from the core, for example within a metro network. LDP over RSVP-TE tunneling requires a minimum of two MPLS labels: one each for LDP and RSVP-TE.
The use of MPLS FRR [RFC4090] might add one more label to MPLS traffic, but only when FRR protection is in use (active). If LDP over RSVP-TE is in use, and FRR protection is in use, then at least three MPLS labels are present on the label stack on the links through which the Bypass LSP traverses. FRR is covered in Section 2.1.7.
LDP L2VPN, LDP IPVPN, BGP L2VPN, and BGP IPVPN added support for VPN services that are deployed by the vast majority of service providers. These VPN services added yet another label, bringing the label stack depth (when FRR is active) to four.
Pseudowires and VPN are discussed in further detail in Section 2.1.8 and Section 2.1.9.
MPLS hierarchy as described in [RFC4206] and updated by [RFC7074] can in principle add at least one additional label. MPLS hierarchy is discussed in Section 2.1.6.
Other features such as Entropy Label (discussed in Section 2.4.4) and Flow Label (discussed in Section 2.4.3) can add additional labels to the label stack.
Although theoretical scenarios can easily result in eight or more labels, such cases are rare if they occur at all today. For the purpose of forwarding, only the top label needs to be examined if PHP is used, and a few more if UHP is used (see Section 2.5). For deep label stacks, quite a few labels may have to be examined for the purpose of load balancing across parallel links (see Section 2.4). However, this depth can be bounded by a provider through use of Entropy Label.
MPLS Link Bundling was the first RFC to address the need for multiple parallel links between nodes [RFC4201]. MPLS Link Bundling is notable in that it tried not to change MPLS forwarding, except in specifying the "All-Ones" component link. MPLS Link Bundling is seldom if ever deployed. Instead multipath techniques described in Section 2.4 are used.
MPLS hierarchy is defined in [RFC4206] and updated by [RFC7074]. Although RFC4206 is considered part of GMPLS, the Packet Switching Capable (PSC) portion of the MPLS hierarchy is applicable to MPLS and may be supported in an otherwise GMPLS-free implementation. The MPLS PSC hierarchy remains the most likely means of providing further scaling in an RSVP-TE MPLS network, particularly where the network is designed to provide RSVP-TE connectivity to the edges. This is the case for envisioned MPLS-TP networks. The use of the MPLS PSC hierarchy can add at least one additional label to a label stack, though it is likely that only one layer of PSC will be used in the near future.
Fast reroute is defined by [RFC4090]. Two significantly different methods are defined in RFC4090: the "One-to-One Backup" method, which uses the "Detour LSP", and the "Facility Backup" method, which uses a "bypass tunnel". These are commonly referred to as the detour and bypass methods respectively.
The detour method makes use of a presignaled LSP. Hardware assistance is needed for detour FRR only if necessary to accomplish local repair of a large number of LSP within the 10s of milliseconds target. For each affected LSP a swap operation must be reprogrammed or otherwise switched over. The use of detour FRR doubles the number of LSP terminating at any given hop and will increase the number of LSP within a network by a factor dependent on the average detour path length.
The bypass method makes use of a tunnel that is unused when no fault exists but may carry many LSP when a local repair is required. There is no presignaling indicating which working LSP will be diverted into any specific bypass LSP. The merge LSR (egress LSR of the bypass LSP) MUST use platform label space (as defined in [RFC3031]) so that an LSP working path on any given interface can be backed up using a bypass LSP terminating on any other interface. Hardware assistance is needed if necessary to accomplish local repair of a large number of LSP within the 10s of milliseconds target. For each affected LSP a swap operation must be reprogrammed or otherwise switched over with an additional push of the bypass LSP label. The use of platform label space impacts the size of the LSR ILM for LSR with a very large number of interfaces.
The pseudowire (PW) architecture is defined in [RFC3985]. A pseudowire, when carried over MPLS, adds one or more additional label entries to the MPLS label stack. A PW Control Word is defined in [RFC4385] with motivation for defining the control word in [RFC4928]. The PW Associated Channel defined in [RFC4385] is used for OAM in [RFC5085]. The PW Flow Label is defined in [RFC6391] and is discussed further in this document in Section 2.4.3.
There are numerous pseudowire encapsulations, supporting emulation of services such as Frame Relay, ATM, Ethernet, TDM, and SONET/SDH over packet switched networks (PSNs) using IP or MPLS.
The pseudowire encapsulation is out of scope for this document. Pseudowire impact on MPLS forwarding at midpoint LSR is within scope. The impact on ingress MPLS push and egress MPLS UHP pop are within scope. While pseudowire encapsulation is out of scope, some advice is given on sequence number support.
Pseudowire (PW) sequence number support is most important for PW payload types with a high expectation of lossless and/or in-order delivery. Identifying lost PW packets and the exact amount of lost payload is critical for PW services which maintain bit timing, such as Time Division Multiplexing (TDM) services, since these services MUST compensate for lost payload on a bit-for-bit basis.
With PW services which maintain bit timing, packets that have been received out of order also MUST be identified and MAY be either re-ordered or dropped. Resequencing requires, in addition to sequence numbering, a "reorder buffer" in the egress PE, and ability to reorder is limited by the depth of this buffer. The down side of maintaining a large reorder buffer is added end-to-end service delay.
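The relationship between sequence numbers and a bounded reorder buffer can be illustrated with the sketch below. It is a simplification under stated assumptions: sequence numbers are treated as ever-increasing integers (wraparound of the PW sequence number field is ignored), late packets are dropped, and the class and method names are hypothetical.

```python
class ReorderBuffer:
    """Resequence packets using a buffer holding at most 'depth' packets."""

    def __init__(self, depth: int, first_seq: int = 1):
        self.depth = depth          # maximum packets held while waiting
        self.next_seq = first_seq   # next sequence number to release
        self.pending = {}           # seq -> payload, held out of order

    def receive(self, seq: int, payload):
        """Accept one packet; return the payloads released in order."""
        if seq < self.next_seq:
            return []               # arrived too late: drop
        self.pending[seq] = payload
        if len(self.pending) > self.depth:
            # Buffer exceeded: treat the missing packets as lost and
            # skip ahead to the oldest packet that is being held.
            self.next_seq = min(self.pending)
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

A deeper buffer tolerates more reordering before declaring loss, at the cost of holding packets longer, which is the delay tradeoff noted above.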
For PW services which maintain bit timing or any other service where jitter must be bounded, a jitter buffer is always necessary. The jitter buffer is needed regardless of whether reordering is done. In order to be effective, a reorder buffer must often be larger than a jitter buffer needs to be, creating a tradeoff between reducing loss and minimizing delay.
PW services which are not timing critical bit streams in nature are cell oriented or frame oriented. Resequencing support may be beneficial to PW cell and frame oriented payloads such as ATM, FR, and Ethernet, but this support is desirable rather than required. Requirements to handle out of order packets at all vary among services and deployments. For example, for Ethernet PW, occasional (very rare) reordering is usually acceptable. If the Ethernet PW is carrying MPLS-TP, then this reordering may be acceptable.
Reducing jitter is best done by an end-system, given that the tradeoff of loss vs delay varies among services. For example with interactive real time services low delay is preferred, while with non-interactive (one way) real time services low loss is preferred. The same end-site may be receiving both types of traffic. Regardless of this, bounded jitter is sometimes a requirement for specific deployments.
Packet reordering should be rare except in a small number of circumstances, most of which are due to network design or equipment design errors:
In provider networks which use multipath techniques and which may occasionally rebalance traffic or which may change PW paths occasionally for other reasons, reordering may be far more common than loss. Where reordering is more common than loss, resequencing packets is beneficial, rather than dropping packets at egress when out of order arrival occurs. Resequencing is most important for PW payload types with a high expectation of lossless delivery since in such cases out of order delivery within the network results in PW loss.
Layer-2 VPN [RFC4664] and Layer-3 VPN [RFC4110] add one or more label entries to the MPLS label stack. VPN encapsulations are out of scope for this document. Their impact on forwarding at midpoint LSR is within scope.
Any of these services may be used on an MPLS entropy label enabled ingress and egress (see Section 2.4.4 for discussion of entropy label) which would add an additional two labels to the MPLS label stack. The need to provide a useful entropy label value impacts the requirements of the VPN ingress LER but is out of scope for this document.
MPLS Multicast encapsulation is clarified in [RFC5332]. MPLS Multicast may be signaled using RSVP-TE [RFC4875] or LDP [RFC6388].
[RFC4875] defines a root initiated RSVP-TE LSP setup rather than the leaf initiated join used in IP multicast. [RFC6388] defines a leaf initiated LDP setup. Both [RFC4875] and [RFC6388] define point to multipoint (P2MP) LSP setup. [RFC6388] also defines multipoint to multipoint (MP2MP) LSP setup.
The P2MP LSP have a single source. An LSR may be a leaf node, an intermediate node, or a "bud" node. A bud serves as both a leaf and an intermediate node. At a leaf an MPLS pop is performed. The payload may be an IP Multicast packet that requires further replication. At an intermediate node an MPLS swap operation is performed. The bud requires that both a pop operation and a swap operation be performed for the same incoming packet.
One strategy to support P2MP functionality is to pop at the LSR interface serving as ingress to the P2MP traffic and then optionally push labels at each LSR interface serving as egress to the P2MP traffic at that same LSR. A given LSR egress chip may support multiple egress interfaces, each of which requires a copy, but each with a different set of added labels and layer-2 encapsulation. Some physical interfaces may have multiple sub-interfaces (such as Ethernet VLAN or channelized interfaces) each requiring a copy.
If packet replication is performed at LSR ingress, then the ingress interface performance may suffer. If the packet replication is performed within an LSR switching fabric and at LSR egress, congestion of egress interfaces cannot make use of backpressure to ingress interfaces using techniques such as virtual output queuing (VOQ). If buffering is primarily supported at egress, then the need for backpressure is minimized. There may be no good solution for high volumes of multicast traffic if VOQ is used.
Careful consideration should be given to the performance characteristics of high fanout multicast for equipment that is intended to be used in such a role.
MP2MP LSP differ in that any branch may provide an input, including a leaf. Packets must be replicated onto all other branches. This forwarding is often implemented as multiple P2MP forwarding trees, one for each potential input interface at a given LSR.
While average packet size of Internet traffic may be large, long sequences of small packets have both been predicted in theory and observed in practice. Traffic compression and TCP ACK compression can conspire to create long sequences of packets of 40-44 bytes in payload length. If carried over Ethernet, the 64 byte minimum payload applies, yielding a packet rate of approximately 150 Mpps (million packets per second) for the duration of the burst on a nominal 100 Gb/s link. The peak rate for other encapsulations can be as high as 250 Mpps (for example IP or MPLS encapsulated using GFP over OTN ODU4).
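The Ethernet figure above can be reproduced with a short calculation. The sketch assumes 20 bytes of per-packet overhead on the wire (8 bytes of preamble and start-of-frame delimiter plus a 12 byte inter-frame gap) in addition to the 64 byte minimum; the function name is hypothetical.

```python
def ethernet_packet_rate(link_bps: float, frame_bytes: int = 64,
                         overhead_bytes: int = 20) -> float:
    """Packets per second for back-to-back frames of one size.

    overhead_bytes is an assumption: 8 bytes of preamble and
    start-of-frame delimiter plus a 12 byte inter-frame gap.
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire

# Nominal 100 Gb/s link, minimum size frames: about 148.8 Mpps.
rate_mpps = ethernet_packet_rate(100e9) / 1e6
```

The result, roughly 148.8 Mpps, is consistent with the approximate 150 Mpps cited above.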
It is possible that the packet rate achieved by a specific implementation is acceptable for a minimum payload size, such as 64 byte (64B) payload for Ethernet, but the achieved rate declines to an unacceptable level for other packet sizes, such as 65B payload. There are other packet rates of interest besides TCP ACK. For example, a TCP ACK carried over an Ethernet PW over MPLS over Ethernet may occupy 82B or 82B plus an increment of 4B if additional MPLS labels are present.
A graph of packet rate vs. packet size often displays a sawtooth. The sawtooth is commonly due to a memory bottleneck and memory widths, sometimes internal cache, but often a very wide external buffer memory interface. In some cases it may be due to a fabric transfer width. A fine packing, rounding up to the nearest 8B or 16B, will result in a fine sawtooth with small degradation for 65B, and even less for 82B packets. A coarse packing, rounding up to 64B, can yield a sharper drop in performance for 65B packets, or perhaps more important, a larger drop for 82B packets.
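The effect of packing granularity can be illustrated by counting how many memory or fabric words a packet occupies once its size is rounded up to the transfer width. This is a simplified model that assumes the width is the only bottleneck; the function names are hypothetical.

```python
import math

def words_per_packet(packet_bytes: int, width_bytes: int) -> int:
    """Number of fixed-width transfers needed to move one packet."""
    return math.ceil(packet_bytes / width_bytes)

def packing_efficiency(packet_bytes: int, width_bytes: int) -> float:
    """Fraction of the transferred bytes that are actual packet bytes."""
    moved = words_per_packet(packet_bytes, width_bytes) * width_bytes
    return packet_bytes / moved

# Compare fine (8B, 16B) and coarse (64B) packing for the sizes above.
for size in (64, 65, 82):
    print(size, [round(packing_efficiency(size, w), 3) for w in (8, 16, 64)])
```

In this model a 65B packet uses two 64B words, so about half of the transferred bytes are padding, while with 8B packing the same packet wastes only 7 of 72 bytes.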
The loss of some TCP ACK packets is not the primary concern when such a burst occurs. When a burst occurs, any other packets, regardless of packet length and packet QoS, are dropped once on-chip input buffers prior to the decision engine are exceeded. Buffers in front of the packet decision engine are often very small or non-existent (less than one packet of buffer), causing significant QoS agnostic packet drop.
Internet service providers and content providers at one time specified full rate forwarding with 40 byte payload packets as a requirement. Today, this requirement often can be waived if the provider can be convinced that when long sequences of short packets occur no packets will be dropped.
Many equipment suppliers have pointed out that the extra cost in designing hardware capable of processing the minimum size packets at full line rate is significant for very high speed interfaces. If hardware is not capable of processing the minimum size packets at full line rate, then that hardware MUST be capable of handling large bursts of small packets, a condition which is often observed. This level of performance is necessary to meet Differentiated Services [RFC2475] requirements, for without it, packets are lost prior to inspection of the IP DSCP field [RFC2474] or MPLS TC field [RFC5462].
With adequate on-chip buffers before the packet decision engine, an LSR can absorb a long sequence of short packets. Even if the output is slowed to the point where light congestion occurs, the packets, having cleared the decision process, can make use of larger VOQ or output side buffers and be dealt with according to configured QoS treatment, rather than dropped completely at random.
These on-chip buffers need not contribute significant delay since they are only used when the packet decision engine is unable to keep up, not in response to congestion, plus these buffers are quite small. For example, an on-chip buffer capable of handling 4K packets of 64 bytes in length, or 256KB, corresponds to roughly 0.2 msec on a 10 Gb/s link and roughly 20 usec on a 100 Gb/s link. If the packet decision engine is capable of handling packets at 90% of the full rate for small packets, then the maximum added delay is roughly one tenth of that, about 20 usec and 2 usec respectively, and this delay only applies if a 4K burst of short packets occurs. When no burst of short packets is being processed, no delay is added.
Packet rate requirements apply regardless of which network tier equipment is deployed in. Whether deployed in the network core or near the network edges, one of the two conditions MUST be met if Differentiated Services requirements are to be met:
In any large provider, whether a service provider or a content provider, hash based multipath techniques are used in the core and at the edge. In many of these providers hash based multipath is also used in the larger metro networks.
The Differentiated Services requirements for good reasons dictate that packets within a common microflow SHOULD NOT be reordered [RFC2474]. Service providers generally impose stronger requirements, commonly requiring that packets within a microflow MUST NOT be reordered except in rare circumstances such as load balancing across multiple links, a path change for load balancing, or a path change for another reason.
The most common multipath techniques are ECMP applied at the IP forwarding level, Ethernet LAG with inspection of the IP payload, and multipath on links carrying both IP and MPLS, where the IP header is inspected below the MPLS label stack. In most core networks, the vast majority of traffic is MPLS encapsulated.
In order to support an adequately balanced load distribution across multiple links, IP header information must be used. Common practice today is to reinspect the IP headers at each LSR and use the label stack and IP header information in a hash performed at each LSR. Further details are provided in Section 2.4.5.
The use of this technique is so ubiquitous in provider networks that lack of support for multipath makes any product unsuitable for use in large core networks. This will continue to be the case in the near future, even as deployment of MPLS entropy label begins to relax the core LSR multipath performance requirements given the existing deployed base of edge equipment without the ability to add an entropy label.
A generation of edge equipment supporting the ability to add an MPLS entropy label is needed before the performance requirements for core LSR can be relaxed. However, it is likely that two generations of deployment in the future will allow core LSR to support full packet rate only when a relatively small number of MPLS labels need to be inspected before hashing. For now, don't count on it.
Common practice today is to reinspect the packet at each LSR and use information from the packet combined with a hash seed that is selected by each LSR. Where flow labels or entropy labels are used, a hash seed must be used when creating these labels.
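The role of the per LSR hash seed can be sketched as follows. The sketch uses keyed BLAKE2b from the Python standard library purely as a stand-in; forwarding hardware uses much simpler hash functions, and the helper names, the choice of fields in the flow key, and the modulo link selection are assumptions made for illustration.

```python
import hashlib
import ipaddress
import struct

def flow_key(src: str, dst: str, proto: int, sport: int, dport: int) -> bytes:
    """Pack the fields used for load balancing into a byte string."""
    return (ipaddress.ip_address(src).packed
            + ipaddress.ip_address(dst).packed
            + struct.pack("!BHH", proto, sport, dport))

def select_link(key: bytes, seed: int, n_links: int) -> int:
    """Pick a component link from the flow key and this LSR's seed."""
    digest = hashlib.blake2b(key, digest_size=8,
                             key=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % n_links

key = flow_key("192.0.2.1", "198.51.100.7", 6, 49152, 443)
link_at_lsr_a = select_link(key, seed=0x1234, n_links=4)
link_at_lsr_b = select_link(key, seed=0x9ABC, n_links=4)
```

All packets of a flow produce the same key and therefore stay on one link at a given LSR, while different seeds keep the choices made at successive LSR from being correlated.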
Within the core of a network some form of multipath is almost certain to be used. Multipath techniques deployed today are likely to be looking beneath the label stack for an opportunity to hash on IP addresses.
A pseudowire encapsulated at a network edge must have a means to prevent reordering within the core if the pseudowire will be crossing a network core, or any part of a network topology where multipath is used (see [RFC4385] and [RFC4928]).
Not supporting the ability to encapsulate a pseudowire with a control word may lock a product out from consideration. A pseudowire capability without control word support might be sufficient for applications that are strictly both intra-metro and low bandwidth. However a provider with other applications will very likely not tolerate having equipment which can only support a subset of their pseudowire needs.
Where multipath makes use of a simple hash and a simple load balance such as modulo or other fixed allocation (see Section 2.4), large microflows can cause problems. If each large microflow consumes 10% of the capacity of a component link of a potentially congested composite link, one such microflow can upset the traffic balance, and more than one can in effect reduce the effective capacity of the entire composite link by more than 10%.
When even a very small number of large microflows are present, there is a significant probability that more than one of these large microflows could fall on the same component link. If the traffic contribution from large microflows is small, the probability for three or more large microflows on the same component link drops significantly. Therefore in a network where a significant number of parallel 10 Gb/s links exists, even a 1 Gb/s pseudowire or other large microflow that could not otherwise be subdivided into smaller flows should carry a flow label or entropy label if possible.
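The likelihood of large microflows sharing a component link can be estimated with a simple model. The sketch assumes that each large microflow is hashed uniformly and independently onto one of the component links, which is an idealization; the function name is hypothetical.

```python
def prob_shared_link(n_flows: int, n_links: int) -> float:
    """Probability that at least two flows land on the same link."""
    if n_flows > n_links:
        return 1.0  # more flows than links: sharing is unavoidable
    p_all_distinct = 1.0
    for i in range(n_flows):
        # The (i+1)-th flow must avoid the i links already occupied.
        p_all_distinct *= (n_links - i) / n_links
    return 1.0 - p_all_distinct

# Two large microflows over eight links share a link 12.5% of the time.
p_two = prob_shared_link(2, 8)
# Three large microflows over eight links: about 34%.
p_three = prob_shared_link(3, 8)
```

Even with only a handful of large microflows, the chance that some pair shares a component link is far from negligible under this model.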
Active management of the hash space to better accommodate large microflows has been implemented and deployed in the past; however, such techniques are out of scope for this document.
Unlike a pseudowire control word, a pseudowire flow label [RFC6391] is required only for relatively large capacity pseudowires. There are many cases where a pseudowire flow label makes sense. Any service such as a VPN which carries IP traffic within a pseudowire can make use of a pseudowire flow label.
Any pseudowire carried over MPLS which makes use of the pseudowire control word and does not carry a flow label is in effect a single microflow (in [RFC2475] terms) and may result in the types of problems described in Section 2.4.2.
The MPLS entropy label [RFC6790] simplifies flow group identification at midpoint LSRs. Prior to the MPLS entropy label, midpoint LSRs needed to inspect the entire label stack and often the IP headers to provide an adequate distribution of traffic when using multipath techniques (see Section 2.4.5). With the use of the MPLS entropy label, a hash can be performed closer to network edges, placed in the label stack, and used by midpoint LSRs without fully reinspecting the label stack and inspecting the payload.
The MPLS entropy label is capable of avoiding full label stack and payload inspection within the core where performance levels are most difficult to achieve (see Section 2.3). The label stack inspection can be terminated as soon as the first entropy label is encountered, which is generally after a small number of labels are inspected.
In order to provide these benefits in the core, LSRs closer to the edge must be capable of adding an entropy label. This support may not be required in the access tier, the tier closest to the customer, but is likely to be required in the edge or the border to the network core. LSRs peering with external networks will also need to be able to add an entropy label on incoming traffic.
The most common multipath techniques are based on a hash over a set of fields. Regardless of whether a hash or some other method is used, there is a limited set of fields which can safely be used for multipath.
If the "outer" or "first" layer of encapsulation is MPLS, then label stack entries are used in the hash. Within a finite amount of time (and for small packets arriving at high speed that time can be quite limited) only a finite number of label entries can be inspected. Pipelined or parallel architectures improve this, but the limit is still finite.
The following guidelines are provided for use of MPLS fields in multipath load balancing.
Apparently some chips have made use of the TC (formerly EXP) bits as a source of entropy. This is very harmful since it will reorder Assured Forwarding (AF) traffic [RFC2597] when a subset does not conform to the configured rates and is remarked but not dropped at a prior LSR. Traffic which uses MPLS ECN [RFC5129] can also be reordered if TC is used for entropy. Therefore, as stated in the guidelines above, the TC field (formerly EXP) MUST NOT be used in multipath load balancing as it violates Differentiated Services Ordered Aggregate (OA) requirements in these two instances.
Use of the MPLS label entry S bit would result in putting OAM traffic on a different path if the addition of a GAL at the bottom of stack removed the S bit from the prior label.
If an ELI label is found, then if the LSR supports entropy label, the EL label field in the next label entry (the EL) SHOULD be used and the search for additional entropy within the packet SHOULD be terminated. Failure to terminate the search will impact client MPLS-TP LSP carried within server MPLS LSP. A network operator has the option to use administrative attributes as a means to identify LSR which do not terminate the entropy search at the first EL. Administrative attributes are defined in [RFC3209]. Some configuration is required to support this.
If the label removed by a PHP pop is not used, then for any PW for which CW is used, there is no basis for multipath load split. In some networks it is infeasible to put all PW traffic on one component link. Any PW which does not use CW will be improperly split regardless of whether the label removed by a PHP pop is used. Therefore the PHP pop label SHOULD be used as recommended above.
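The points above about the TC field, the S bit, and the ELI can be illustrated with a small sketch of how a midpoint LSR might collect hash inputs from a label stack. This is a sketch under stated assumptions, not a definitive implementation: the function names, the `max_depth` limit, and the choice to skip label values 0-15 are illustrative assumptions. The ELI value of 7 is taken from [RFC6790].

```python
ELI = 7                          # Entropy Label Indicator value per RFC 6790
MAX_SPECIAL_PURPOSE_LABEL = 15   # assumption: values 0-15 are never hashed


def make_lse(label: int, tc: int = 0, s: int = 0, ttl: int = 64) -> int:
    """Build a 32-bit label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    return (label << 12) | (tc << 9) | (s << 8) | ttl


def parse_lse(entry: int):
    """Split a 32-bit label stack entry into (label, tc, s, ttl)."""
    return entry >> 12, (entry >> 9) & 0x7, (entry >> 8) & 0x1, entry & 0xFF


def mpls_hash_inputs(stack, supports_el=True, max_depth=8):
    """Return (labels, search_terminated) for a stack listed top entry first.

    Only the 20-bit label values are returned; TC, S, and TTL never
    contribute to the hash. max_depth models the finite inspection budget.
    """
    labels = []
    i = 0
    while i < min(len(stack), max_depth):
        label, _tc, s, _ttl = parse_lse(stack[i])
        if label == ELI and supports_el and i + 1 < len(stack):
            # Use the EL that follows the ELI and stop looking for entropy.
            labels.append(parse_lse(stack[i + 1])[0])
            return labels, True
        if label > MAX_SPECIAL_PURPOSE_LABEL:
            labels.append(label)
        if s:
            break  # bottom of stack reached
        i += 1
    return labels, False
```

For example, a stack of a tunnel label 1000, an ELI, and an EL of 0xABCDE yields `([1000, 0xABCDE], True)`, and the result is unchanged if an upstream LSR remarks the TC bits. When `search_terminated` is true, the caller would not go on to inspect the payload.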
Inspecting the IP payload provides the most entropy in provider networks. The practice of looking past the bottom of stack label for an IP payload is well accepted and documented in [RFC4928] and in other RFCs.
Where IP is mentioned in the document, both IPv4 and IPv6 apply. All LSRs MUST fully support IPv6.
When information in the IP header is used, the following guidelines apply:
This document makes the following recommendations. These recommendations are not required to claim compliance with any existing RFC; therefore, implementers are free to ignore them but, given service provider requirements, should consider the risk of doing so. The use of IP addresses MUST be supported, and the use of TCP and UDP ports (conditional on the protocol field and properly located) MUST be supported. The ability to disable use of UDP and TCP ports MUST be available. Though potentially very useful in some networks, it is uncommon to support using payloads of tunneling protocols carried over IP. Though the use of tunneling protocol header information is out of scope for this document, it is not discouraged.
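The recommendations on IP addresses and ports can be sketched as a field extractor for an IPv4 payload. This is a minimal illustration with assumptions: the function name and return shape are hypothetical, IPv6 is omitted for brevity even though it applies equally, and skipping ports on fragments is an assumption of this sketch. "Properly located" is modeled by honoring the header length (IHL) field, and `use_ports` models the required ability to disable port hashing.

```python
import struct

TCP, UDP = 6, 17  # IP protocol numbers


def ipv4_hash_fields(packet: bytes, use_ports: bool = True):
    """Return (src, dst, sport, dport) for hashing, or None if not IPv4.

    sport and dport are None when ports are disabled, the protocol is not
    TCP or UDP, the packet is a fragment, or the ports are not present.
    """
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return None
    ihl = (packet[0] & 0x0F) * 4          # header length in bytes
    if ihl < 20 or len(packet) < ihl:
        return None
    flags_frag = struct.unpack_from("!H", packet, 6)[0]
    # MF flag set or a non-zero fragment offset marks a fragment.
    is_fragment = bool(flags_frag & 0x2000) or bool(flags_frag & 0x1FFF)
    protocol = packet[9]
    src, dst = packet[12:16], packet[16:20]
    sport = dport = None
    if (use_ports and protocol in (TCP, UDP) and not is_fragment
            and len(packet) >= ihl + 4):
        # Ports sit immediately after the IP header, options included.
        sport, dport = struct.unpack_from("!HH", packet, ihl)
    return src, dst, sport, dport
```

Keeping the address-only result available through `use_ports=False` lets an operator fall back to a coarser but safer hash where port-based hashing causes problems.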
The ingress to a pseudowire (PW) can extract information from the payload being encapsulated to create a flow label. [RFC6391] references IP carried in Ethernet as an example. The Native Service Processing (NSP) function defined in [RFC3985] differs with pseudowire type. It is in the NSP function where information for a specific type of PW can be extracted for use in a flow label. Which fields to use for any given PW NSP is out of scope for this document.
An entropy label is added at the ingress to an LSP. The payload being encapsulated is most often MPLS, a PW, or IP. The payload type is identified by the layer-2 encapsulation (Ethernet, GFP, POS, etc).
If the payload is MPLS, then the information used to create an entropy label is the same information used for local load balancing (see Section 2.4.5.1). This information MUST be extracted for use in generating an entropy label even if the LSR local egress interface is not a multipath.
Of the non-MPLS payload types, only payloads that are forwarded are of interest. For example, ARP is not forwarded and CLNP (used only for IS-IS) is not forwarded.
The non-MPLS payload types of greatest interest are IPv4 and IPv6. The guidelines in Section 2.4.5.2 apply to fields used to create an entropy label.
The IP tunneling protocols mentioned in Section 2.4.5.2 may be more applicable to generation of an entropy label at edge or access where deep packet inspection is practical due to lower interface speeds than in the core where deep packet inspection may be impractical.
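Once the fields have been selected, turning them into an entropy label is a small step. The sketch below is one possible approach under stated assumptions: CRC-32 is used only because it is readily available and deterministic (this document does not prescribe a hash function), the function name is hypothetical, and the result is mapped away from label values 0-15, which [RFC6790] does not permit as entropy label values.

```python
import zlib

LABEL_SPACE = 1 << 20       # the label field is 20 bits wide
FIRST_USABLE_LABEL = 16     # values 0-15 are not valid entropy labels


def entropy_label(flow_key: bytes) -> int:
    """Map the bytes identifying a flow to a 20-bit entropy label value.

    flow_key is whatever the ingress extracted, for example concatenated
    IP addresses and ports, or the label values of an MPLS payload.
    """
    digest = zlib.crc32(flow_key)
    # Fold into the usable range so the result is never below 16.
    return FIRST_USABLE_LABEL + digest % (LABEL_SPACE - FIRST_USABLE_LABEL)
```

The same flow key always yields the same label, so packets of a flow stay in order, while different flows are spread across the label space for midpoint LSRs to hash on.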
MPLS-TP introduces forwarding demands that will be extremely difficult to meet in a core network. Most troublesome is the requirement for Ultimate Hop Popping (UHP, the opposite of Penultimate Hop Popping or PHP). Using UHP opens the possibility of one or more MPLS pop operations plus an MPLS swap operation for each packet. The potential for multiple lookups and multiple counter instances per packet exists.
As networks grow and tunneling of LDP LSPs into RSVP-TE LSPs is used, and/or RSVP-TE hierarchy is used, the requirement to perform one or two or more MPLS pop operations plus an MPLS swap operation (and possibly a push or two) increases. If MPLS-TP LM (loss measurement) OAM is enabled at each layer, then a packet and byte count MUST be maintained for each pop and swap operation so as to offer OAM for each layer.
There are a number of situations in which packets are destined to a local address or where a return packet must be generated. There is a need to mitigate the potential for outage as a result of either attacks on network infrastructure, or in some cases unintentional misconfiguration resulting in processor overload. Some hardware assistance is needed for all traffic destined to the general purpose CPU that is used in MPLS control protocol processing or network management protocol processing and in most cases to other general purpose CPUs residing on an LSR. This is due to the ease of overwhelming such a processor with traffic arriving on LSR high speed interfaces, whether the traffic is malicious or not.
Denial of service (DoS) protection is an area requiring hardware support that is often overlooked or inadequately considered. Hardware assist is also needed for OAM, particularly the more demanding MPLS-TP OAM.
Modern equipment supports a number of control plane and management plane protocols. Generally no single means of protecting network equipment from denial of service (DoS) attacks is sufficient, particularly for high speed interfaces. This problem is not specific to MPLS, but is a topic that cannot be ignored when implementing or evaluating MPLS implementations.
Two types of protections are often cited as primary means of protecting against attacks of all kinds.
Some control and management protocols are often carried with payload traffic. This is commonly the case with BGP, T-LDP, and SNMP. It is often the case with RSVP-TE. Even when carried over G-ACh/GAL additional measures can reduce the potential for a minor breach to be leveraged to a full network attack.
Some of the additional protections are supported by hardware packet filtering.
The cryptographic authentication is generally the last resort in DoS attack mitigation. If a packet must be first sent to a general purpose CPU, then sent to a cryptographic engine, a DoS attack is possible on high speed interfaces. Only where hardware can identify a signature and the portion of packet covered by the signature is cryptographic authentication highly beneficial in protecting against DoS attacks.
For chips supporting multiple 100 Gb/s interfaces, only a very large number of parallel cryptographic engines can provide the processing capacity to handle a large scale DoS or distributed DoS (DDoS) attack. For many forwarding chips this much processing power requires significant chip real estate and power, and therefore reduces system space and power density. For this reason, cryptographic authentication is not considered a viable first line of defense.
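The scale of the problem can be made concrete with a short calculation. This is a rough sketch under stated assumptions: minimum-size 64-byte Ethernet frames with 20 bytes of preamble and inter-frame gap per frame, and a hypothetical per-engine authentication rate that does not come from this document or from any particular product.

```python
import math

MIN_FRAME_BYTES = 64     # minimum Ethernet frame
PER_FRAME_OVERHEAD = 20  # preamble (8 bytes) plus inter-frame gap (12 bytes)


def max_packets_per_second(link_bps: float) -> float:
    """Worst-case packet rate of one link carrying minimum-size frames."""
    return link_bps / ((MIN_FRAME_BYTES + PER_FRAME_OVERHEAD) * 8)


def engines_needed(attack_pps: float, per_engine_pps: float) -> int:
    """Parallel engines required to authenticate every arriving packet."""
    return math.ceil(attack_pps / per_engine_pps)


if __name__ == "__main__":
    pps = max_packets_per_second(100e9)
    print(f"one 100 Gb/s link: {pps / 1e6:.1f} Mpps")
    # 5 Mpps per engine is a purely hypothetical figure for illustration.
    print("engines for four such links:", engines_needed(4 * pps, 5e6))
```

A single 100 Gb/s link works out to roughly 150 Mpps of minimum-size packets, so a chip serving several such links would need to authenticate several hundred million packets per second in the worst case.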
For some networks the first line of defense is some means of supporting OOB control and management traffic. In the past this OOB channel might make use of overhead bits in SONET or OTN or a dedicated DWDM wavelength. G-ACh and GAL provide an alternative OOB mechanism which is independent of underlying layers. In other networks, including most IP/MPLS networks, perimeter filtering serves a similar purpose, though less effective without extreme vigilance.
A second line of defense is filtering, including GTSM. For protocols such as EBGP, GTSM and other filtering is often the first line of defense. Cryptographic authentication is usually the last line of defense and insufficient by itself to mitigate DoS or DDoS attacks.
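The GTSM check mentioned above is simple enough to apply in a hardware filter ahead of any general purpose CPU. The following is a simplified sketch, assuming the sender sets the TTL (or Hop Limit) to 255 as GTSM (RFC 5082) describes. The function name and the `max_hops` parameter are illustrative assumptions.

```python
MAX_TTL = 255


def gtsm_accept(received_ttl: int, max_hops: int = 1) -> bool:
    """Accept a packet only if its TTL shows it crossed at most max_hops.

    With max_hops=1 (a directly connected peer), only TTL 255 passes, so a
    packet injected from farther away is dropped before reaching the CPU.
    """
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")
    return received_ttl >= MAX_TTL - (max_hops - 1)
```

Because the comparison involves a single header field, it costs far less than cryptographic validation, which is why it can sit earlier in the chain of defenses.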
[RFC4377] defines requirements for MPLS OAM that predate MPLS-TP. [RFC4379] defines what is commonly referred to as LSP Ping and LSP Traceroute. [RFC4379] is updated by [RFC6424] supporting MPLS tunnels and stitched LSP and P2MP LSP. [RFC4379] is updated by [RFC6425] supporting P2MP LSP. [RFC4379] is updated by [RFC6426] to support MPLS-TP connectivity verification (CV) and route tracing.
[RFC4950] extends the ICMP format to support TTL expiration that may occur when using IP traceroute within an MPLS tunnel. The ICMP message generation can be implemented in forwarding hardware, but if sent to a general purpose CPU must be rate limited to avoid a potential denial of service (DoS) attack.
[RFC5880] defines Bidirectional Forwarding Detection (BFD), a protocol intended to detect faults in the bidirectional path between two forwarding engines. [RFC5884] and [RFC5885] define BFD for MPLS. BFD can provide failure detection on any kind of path between systems, including direct physical links, virtual circuits, tunnels, MPLS Label Switched Paths (LSPs), multihop routed paths, and unidirectional links as long as there is some return path.
The processing requirements for BFD are less than for LSP Ping, making BFD somewhat better suited for relatively high rate proactive monitoring. BFD does not verify that the data plane matches the control plane, where LSP Ping does. LSP Ping is somewhat better suited for on-demand monitoring including relatively low rate periodic verification of data plane and as a diagnostic tool.
Hardware assistance is often provided for BFD response where BFD setup or parameter change is not involved and may be necessary for relatively high rate proactive monitoring. If both BFD and LSP Ping are recognized in filtering prior to passing traffic to a general purpose CPU, appropriate DoS protection can be applied (see Section 2.6.1). Failure to recognize BFD and LSP Ping and at least rate limit creates the potential for misconfiguration to cause outages rather than cause errors in the misconfigured OAM.
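Recognizing BFD and LSP Ping and rate limiting each separately can be sketched with a per-class token bucket. This is an illustrative software model of what would normally be a hardware policer. The class names, rates, and burst sizes are hypothetical. The UDP destination ports used for classification (3784 and 4784 for BFD, 3503 for LSP Ping) are the registered ports for those protocols.

```python
BFD_PORTS = {3784, 4784}   # BFD single-hop and multihop control
LSP_PING_PORT = 3503       # MPLS LSP Ping echo


class TokenBucket:
    """Token bucket policer; time is in seconds from an arbitrary origin."""

    def __init__(self, rate_pps: float, burst: float):
        self.rate, self.burst = rate_pps, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def classify(udp_dport: int) -> str:
    """Map a UDP destination port to a punt class."""
    if udp_dport in BFD_PORTS:
        return "bfd"
    if udp_dport == LSP_PING_PORT:
        return "lsp_ping"
    return "other"


class PuntPolicer:
    """One bucket per class so a flood in one class cannot starve another."""

    def __init__(self, limits):
        # limits maps class name -> (rate_pps, burst); values are hypothetical.
        self.buckets = {name: TokenBucket(*cfg) for name, cfg in limits.items()}

    def admit(self, udp_dport: int, now: float) -> bool:
        return self.buckets[classify(udp_dport)].allow(now)
```

With separate buckets, a misconfigured peer sending LSP Ping at an excessive rate exhausts only the LSP Ping allowance. BFD sessions and other control traffic continue to reach the CPU, so the misconfiguration shows up as OAM errors rather than an outage.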
Pseudowire OAM makes use of the control channel provided by Virtual Circuit Connectivity Verification (VCCV) [RFC5085]. VCCV makes use of the Pseudowire Control Word. BFD support over VCCV is defined by [RFC5885]. [RFC5885] is updated by [RFC6478] in support of static pseudowires. [RFC4379] is updated by [RFC6829] supporting LSP Ping for Pseudowire FEC advertised over IPv6.
G-ACh/GAL (defined in [RFC5586]) is the preferred MPLS-TP OAM control channel and applies to any MPLS-TP end points, including Pseudowire. See Section 2.6.4 for an overview of MPLS-TP OAM.
[RFC6669] summarizes the MPLS-TP OAM toolset, the set of protocols supporting the MPLS-TP OAM requirements specified in [RFC5860] and supported by the MPLS-TP OAM framework defined in [RFC6371].
The MPLS-TP OAM toolset includes:
See Section 2.6.2 for discussion of hardware support necessary for BFD and LSP Ping.
CC-CV and alarm reporting are tied to protection and therefore SHOULD be supported in forwarding hardware in order to provide protection for a large number of affected LSP within target response intervals. Since CC-CV is supported by BFD, for MPLS-TP providing hardware assistance for BFD processing helps ensure that protection recovery time requirements can be met even for faults affecting a large number of LSP.
MPLS-TP Protection State Coordination (PSC) is defined by [RFC6378] and updated by [I-D.ietf-mpls-psc-updates], correcting some errors in [RFC6378].
[RFC6670] provides the reasons for selecting a single MPLS-TP OAM solution and examines the consequences were ITU-T to develop a second OAM solution that is based on Ethernet encodings and mechanisms.
[RFC6310] and [RFC7023] specify the mapping of defect states between many types of hardware Attachment Circuits (ACs) and associated Pseudowires (PWs). This functionality SHOULD be supported in forwarding hardware.
It is beneficial if an MPLS OAM implementation can interwork with the underlying server layer and provide a means to interwork with a client layer. For example, [RFC6427] specifies an inter-layer propagation of AIS and LDI from MPLS server layer to client MPLS layers. Where the server layer is a Layer 2 technology, such as Ethernet, PPP over SONET/SDH, or GFP over OTN, interworking among layers is also beneficial. For high speed interfaces, supporting this interworking in forwarding hardware helps ensure that protection based on this interworking can meet recovery time requirements even for faults affecting a large number of LSP.
Where certain requirements must be met, such as relatively high CC-CV rates and a large number of interfaces, or strict protection recovery time requirements and a moderate number of affected LSP, some OAM functionality must be supported by forwarding hardware. In other cases, such as highly accurate LM and DM OAM or strict protection recovery time requirements with a large number of affected LSP, OAM functionality must be entirely implemented in forwarding hardware.
Where possible, implementation in forwarding hardware should be in programmable hardware such that if standards are later changed or extended these changes are likely to be accommodated with hardware reprogramming rather than replacement.
For some functionality there is a strong case for an implementation in dedicated forwarding hardware. Examples include packet and byte counters needed for LM OAM as well as needed for management protocols. Similarly the capture and insertion of packet and byte counts or timestamps needed for transmitted LM or DM or time synchronization packets MUST be implemented in forwarding hardware if high accuracy is required.
For some functions there is a strong case for providing limited support in forwarding hardware, while an external general purpose processor may be used if performance criteria can be met. For example, origination of RDI triggered by CC-CV, response to RDI, and Protection State Coordination (PSC) functionality may be supported by hardware, but expansion to a large number of client LSP and transmission of AIS or RDI to the client LSP may occur in a general purpose processor. Some forwarding hardware supports one or more on-chip general purpose processors which may be well suited for such a role. [I-D.ietf-mpls-psc-updates], being a very recent document that affects a protection state machine that requires hardware support, underscores the importance of having a degree of programmability in forwarding hardware.
The customer (system supplier or provider) should not dictate design, but should independently validate target functionality and performance. However, it is not uncommon for service providers and system implementers to insist on reviewing design details (under NDA) due to past experiences with suppliers and to reject suppliers who are unwilling to provide details.
Service provider networks may carry up to hundreds of millions of flows on 10 Gb/s links. Most flows are very short lived, many under a second. A subset of the flows are low capacity and somewhat long lived. When Internet traffic dominates capacity a very small subset of flows are high capacity and/or very long lived.
Two types of limitations with regard to number and size of flows have been observed.
The following questions should be asked of a supplier. These questions are grouped into broad categories. The questions themselves are intended to be open-ended questions to the supplier. The tests in Section 4 are intended to verify whether the supplier disclosed any compliance or performance limitations completely and accurately.
See Section 2.1.3.
See Section 2.1.5.
See Section 2.1.6 regarding MPLS hierarchy. See [RFC3443] regarding PHP, UHP, and pipe, short-pipe, and uniform models.
See Section 2.2.
Specific circumstances (such as specific features enabled or specific types of packet processing) often impact these rates and burst sizes.
Multipath capabilities and performance do not apply to MPLS-TP but apply to MPLS and apply if MPLS-TP is carried in MPLS.
See Section 2.6.2.
Measuring the packet rate performance of equipment supporting a large number of 10 Gb/s or 100 Gb/s links is not possible using desktop computers or workstations. The use of high end workstations as a source of test traffic was barely viable 20 years ago, but is no longer at all viable. Though custom microcode has been used on specialized router forwarding cards to serve the purpose of generating test traffic and measuring it, for the most part performance testing will require specialized test equipment. There are multiple sources of suitable equipment.
The set of tests listed here does not correspond one-to-one to the set of questions in Section 3. The same categorization is used, and these tests largely serve to validate answers provided to the prior questions. They can also provide answers where a supplier is unwilling to disclose compliance or performance.
Performance testing is the domain of the IETF Benchmarking Methodology Working Group (BMWG). Below are brief descriptions of conformance and performance tests. Some very basic tests are specified in [RFC5695] which partially cover only the basic performance test T#3.
The following tests should be performed by the systems designer, or deployer, or performed by the supplier on their behalf if it is not practical for the potential customer to perform the tests directly. These tests are grouped into broad categories.
The tests in Section 4.1 should be repeated under various conditions to retest basic performance when critical capabilities are enabled. Complete repetition of the performance tests enabling each capability and combinations of capabilities would be very time intensive, therefore a reduced set of performance tests can be used to gauge the impact of enabling specific capabilities.
Multipath capabilities do not apply to MPLS-TP but apply to MPLS and apply if MPLS-TP is carried in MPLS.
See Section 2.6.1.
Numerous very useful comments have been received in private email. Some of these contributions are acknowledged here, approximately in chronological order.
Paul Doolan provided a brief review resulting in a number of clarifications, most notably regarding on-chip vs. system buffering, 100 Gb/s link speed assumptions in the 150 Mpps figure, and handling of large microflows. Pablo Frank reminded us of the sawtooth effect in PPS vs. packet size graphs, prompting the addition of a few paragraphs on this. Comments from Lou Berger at IETF-85 prompted the addition of Section 2.7.
Valuable comments were received on the BMWG mailing list. Jay Karthik pointed out testing methodology hints that after discussion were deemed out of scope and were removed but may benefit later work in BMWG.
Nabil Bitar pointed out the need to cover QoS (Differentiated Services), MPLS multicast (P2MP and MP2MP), and MPLS-TP OAM. Nabil also provided a number of clarifications to the questions and tests in Section 3 and Section 4.
Mark Szczesniak provided a thorough review and a number of useful comments and suggestions that improved the document.
Gregory Mirsky and Thomas Beckhaus provided useful comments during the MPLS RT review.
Tal Mizrahi provided comments that prompted clarifications regarding timestamp processing, local delivery of packets, and the need for hardware assistance in processing OAM traffic.
Alexander (Sasha) Vainshtein pointed out errors in Section 2.1.8.1 and suggested new text which after lengthy discussion resulted in restating the summarization of requirements from PWE3 RFCs and more clearly stating the benefits and drawbacks of packet resequencing based on PW sequence number.
Loa Andersson provided useful comments and corrections prior to WGLC. Adrian Farrel provided useful comments and corrections as part of the AD review.
Discussion with Steve Kent during SecDir review resulted in expansion of Section 7, briefly summarizing security considerations related to forwarding in normative references. Tom Petch pointed out some editorial errors in private email. Al Morton during OpsDir review prompted clarification in the target audience section, suggested more clear wording in places, and found numerous editorial errors.
This memo includes no request to IANA.
This document reviews forwarding behavior specified elsewhere and points out compliance and performance requirements. As such it introduces no new security requirements or concerns.
Discussion of hardware support and other equipment hardening against DoS attack can be found in Section 2.6.1. Section 3.6 provides a list of questions regarding DoS to be asked of suppliers. Section 4.6 suggests types of testing that can provide some assurance of the effectiveness of supplier DoS hardening claims.
Knowledge of potential performance shortcomings may serve to help new implementations avoid pitfalls. It is unlikely that such knowledge could be the basis of new denial of service attacks, as these pitfalls are already widely known in the service provider community and among leading equipment suppliers. In practice, extreme data and packet rates are needed to affect existing equipment and to affect networks that may still be vulnerable due to failure to implement adequate protection. The extreme data and packet rates make this type of denial of service unlikely and make undetectable denial of service of this type impossible.
The set of normative references each contain security considerations. A brief summarization of MPLS security considerations applicable to forwarding follows:
MPLS security including data plane security is discussed in greater detail in [RFC5920] (MPLS/GMPLS Security Framework). The MPLS-TP security framework [RFC6941] builds upon this, focusing largely on the MPLS-TP OAM additions and OAM channels with some attention given to using network management in place of control plane setup. In both security framework documents MPLS is assumed to run within a "trusted zone", defined as being where a single service provider (SP) has total operational control over that part of the network.
If control plane security and management plane security are sufficiently robust, compromise of a single network element may result in chaos in the data plane anywhere in the network through denial of service attacks, but not a Byzantine security failure in which other network elements are fully compromised.
MPLS security, or lack thereof, can affect whether traffic can be misrouted and lost, or intercepted, or intercepted and reinserted (a man-in-the-middle attack), or spoofed. End user applications, including control plane and management plane protocols used by the SP, are expected to make use of appropriate end-to-end authentication and, where appropriate, end-to-end encryption.
The References section is split into Normative and Informative subsections. References that directly specify forwarding encapsulations or behaviors are listed as normative. References which describe signaling only, though normative with respect to signaling, are listed as informative. They are informative with respect to MPLS forwarding.