Network Working Group                                            H. Chen
Internet-Draft                                              China Telecom
Intended status: Informational                                      Z. Li
Expires: September 12, 2019                                  China Mobile
                                                                    F. Xu
                                                                  Tencent
                                                                    Y. Gu
                                                                    Z. Li
                                                                   Huawei
                                                           March 11, 2019
Network-wide Protocol Monitoring (NPM): Use Cases
draft-chen-npm-use-cases-00
As networks continue to scale, a coordinated effort is needed for diagnosing control plane health issues in heterogeneous environments. Traditionally, operators have developed internal solutions for identifying and remediating control plane health issues, but as networks grow in size, speed, and dynamicity, new methods and techniques will be required.

This document highlights key network health issues, as well as network planning requirements, identified by leading network operators. It also provides an overview of the current state of the art and the techniques in use, and highlights key deficiencies and areas for improvement.

This document proposes a unified management framework for coordinating the diagnosis of control plane problems and the optimization of network design. Furthermore, it outlines requirements for collecting, storing, and analyzing control plane data, both to minimize or prevent control plane problems that may significantly affect overall network performance and to optimize path/peering/policy planning to meet application-specific demands.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 12, 2019.
Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Recently, significant effort has been devoted to evolving the control of network resources, using management plane enhancements and control of network state via centralized and distributed control plane methods. There is also ongoing work on diagnosing forwarding plane performance degradation, using telemetry-based solutions and in-band data plane OAM. However, less emphasis has been placed on diagnosing and remediating health problems related to the optimal control of network resources, and on diagnosing control plane health issues.
The document outlines the existing set of standards-based tools and highlights the lack of capability for addressing control plane monitoring.
The concept of network telemetry has been proposed to meet current and future OAM demands, supporting real-time data collection, processing, export, and analysis; an architectural framework of existing telemetry approaches is introduced in [I-D.song-ntf]. Network telemetry provides visibility into network health conditions, and is beneficial for faster network troubleshooting, network OpEx (operating expenditure) reduction, and network optimization. Telemetry can be applied to the data plane, control plane, and management plane, and various methods have been proposed for each plane, such as BMP and BGP-LS for the control plane, gRPC- and Netconf-based telemetry for the management plane, and IPFIX and in-band OAM for the data plane.
The above-mentioned telemetry approaches vary in data type and form, including encapsulation, serialization, transport, subscription, and data analysis, resulting in different applications. With network operations and maintenance evolving towards automation and intent-driven networking, higher requirements are set for each plane. A healthy management plane and control plane are essential for high-quality data service provisioning, and visibility into the health of the management and control planes provides insight into changes in the data plane.
First, control protocols provide and guarantee network connectivity and reachability, the foundation of any data service running above them. Monitoring the control plane detects health issues in real time so that immediate troubleshooting actions can be taken, mitigating the effect on data services as much as possible.
Second, without route analytics, the dynamic nature of IP networking makes it virtually impossible to know at any point in time how traffic is traversing the network. For example, by collecting real-time BGP routes through BMP and correlating them with traffic data retrieved through data plane telemetry, the operator is able to perform both inter-domain and intra-domain traffic optimization.
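To make the correlation concrete, the following Python sketch joins already-decoded BMP route-monitoring entries with IPFIX flow records via longest-prefix matching, yielding per-prefix traffic volumes as input for traffic optimization. The field names and sample values are illustrative assumptions, not taken from any protocol specification.

   import ipaddress

   # Hypothetical, pre-decoded inputs: BMP route-monitoring entries
   # and IPFIX flow records.  Field names/values are illustrative.
   bmp_routes = [
       {"prefix": "203.0.113.0/24", "next_hop": "198.51.100.1"},
       {"prefix": "198.51.100.0/22", "next_hop": "198.51.100.9"},
   ]
   ipfix_flows = [
       {"dst_addr": "203.0.113.7", "bytes": 1200000},
       {"dst_addr": "198.51.100.77", "bytes": 300000},
   ]

   def longest_match(dst, routes):
       """Return the most specific BGP route covering dst, or None."""
       addr = ipaddress.ip_address(dst)
       best = None
       for r in routes:
           net = ipaddress.ip_network(r["prefix"])
           if addr in net and (best is None or
                   net.prefixlen >
                   ipaddress.ip_network(best["prefix"]).prefixlen):
               best = r
       return best

   # Aggregate traffic per BGP prefix: traffic data joined with
   # routing state as input for traffic optimization.
   volume = {}
   for flow in ipfix_flows:
       route = longest_match(flow["dst_addr"], bmp_routes)
       if route:
           volume[route["prefix"]] = (volume.get(route["prefix"], 0)
                                      + flow["bytes"])
   print(volume)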
Finally, the validation and evaluation of route policies is another common appeal from both carriers and OTTs. The difficulty lies mainly in precisely defining the correctness of a policy. In other words, policy validation depends largely on the operator's understanding and manual judgement of the current network status, rather than on formatted, quantitative commands executed at devices. It therefore demands visualized presentation, through control plane telemetry, of how policies impact route changes, so that operators can judge policy correctness directly. The conventional separate collection of route policy and route information is not sufficient for validating route policy correctness.
Based on discussions with leading operators, this document identifies the challenges and problems that current control plane telemetry faces and suggests data collection requirements. The necessity for a Network-wide Protocol Monitoring (NPM) framework is illustrated through the discussion of specific use cases.
IGP: Interior Gateway Protocol
IS-IS: Intermediate System to Intermediate System
BGP: Border Gateway Protocol
BGP-LS: Border Gateway Protocol-Link State
MPLS: Multi-Protocol Label Switching
RSVP-TE: Resource Reservation Protocol-Traffic Engineering
LDP: Label Distribution Protocol
NPM: Network-wide Protocol Monitoring
NPMS: Network Protocol Monitoring System
BMP: BGP Monitoring Protocol
LSP: Link State PDU
SDN: Software Defined Network
IPFIX: Internet Protocol Flow Information Export
According to Huawei's 2016 network issue statistics, about 48% of all issues are routing-protocol related, including protocol adjacency/peer setup failures, adjacency/peer flapping, and protocol-related table errors. Moreover, routing protocol issues are not standalone: they are accompanied by anomalous data plane state, and are ultimately reflected in poor service quality and user experience.
Existing methods for protocol troubleshooting include CLI, SNMP, Netconf/YANG, gRPC/YANG, and vendor-specific or third-party tools.
Using the CLI to perform per-device checks provides adequate per-device information but lacks a network-wide view, leading either to massive labor and time spent checking all devices or to failure to localize the source. Moreover, complex CLI usage (command combinations and repetition patterns) requires an experienced NOC operator.
Management protocols, such as SNMP and Netconf/gRPC, provide information already gathered, or to be gathered, from the network, which reduces operational complexity but sacrifices data adequacy compared with the CLI. Since these protocols are not designed specifically for routing troubleshooting, not all the required data sources are currently supported for export, and the lack of certain data becomes the troubleshooting bottleneck. For example, in a case of abnormal LSP purges caused by continuously corrupted LSPs, it is useful to collect the corrupted LSP PDUs for root cause analysis. In addition, for data sources that are currently supported, as well as those to be supported, data synchronization issues, due to differences in export performance among the various approaches, can be a concern for data correlation. The data collection requirements depend largely on the use cases; more details are discussed in Section 5.
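The synchronization concern can be illustrated with a minimal Python sketch: records arriving over different export channels are bucketed by exporter timestamp into a common correlation window, rather than being compared by arrival order. The record format, channel names, and window size are illustrative assumptions.

   from collections import defaultdict

   # Hypothetical telemetry records from different export channels,
   # each carrying the exporter's timestamp in seconds.  Export delay
   # differs per channel, so records are bucketed into a shared time
   # window before correlation instead of compared on arrival order.
   records = [
       {"source": "bmp",  "ts": 100.32, "event": "peer_down"},
       {"source": "snmp", "ts": 101.90, "event": "if_down"},
       {"source": "grpc", "ts": 100.75, "event": "fib_change"},
   ]

   WINDOW = 2.0  # correlation window in seconds (assumed tolerance)

   buckets = defaultdict(list)
   for rec in records:
       buckets[int(rec["ts"] // WINDOW)].append(rec)

   for bucket, recs in sorted(buckets.items()):
       if len({r["source"] for r in recs}) > 1:
           print(f"window {bucket}: correlated events {recs}")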
Some third-party OAM tools provide troubleshooting-customized information collection and analysis. For example, Packet Design uses passive listening to collect IS-IS/OSPF/BGP messages and performs route analysis for troubleshooting and path optimization. Such passive listening, however, lacks per-device information collection. For example, detecting a route loop and analyzing its root cause requires not only network-wide RIB/FIB collection, but also the route policy information responsible for generating the loop.
To summarize, the current protocols and tools do not provide sufficient data sources for routing troubleshooting. New methods, or augmentations of existing methods, are required to enhance control plane data collection and support more efficient data correlation.
The dynamic nature of IP networks, e.g., peers going up and down, prefix advertisement, and route changes, has great influence on service provisioning. With the emergence of new network services, such as automated driving systems and AR (Augmented Reality), network planning faces new requirements in order to meet latency, bandwidth, and security demands. The requirements generally break down into two parts: 1. sufficient and up-to-date routing data collection as the input for network simulation; 2. accurate what-if simulation to evaluate new network planning actions.
Most existing control plane and data plane simulation tools, e.g., Batfish, use device configurations to generate a control/data plane. There are some concerns with such a simulation method: 1. in a multi-vendor network, understanding and translating the configuration files is a non-trivial task for the simulator; 2. the generated control/data plane is not a 100% mirror of the actual network, resulting in less accurate simulation results. Real-time routing data collection from the running network is therefore required. Currently, BGP routes and peering states are monitored in real time using BMP. However, IS-IS/OSPF/MPLS routing data still lacks legitimate and comprehensive monitoring. Here, not only the data coverage, including RIB/FIB, network topology, peering states, and so on, but also the data synchronization across devices should be considered in order to recover a faithful data/control plane within the simulator.
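As a minimal sketch of the real-time collection this requires, the following Python fragment maintains a RIB mirror from a stream of already-parsed BMP route-monitoring messages, so that a simulator can be seeded with live routing state rather than configurations. The message fields are illustrative assumptions, not the BMP wire format.

   # Maintain a per-peer RIB mirror from a stream of already-parsed
   # BMP route-monitoring messages, so a simulator can be seeded with
   # live routing state instead of configurations.
   rib = {}  # (peer, prefix) -> path attributes

   def apply(msg):
       key = (msg["peer"], msg["prefix"])
       if msg["type"] == "announce":
           rib[key] = msg["attrs"]   # add, or implicit replace
       elif msg["type"] == "withdraw":
           rib.pop(key, None)        # remove if present

   stream = [
       {"type": "announce", "peer": "10.0.0.1",
        "prefix": "203.0.113.0/24",
        "attrs": {"as_path": [64500], "next_hop": "10.0.0.1"}},
       {"type": "announce", "peer": "10.0.0.2",
        "prefix": "203.0.113.0/24",
        "attrs": {"as_path": [64501, 64500], "next_hop": "10.0.0.2"}},
       {"type": "withdraw", "peer": "10.0.0.2",
        "prefix": "203.0.113.0/24"},
   ]
   for msg in stream:
       apply(msg)
   print(rib)  # the mirror handed to the what-if simulator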
Given the above-mentioned challenges facing control plane telemetry, it is of great value to identify the requirements from typical use cases, and the gaps between those requirements and existing methods. It is thus necessary to propose a comprehensive control plane telemetry framework, as shown in Figure 1.
                        +------------+
              +-------->+ NPM Server +<--------+
              |         +-----+------+         |
              |               ^                |
              |               |                |
              |      BMP,gRPC,Netconf,         |
              |      BGP-LS,new protocol?:     |
              |      topology,protocol PDU,    |
              |      RIB,route policy,         |
              |      statistics...             |
         *****|***************|****************|*****
         *    |            +--+--+             | AS0*
         *    |   +...+--->+ R 3 +<---+...+    |    *
         *    |   |        +-----+        |    |    *
         *    |   |                       |    |    *
         *    |   | ISIS/OSPF/BGP/        |    |    *
         *    |   | MPLS/SR...            |    |    *
         *    |   |        ISIS/OSPF/BGP/ |    |    *
         *    |   |        MPLS/SR...     |    |    *
         *    v   v                       v    v    *
 +-----+ *  +-----+                     +-----+     * +-----+
 | AS1 +<-->+ R 1 +<--------+...+-------->+ R 2 +<--->+ AS2 |
 +-----+ *  +-----+                     +-----+     * +-----+
BGP/MPLS *                                          * BGP/MPLS
/SR      *                                          * /SR
         ********************************************

                  Figure 1: NPM framework
More specific requirements may vary case by case, but guaranteeing a valid transport tunnel and adequate data collection is a common appeal.
        Data Source:                     NPM problem space:
        Topology, protocol PDU,          sufficient data type coverage,
        RIB, route policy,               sufficient device coverage
        statistics...
             +--------+----------+
                      |
                      v
        +-------------+-------------+
        | Data Generation:          |   NPM problem space:
        | data encapsulation,       |   data model definition,
        | data serialization,       |   data process efficiency
        | data subscription         |
        +-------------+-------------+
                      |
                      v
        +-------------+-------------+
        | Data Transportation:      |   NPM problem space:
        | BMP, gRPC, Netconf,       |   transportation protocol
        | BGP-LS, new protocol?     |   selection,
        |                           |   exportation efficiency
        +-------------+-------------+
                      |
                      v
        +-------------+-------------+
        | Data Analysis:            |   NPM problem space:
        | Protocol troubleshooting, |   data synchronization,
        | Policy validation,        |   data parse efficiency
        | Traffic optimization,     |
        | What-if simulation        |
        +---------------------------+

                  Figure 2: NPM problem space
We have identified several typical routing issues that occur frequently in networks and are typically hard to localize.
IS-IS route flapping refers to the situation in which one or more routes repeatedly appear in and then disappear from the routing table. Route flapping usually comes with massive PDU interactions (e.g., LSPs, LSP purges...), which consume excessive network bandwidth and CPU processing, and its impact is often network-wide. Localizing the flapping source and identifying the root cause has not been easy work, for a variety of reasons.
Flapping can be caused by a system ID conflict, IS-IS neighborship flapping, route source flapping (caused by import route policy misconfiguration), device clock malfunction with abnormal LSP purges (e.g., 100 times faster than normal), and so on.
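A minimal flap detector over such route events can be sketched in Python as follows: a route is flagged as flapping when its state changes more than a threshold number of times within a sliding time window. The threshold, window size, and event feed are illustrative assumptions.

   from collections import defaultdict, deque

   # Minimal flap detector: a route is flagged as flapping when its
   # state changes more than THRESHOLD times within WINDOW seconds.
   # The threshold, window, and event feed are illustrative.
   WINDOW = 60.0
   THRESHOLD = 4

   history = defaultdict(deque)  # prefix -> timestamps of changes

   def on_route_event(prefix, ts):
       q = history[prefix]
       q.append(ts)
       while q and ts - q[0] > WINDOW:
           q.popleft()  # drop events that fell out of the window
       if len(q) > THRESHOLD:
           print(f"{prefix} flapping: {len(q)} changes in {WINDOW}s")

   # Simulated feed: the same route appearing/disappearing repeatedly.
   for ts in (0, 5, 12, 20, 26, 31):
       on_route_event("203.0.113.0/24", ts)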
During IS-IS flooding, LSP synchronization failures sometimes happen. The causes of synchronization failure can generally be classified into three cases:
With sufficient IS-IS PDU-related statistics and parsed PDU information recorded at the device, the neighborship failure in Case 2 can typically be diagnosed at Router A or Router B independently. With such diagnostic information collected in real time (e.g., in the form of a reason code), the server can identify the root synchronization issue with much less time and labor than conventional methods. In Cases 1 and 3, the failure is mostly caused by an incorrect route policy or a software/hardware issue. By comparing the LSDB with the sent/received LSPs, differences can be recognized, and these differences may further guide the localization of the root cause. Thus, by collecting the LSDBs and sent/received LSPs from the two affected neighbors, the server gains more insight into the synchronization failure.
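The LSDB comparison described above can be sketched in Python as follows, assuming each neighbor's LSDB snapshot has been collected as a mapping from LSP ID to sequence number (the IDs and sequence numbers below are made up for illustration):

   # Compare the LSDB snapshots collected from two IS-IS neighbors,
   # each a mapping of LSP ID -> sequence number, and report LSPs
   # that are missing or stale on either side.
   def lsdb_diff(lsdb_a, lsdb_b):
       issues = []
       for lsp_id in sorted(lsdb_a.keys() | lsdb_b.keys()):
           seq_a, seq_b = lsdb_a.get(lsp_id), lsdb_b.get(lsp_id)
           if seq_a is None:
               issues.append((lsp_id, "missing on A"))
           elif seq_b is None:
               issues.append((lsp_id, "missing on B"))
           elif seq_a != seq_b:
               issues.append((lsp_id,
                              f"seq mismatch A={seq_a} B={seq_b}"))
       return issues

   lsdb_a = {"1921.6800.1001.00-00": 21, "1921.6800.1002.00-00": 42}
   lsdb_b = {"1921.6800.1001.00-00": 21, "1921.6800.1002.00-00": 41,
             "1921.6800.1003.00-00": 7}
   print(lsdb_diff(lsdb_a, lsdb_b))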
An incorrect import policy, such as an incorrect protocol priority (distance) or an improper default route configuration, may result in a route loop. A TTL anomaly report or a packet loss complaint triggers a loop alarm. However, locating the exact device(s), and more importantly the responsible configuration/policy, is non-trivial work. The generation of the routing information base/forwarding information base (RIB/FIB) involves various protocols and massive route policies, which often makes it hard to locate the loop source in a timely manner.
If network-wide RIB/FIB data can be collected in real time, the server is able to run loop detection algorithms to detect and locate the loop. More importantly, with the real-time RIB/FIB collected as input for a network simulator, loops can be predicted through what-if simulations of network changes, such as a new policy or a link failure.
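As an illustration, a simple loop detection pass over the collected FIBs can follow per-prefix next hops from each node and flag a revisited node; the FIB representation below is an illustrative assumption, not a defined data model.

   # Minimal loop detection over collected FIBs: fib[node][prefix]
   # gives the next-hop node for that prefix.  Follow next hops from
   # a starting node; revisiting a node indicates a loop.
   def find_loop(fib, prefix, start):
       seen = []
       node = start
       while node in fib and prefix in fib[node]:
           if node in seen:
               return seen[seen.index(node):]  # the looping segment
           seen.append(node)
           node = fib[node][prefix]
       return None

   # Illustrative network-wide FIB with a loop between R2 and R3.
   fib = {
       "R1": {"203.0.113.0/24": "R2"},
       "R2": {"203.0.113.0/24": "R3"},
       "R3": {"203.0.113.0/24": "R2"},
   }
   print(find_loop(fib, "203.0.113.0/24", "R1"))  # -> ['R2', 'R3']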
MPLS label switched path setup, using either RSVP-TE or LDP, may fail for various reasons. Typical troubleshooting procedures are to log in to the device and check whether the failure lies in the configuration, a path computation error, or a link failure. Sometimes this requires checking multiple devices along the tunnel. Certain reason codes can be carried in the PathErr/ResvErr messages of RSVP-TE, while other data, such as authentication failures, are currently not supported for transmission to the path ingress/egress node. In this case, if the tunnel configurations of the devices along the tunnel, as well as the link states and other reasons diagnosed by each device, can be collected centrally, the server is able to perform a thorough analysis and find the root cause.
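A sketch of such centralized analysis in Python: per-device diagnostics collected along the tunnel are merged, and the first failing hop surfaces as the root cause. The report format and reason strings are illustrative assumptions, not RSVP-TE error codes.

   # Merge per-device diagnostics collected along an LSP; the first
   # failing hop along the path is reported as the root cause.  The
   # report format and reason strings are illustrative assumptions.
   tunnel_reports = [
       {"hop": 1, "device": "R1", "status": "ok"},
       {"hop": 2, "device": "R2", "status": "fail",
        "reason": "authentication failure"},
       {"hop": 3, "device": "R3", "status": "not-evaluated"},
   ]

   failures = [r for r in tunnel_reports if r["status"] == "fail"]
   if failures:
       first = min(failures, key=lambda r: r["hop"])
       print(f"LSP setup failed at {first['device']}:"
             f" {first['reason']}")
   else:
       print("no failure reported along the tunnel")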
Monitoring and analyzing network routing events not only helps identify the root causes of network issues, but also provides visibility into how routing changes affect network traffic. With the benefit of data plane telemetry, such as iOAM and IPFIX, network traffic matrices can be generated to give a glance at current network performance. More specifically, traffic matrices visualize current and historical network changes, such as link utilization, link delay, jitter, and so on. While traffic matrices show "what" the network changes are, control plane event monitoring, such as adjacency/peering failures, route flapping, and prefix advertisements/withdrawals, provides the "why".
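A minimal sketch of this "what"/"why" correlation pairs a data plane symptom with the nearest preceding control plane event by timestamp; the event names and timestamps below are illustrative.

   import bisect

   # Pair a data plane symptom ("what") with the nearest preceding
   # control plane event ("why") by timestamp.  Events/timestamps
   # are illustrative.
   cp_events = [(100.0, "BGP peer 10.0.0.1 down"),
                (180.0, "prefix 203.0.113.0/24 withdrawn")]
   symptom = (182.5, "traffic on link R1-R2 dropped")

   times = [t for t, _ in cp_events]
   i = bisect.bisect_right(times, symptom[0]) - 1
   if i >= 0:
       print(f"'{symptom[1]}' likely explained by"
             f" '{cp_events[i][1]}'")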
Route policy validation has been a great concern for operators when implementing new policies as well as when optimizing existing policies. Validation comes from two perspectives:
TBD
TBD
TBD