Network Telemetry Framework
draft-song-opsawg-ntf-01

Abstract

This document provides an architectural framework for network telemetry to meet current and future network operation requirements. The defining characteristics of network telemetry show a clear distinction from the conventional network Operations, Administration, and Maintenance (OAM) concept; hence network telemetry requires new procedures, methods, and protocols. This document clarifies the terminology and classifies the categories and components of a network telemetry framework. The requirements, challenges, existing solutions, and future directions are discussed for each category. The network telemetry framework and the taxonomy help to set a common ground for the collection of related works and put future technique and standard developments into perspective.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 22, 2019.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

1.1. Requirements Language

2. Motivation

2.1. Use Cases
2.2. Challenges
2.3. Glossary
2.4. Network Telemetry

3. The Necessity of a Network Telemetry Framework
4. Network Telemetry Framework

4.1. Existing Works Mapped in the Framework
4.2. Management Plane Telemetry

4.2.1. Requirements and Challenges
4.2.2. Push Extensions for NETCONF
4.2.3. gRPC Network Management Interface

4.3. Control Plane Telemetry

4.3.1. Requirements and Challenges
4.3.2. BGP Monitoring Protocol

4.4. Data Plane Telemetry

4.4.1. Requirements and Challenges
4.4.2. Technique Taxonomy
4.4.3. The IPFPM technology
4.4.4. Dynamic Network Probe
4.4.5. IP Flow Information Export (IPFIX) protocol
4.4.6. In-Situ OAM

4.5. External Data and Event Telemetry

4.5.1. Requirements and Challenges

5. Evolution of Network Telemetry
6. Security Considerations
7. IANA Considerations
8. Contributors
9. Acknowledgments
10. References

10.1. Normative References
10.2. Informative References

Authors' Addresses

1. Introduction

Network visibility is essential for network operation. Network telemetry has been widely accepted as the ideal means to gain full network visibility. However, there is still confusion and misunderstanding about the connotation of network telemetry. We need an unambiguous understanding of the concept so we can better align the related technology and standard developments.

First, we show some key characteristics of network telemetry which set a clear distinction from the conventional network OAM. We then provide an architectural framework for network telemetry to meet the current and future network operation requirements. Following the framework, we classify the components of a network telemetry system so we can easily map the existing and emerging techniques and protocols into the framework. The requirements, challenges, existing solutions, and future directions are discussed for each framework category. Finally, we outline a roadmap for the evolution of the network telemetry system.

The network telemetry framework and the taxonomy help to set a common ground for the collection of related works and put future technique and standard developments into perspective.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119][RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Motivation

The advance of Artificial Intelligence (AI), and specifically Machine Learning (ML), technologies gives networks an unprecedented opportunity to realize network autonomy with closed control loops. An intent-driven autonomous network is the logical next step for network evolution following Software-Defined Networking (SDN), aiming to reduce (or even eliminate) human labor, make the most efficient use of network resources, and provide better services more aligned with customer requirements. Although we still have a long way to go to reach the ultimate goal, the machine automation journey has nevertheless started.

The storage and computing technologies are already mature enough to retain and process a huge amount of data and make real-time inferences. Tools based on machine learning technologies and big data analytics are powerful in detecting and reacting to network faults, anomalies, and policy violations. In turn, the network policy updates for planning, intrusion prevention, optimization, and self-healing may be applied. Tools exist that will profile, classify, and predict future events based on historical data trends. However, to increase the accuracy of these predictive capabilities, and better support autonomous networking, improvements must be made. The current network architecture, protocol suite, and system design are not yet ready to provide enough quality data.

In the remainder of this section, we first identify the key network operation use cases that network operators need the most. These use cases are also the essential functions of the future autonomous networks. Next, we show why the current network OAM techniques and protocols are not sufficient to meet the requirements of these use cases. The discussion underlines the need for new methods, techniques, and protocols which we may assign under an umbrella term, Network Telemetry.

2.1. Use Cases

The use cases highlighted use data extracted from the network data plane, as well as control plane and management plane.

Intent and Policy Compliance:: Network policies are the rules that constrain the services for network access, provide differentiation within a service, or enforce specific treatment on the traffic. For example, a service function chain is a policy that requires the selected flows to pass through a set of network functions in order. An intent is a high-level abstract policy which requires a complex translation and mapping process before being applied on networks. While a policy is enforced, the compliance needs to be verified and monitored continuously.
SLA Compliance:: A Service-Level Agreement (SLA) defines the level of service a user expects from a network operator, which includes the metrics for the service measurement and remedy/penalty procedures when the service level misses the agreement. Users need to check if they get the service as promised and network operators need to evaluate how they can deliver the services that can meet the SLA.
Root Cause Analysis:: Network failure often involves a sequence of chained events and the source of the failure is not straightforward to identify, especially when the failure is sporadic. While machine learning or other data analytics technologies can be used for root cause analysis, it is up to the network to provide all the relevant data for analysis.
Load Balancing, Traffic Engineering, and Network Planning:: Network operators are motivated to optimize their network utilization for better ROI or lower CAPEX, as well as differentiation across services and/or users of a given service. The first step is to know the real-time network conditions before applying policies to steer the user traffic or adjust the load balancing algorithm. In some cases network micro-bursts need to be detected in a very short time-frame so that fine-grained traffic control can be applied to avoid possible network congestion. The long-term network capacity planning and topology augmentation also rely on the accumulated data of the network operation.
Event Tracking and Prediction:: Network path and performance visibility is critical for healthy network operation. Numerous network events are of interest to network operators. For example, network operators always want to learn where and why packets are dropped for an application flow. They also want to be warned of issues while proactive action may still be taken, before an issue becomes a catastrophic problem such as a component failure.

2.2. Challenges

The conventional OAM techniques, as described in [RFC7276], are not sufficient to support the above use cases for the following reasons:

Most use cases need to continuously monitor the network and dynamically refine the data collection in real-time and interactively. The poll-based low-frequency data collection is ill-suited for these applications. Streaming data directly pushed from the data source is preferred.
Various data is needed from any place ranging from the packet processing engine to the QoS traffic manager. Traditional data plane devices cannot provide the necessary probes. An open and programmable data plane is therefore needed.
Many application scenarios need to correlate data from multiple sources (e.g., from distributed nodes or from different network planes). A piecemeal solution often lacks the capability to consolidate the data from multiple sources. The composition of a complete solution, as partly proposed by Autonomic Resource Control Architecture (ARCA), will be empowered and guided by a comprehensive framework.
The passive measurement techniques can either consume too many network resources and render too much redundant data, or lead to inaccurate results. The active measurement techniques are indirect, and they can interfere with the user traffic. We need techniques that can collect direct and on-demand data from user traffic.

2.3. Glossary

Before further discussion, we list some key terminology and acronyms used in this document. We make an intentional distinction between network telemetry and network OAM.

AI:: Artificial Intelligence. Use machine-learning based technologies to automate network operation.
BMP:: BGP Monitoring Protocol
DNP:: Dynamic Network Probe
DPI:: Deep Packet Inspection
gNMI:: gRPC Network Management Interface
gRPC:: gRPC Remote Procedure Call
IDN:: Intent-Driven Network
IPFIX:: IP Flow Information Export Protocol
IPFPM:: IP Flow Performance Measurement
IOAM:: In-situ OAM
NETCONF:: Network Configuration Protocol
Network Telemetry:: A general term for a new breed of network visibility techniques and protocols, with the characteristics defined in this document. Network telemetry enables smooth evolution toward intent-driven autonomous networks.
NMS:: Network Management System
OAM:: Operations, Administration, and Maintenance. A group of network management functions that provide network fault indication, fault localization, performance information, and data and diagnosis functions. Most conventional network monitoring techniques and protocols belong to network OAM.
SNMP:: Simple Network Management Protocol
YANG:: A data modeling language for NETCONF
YANG FSM:: A YANG model to define device side finite state machine
YANG PUSH:: A method to subscribe pushed data from remote YANG datastore

2.4. Network Telemetry

For a long time, network operators have relied upon SNMP or Command-Line Interface (CLI) to monitor the network. SNMP and CLI can access limited Management Information Base (MIB) information from the management plane. Most existing implementations are mainly poll-based and support only low data rates with low timing accuracy. Such issues make SNMP and CLI insufficient for today's and tomorrow's network operations.

Network telemetry has emerged as a mainstream technical term to refer to the newer techniques of data collection and consumption, distinguishing itself from the conventional techniques for network OAM. The representative techniques and protocols include IPFIX and gRPC. SNMP is also evolving to support event notifications [RFC2981][RFC3877]. It is expected that network telemetry can provide the necessary network visibility for autonomous networks, address the shortcomings of conventional OAM techniques, and allow for the emergence of new techniques bearing certain characteristics.

One key difference between network telemetry and network OAM is that network telemetry assumes an intelligent machine in the center of a closed control loop, while network OAM assumes human network operators in the middle of an open control loop. Network telemetry can directly trigger automated network operation; the conventional OAM tools only help human operators to monitor and diagnose the networks and guide manual network operations. The different assumptions lead to very different techniques.

Although the network telemetry techniques are just emerging and subject to continuous evolution, several defining characteristics of network telemetry have been well accepted:

Push and Streaming: Instead of polling data from network devices, the telemetry collector subscribes to the streaming data pushed from the data source in network devices.
Volume and Velocity: The telemetry data is intended to be consumed by machines rather than by humans. Therefore, the data volume is huge and the processing is often in real time.
Normalization and Unification: Telemetry aims to address the overall network automation needs. The piecemeal solutions offered by the conventional OAM approach are no longer suitable. Efforts need to be made to normalize the data representation and unify the protocols.
Model-based: The data is model-based which allows applications to configure and consume data with ease.
Data Fusion: The data for a single application can come from multiple data sources (e.g., cross-domain, cross-device, and cross-layer) and needs to be correlated to take effect.
Dynamic and Interactive: Since network telemetry is meant to be used in a closed control loop for network automation, it needs to run continuously and adapt to the dynamic and interactive queries from the network operation controller.
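
The difference between the poll model and the push model can be illustrated with a minimal sketch. The `DataSource` and `Collector` classes, their method names, and the `queue_depth` counter are hypothetical and only stand in for a network device and a telemetry collector; they do not correspond to any protocol discussed in this document.

```python
from typing import Callable, Dict, List

Update = Dict[str, int]


class DataSource:
    """Hypothetical device-side data source holding named counters."""

    def __init__(self) -> None:
        self._counters: Dict[str, int] = {}
        self._subscribers: List[Callable[[Update], None]] = []

    def poll(self) -> Update:
        # Poll model: the collector asks and receives a snapshot.
        # Changes that happen between two polls are not visible.
        return dict(self._counters)

    def subscribe(self, callback: Callable[[Update], None]) -> None:
        # Push model: the collector registers once and is then notified.
        self._subscribers.append(callback)

    def set_counter(self, name: str, value: int) -> None:
        self._counters[name] = value
        for callback in self._subscribers:
            callback({name: value})


class Collector:
    """Hypothetical collector that records every pushed update."""

    def __init__(self) -> None:
        self.received: List[Update] = []

    def on_update(self, update: Update) -> None:
        self.received.append(update)


source = DataSource()
collector = Collector()
source.subscribe(collector.on_update)

# Three changes occur between two polls, including a short spike.
for value in (10, 95, 12):
    source.set_counter("queue_depth", value)

polled = source.poll()        # only the latest value is seen
pushed = collector.received   # every change is seen, including the spike
```

In this sketch the poll returns only the final value, while the pushed stream retains the transient spike, which is the kind of event that low-frequency polling tends to miss.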

The ideal network telemetry solution should also support the following features:

In-Network Customization: The data can be customized in network at run-time to cater to the specific need of applications. This needs the support of a programmable data plane which allows probes to be deployed at flexible locations.
Direct Data Plane Export: The data originated from data plane can be directly exported to the data consumer for efficiency, especially when the data bandwidth is large and the real-time processing is required.
In-band Data Collection: In addition to the passive and active data collection approaches, the new hybrid approach allows data to be collected directly for any target flow along its entire forwarding path.
Non-intrusive: The telemetry system should not fall into the trap of the "observer effect". That is, it should not change the network behavior or affect the forwarding performance.

3. The Necessity of a Network Telemetry Framework

Big data analytics and machine-learning based AI technologies are applied for network operation automation, relying on abundant data from networks. The single-sourced and static data acquisition cannot meet the data requirements. It is desirable to have a framework that integrates multiple telemetry approaches from different layers. This allows flexible combinations for different applications. The framework would benefit application development for the following reasons:

The future autonomous networks will require a holistic view on network visibility. All the use cases and applications need to be supported uniformly and coherently under a single intelligent agent. Therefore, the protocols and mechanisms should be consolidated into a minimum yet comprehensive set. A telemetry framework can help to normalize the technique developments.
Network visibility presents multiple viewpoints. For example, the device viewpoint takes the network infrastructure as the monitoring object from which the network topology and device status can be acquired; the traffic viewpoint takes the flows or packets as the monitoring object from which the traffic quality and path can be acquired. An application may need to switch its viewpoint during operation. It may also need to correlate a service with its impact on network experience to acquire comprehensive information.
Applications require network telemetry to be elastic in order to efficiently use the network resource and reduce the performance impact. Routine network monitoring covers the entire network with low data sampling rate. When issues arise or trends emerge, the telemetry data source can be modified and the data rate can be boosted.
Efficient data fusion is critical for applications to reduce the overall quantity of data and improve the accuracy of analysis.
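
The elastic behavior described above (a low routine sampling rate that is boosted when an issue arises) can be sketched as follows. The `ElasticSampler` class, its method names, and the chosen intervals are assumptions made for illustration only.

```python
class ElasticSampler:
    """Hypothetical 1-in-N sampler whose N can be changed at run time."""

    def __init__(self, routine_interval: int = 100, boosted_interval: int = 1) -> None:
        self.routine_interval = routine_interval
        self.boosted_interval = boosted_interval
        self.interval = routine_interval
        self._seen = 0

    def boost(self) -> None:
        # Called when an issue or trend is detected: export more data.
        self.interval = self.boosted_interval
        self._seen = 0

    def relax(self) -> None:
        # Called when the issue is resolved: return to routine monitoring.
        self.interval = self.routine_interval
        self._seen = 0

    def should_export(self) -> bool:
        # Export every `interval`-th observation.
        self._seen += 1
        if self._seen >= self.interval:
            self._seen = 0
            return True
        return False


sampler = ElasticSampler(routine_interval=100, boosted_interval=1)
routine_exports = sum(sampler.should_export() for _ in range(1000))
sampler.boost()
boosted_exports = sum(sampler.should_export() for _ in range(1000))
```

With these assumed intervals, 1000 observations produce 10 exports in routine mode and 1000 exports in boosted mode.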

So far, some telemetry related work has been done within IETF. However, this work is fragmented and scattered in different working groups. The lack of coherence makes it difficult to assemble a comprehensive network telemetry system and causes repetitive and redundant work.

A formal network telemetry framework is needed for constructing a working system. The framework should cover the concepts and components from the standardization perspective. This document clarifies the layers on which the telemetry is exerted and decomposes the telemetry system into a set of distinct components that the existing and future work can easily map to.

4. Network Telemetry Framework

Telemetry can be applied on the data plane, the control plane, and the management plane in a network, as well as other sources out of the network, as shown in Figure 1.

                +------------------------------+
                |                              |
                |       Network Operation      |<-------+
                |          Applications        |        |
                |                              |        |
                +------------------------------+        |
                     ^      ^           ^               |
                     |      |           |               |
                     V      |           V               V
                +-----------|---+--------------+  +-----------+
                |           |   |              |  |           |
                | Control Pl|ane|              |  | External  |
                | Telemetry | <--->            |  | Data and  | 
                |           |   |              |  | Event     |
                |      ^    V   |  Management  |  | Telemetry |
                +------|--------+  Plane       |  |           |
                |      V        |  Telemetry   |  +-----------+
                |               |              |
                | Data Plane  <--->            |
                | Telemetry     |              |
                |               |              |
                +---------------+--------------+

Figure 1: Layer Category of the Network Telemetry Framework

Note that the interaction with the network operation applications can be indirect. For example, in the management plane telemetry, the management plane may need to acquire data from the data plane. Some of the operational states, such as the interface status and statistics, can only be derived from the data plane. As another example, the control plane telemetry may need to access the Forwarding Information Base (FIB) in the data plane. On the other hand, an application may involve more than one plane simultaneously. For example, an SLA compliance application may require both the data plane telemetry and the control plane telemetry.

At each plane, the telemetry can be further partitioned into five distinct components:

Data Source:: Determine where the original data is acquired. The data source usually just provides raw data which needs further processing. A data source can be considered a probe. A probe can be statically installed or dynamically installed.
Data Subscription:: Determine the protocol and channel for applications to acquire desired data. Data subscription is also responsible for defining the desired data that might not be directly available from data sources. The subscription data can be described by a model. The model can be statically installed or dynamically installed.
Data Generation:: The original data needs to be processed, encoded, and formatted in network devices to meet application subscription requirements. This may involve in-network computing and processing on either the fast path or the slow path in network devices.
Data Export:: Determine how the ready data are delivered to applications.
Data Analysis and Storage:: In this final step, data is consumed by applications or stored for future reference. Data analysis can be interactive. It may initiate further data subscription.

                +------------------------------+
                |                              |
                |    Data Analysis/Storage     |
                |                              |
                +------------------------------+
                        |               ^
                        |               |
                        V               |
                +---------------+--------------+
                |               |              |
                | Data          | Data         |
                | Subscription  | Export       |
                |               |              |
                +---------------+--------------+
                |                              |
                |       Data Generation        |
                |                              |
                +------------------------------+
                |                              |
                |       Data Source            |
                |                              |
                +------------------------------+

Figure 2: Components in the Network Telemetry Framework
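
The way the five components fit together can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function names, the record fields, the counter values, and the use of JSON as the export encoding are all hypothetical and are not prescribed by the framework.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class Subscription:
    """Data Subscription: the fields the application asks for."""
    fields: List[str]


def data_source() -> Iterator[Dict[str, int]]:
    """Data Source: raw per-interface records (invented values)."""
    yield {"ifindex": 1, "rx_bytes": 1000, "tx_bytes": 400, "drops": 0}
    yield {"ifindex": 2, "rx_bytes": 50, "tx_bytes": 70, "drops": 3}


def data_generation(raw: Iterable[Dict[str, int]],
                    sub: Subscription) -> Iterator[Dict[str, int]]:
    """Data Generation: shape raw data to match the subscription."""
    for record in raw:
        yield {name: record[name] for name in sub.fields}


def data_export(records: Iterable[Dict[str, int]]) -> List[str]:
    """Data Export: encode the ready data for delivery."""
    return [json.dumps(record, sort_keys=True) for record in records]


def data_analysis(messages: Iterable[str]) -> List[int]:
    """Data Analysis: here, find the interfaces that report drops."""
    decoded = [json.loads(message) for message in messages]
    return [record["ifindex"] for record in decoded if record["drops"] > 0]


subscription = Subscription(fields=["ifindex", "drops"])
exported = data_export(data_generation(data_source(), subscription))
lossy_interfaces = data_analysis(exported)
```

The subscription drives what the generation step emits, so only the two requested fields per record leave the device in this sketch.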

Since most existing standard-related work belongs to the first four components, in the remainder of the document, we focus on these components only.

4.1. Existing Works Mapped in the Framework

The following table provides a non-exhaustive list of existing works (mainly published in IETF and with the emphasis on the latest new technologies) and shows their positions in the framework.

         +-----------+--------------+---------------+--------------+
         |           | Management   | Control       | Data         |
         |           | Plane        | Plane         | Plane        |
         +-----------+--------------+---------------+--------------+
         |           | YANG Data    | Control Proto.| Flow/Packet  | 
         | Data      | Store        | Network State | Statistics   | 
         | Source    |              |               | States       |  
         |           |              |               | DPI          | 
         +-----------+--------------+---------------+--------------+
         |           | gRPC         | NETCONF/YANG  | NETCONF/YANG | 
         | Data      | YANG PUSH    | BGP           | YANG FSM     |
         | Subscribe |              |               |              |
         |           |              |               |              |
         +-----------+--------------+---------------+--------------+
         |           | Soft DNP     | Soft DNP      | In-situ OAM  | 
         | Data      |              |               | IPFPM        |   
         | Generation|              |               | Hard DNP     | 
         |           |              |               |              |
         +-----------+--------------+---------------+--------------+
         |           | gRPC         | BMP           | IPFIX        |
         | Data      | YANG PUSH    |               | UDP          |  
         | Export    | UDP          |               |              |
         |           |              |               |              |
         +-----------+--------------+---------------+--------------+

Figure 3: Existing Work

4.2. Management Plane Telemetry

4.2.1. Requirements and Challenges

The management plane of the network element interacts with the Network Management System (NMS), and provides information such as performance data, network logging data, network warning and defects data, and network statistics and state data. Some legacy protocols are widely used for the management plane, such as SNMP and Syslog. However, these protocols are insufficient to meet the requirements of the automatic network operation applications.

New management plane telemetry protocols should consider the following requirements:

Convenient Data Subscription:: An application should have the freedom to choose the data export means such as the data types and the export frequency.
Structured Data:: For automatic network operation, machines will replace humans for network data comprehension. Schema languages such as YANG can efficiently describe structured data and normalize data encoding and transformation.
High Speed Data Transport:: In order to retain the information, a server needs to send a large amount of data at high frequency. Compact encoding formats are needed to compress the data and improve the data transport efficiency. The push mode, by replacing the poll mode, can also reduce the interactions between clients and servers, which helps to improve the server's efficiency.
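
The benefit of a compact encoding can be illustrated by encoding the same record as text and as fixed-width binary. The record fields, their values, and the binary layout below are assumptions chosen for the example; real telemetry protocols define their own encodings.

```python
import json
import struct

# Hypothetical interface counters to be exported.
sample = {
    "ifindex": 7,
    "rx_packets": 123456789,
    "tx_packets": 987654321,
    "drops": 42,
}

# Text encoding: self-describing, but field names are repeated in every record.
text_encoding = json.dumps(sample).encode("utf-8")

# Assumed binary layout in network byte order:
# ifindex (uint32), rx_packets (uint64), tx_packets (uint64), drops (uint32).
LAYOUT = struct.Struct("!IQQI")
binary_encoding = LAYOUT.pack(
    sample["ifindex"],
    sample["rx_packets"],
    sample["tx_packets"],
    sample["drops"],
)

# The receiver needs the same layout (a shared schema) to decode the record.
decoded = LAYOUT.unpack(binary_encoding)
sizes = (len(text_encoding), len(binary_encoding))
```

The binary form occupies 24 bytes here, well below the size of the text form. The trade-off is that both ends must agree on the layout in advance, which is one role a data model can play.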

4.2.2. Push Extensions for NETCONF

NETCONF is a popular network management protocol, which is also recommended by the IETF. Although it can be used for data collection, its main strength is configuration. YANG Push extends NETCONF and enables subscriber applications to request a continuous, customized stream of updates from a YANG datastore. Providing such visibility into changes made upon YANG configuration and operational objects enables new capabilities based on the remote mirroring of configuration and operational state. Moreover, a distributed data collection mechanism via a UDP-based publication channel provides enhanced efficiency for NETCONF-based telemetry.
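
A subscription to a stream of datastore updates can be served periodically or only when a value changes. The following sketch contrasts the two behaviors on a single datastore leaf. It is a simulation only: the function names and the sample history are invented, and no NETCONF or YANG Push messages are produced.

```python
from typing import List, Tuple

Sample = Tuple[int, str]

# (tick, value) history of one hypothetical leaf, e.g. an interface state.
history: List[Sample] = [
    (0, "up"), (1, "up"), (2, "down"), (3, "down"), (4, "up"), (5, "up"),
]


def periodic_updates(samples: List[Sample], period: int) -> List[Sample]:
    """Report the current value every `period` ticks, changed or not."""
    return [(tick, value) for tick, value in samples if tick % period == 0]


def on_change_updates(samples: List[Sample]) -> List[Sample]:
    """Report a value only when it differs from the last reported one."""
    updates: List[Sample] = []
    previous = None
    for tick, value in samples:
        if value != previous:
            updates.append((tick, value))
            previous = value
    return updates


periodic = periodic_updates(history, period=3)
on_change = on_change_updates(history)
```

In this example the periodic stream reports the "down" state one tick late and has not yet reported the recovery, whereas the on-change stream reports each transition at the tick where it occurs.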

4.2.3. gRPC Network Management Interface

gRPC Network Management Interface (gNMI) is a network management protocol based on the gRPC RPC (Remote Procedure Call) framework. With a single gRPC service definition, both configuration and telemetry can be covered. gRPC is an HTTP/2 based open source micro service communication framework. It provides a number of capabilities which are well-suited for network telemetry, including:

A full-duplex streaming transport model combined with a binary encoding mechanism, which further improves telemetry efficiency.
Consistency of higher-level features across platforms, which common HTTP/2 libraries typically do not provide. This characteristic is especially valuable because telemetry data collectors normally reside on a large variety of platforms.
The built-in load-balancing and failover mechanism.

4.3. Control Plane Telemetry

4.3.1. Requirements and Challenges

The control plane telemetry refers to the health condition monitoring of different network protocols, which covers Layer 2 to Layer 7. Keeping track of the running status of these protocols is beneficial for detecting, localizing, and even predicting various network issues, as well as network optimization, in real-time and in fine granularity.

One of the most challenging problems for the control plane telemetry is how to correlate the E2E Key Performance Indicators (KPI) to a specific layer's KPIs. For example, an IPTV user may describe his User Experience (UE) by the video fluency and definition. Then in case of an unusually poor UE KPI or a service disconnection, it is non-trivial work to delimit and localize the issue to the responsible protocol layer (e.g., the Transport Layer or the Network Layer), the responsible protocol (e.g., ISIS or BGP at the Network Layer), and finally the responsible device(s) with specific reasons.

Traditional OAM-based approaches for control plane KPI measurement include PING (L3), Tracert (L3), Y.1731 (L2) and so on. One common issue behind these methods is that they only measure the KPIs instead of reflecting the actual running status of these protocols, making them less effective or efficient for control plane troubleshooting and network optimization. An example of the control plane telemetry is the BGP Monitoring Protocol (BMP). It is currently used to monitor BGP routes and enables rich applications, such as BGP peer analysis, AS analysis, prefix analysis, security analysis, and so on. However, the monitoring of other layers and protocols, and the cross-layer, cross-protocol KPI correlations, are still in their infancy (e.g., the IGP monitoring is missing) and require substantial further research.

4.3.2. BGP Monitoring Protocol

BGP Monitoring Protocol (BMP) is used to monitor BGP sessions and is intended to provide a convenient interface for obtaining route views.

The BGP routing information is collected from the monitored device(s) to the BMP monitoring station by setting up the BMP TCP session. The BGP peers are monitored by the BMP Peer Up and Peer Down Notifications. The BGP routes (including Adjacency_RIB_In, Adjacency_RIB_Out, and Local_RIB) are encapsulated in the BMP Route Monitoring Message and the BMP Route Mirroring Message, in the form of both initial table dump and real-time route update. In addition, BGP statistics are reported through the BMP Stats Report Message, which could be either timer triggered or event-driven. More BMP extensions can be explored to enrich the applications of BGP monitoring.
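
As an illustration of how a monitoring station might tell these message types apart, the sketch below decodes the BMP common header. The layout (a 1-octet version, a 4-octet message length, and a 1-octet message type) and the type codes follow our reading of RFC 7854 and should be checked against that specification. The helper names and the example bytes are hypothetical; the bytes are a hand-built header fragment, not a captured message.

```python
import struct
from typing import NamedTuple

# Message type codes as we understand them from RFC 7854.
BMP_MESSAGE_TYPES = {
    0: "Route Monitoring",
    1: "Statistics Report",
    2: "Peer Down Notification",
    3: "Peer Up Notification",
    4: "Initiation",
    5: "Termination",
    6: "Route Mirroring",
}

# Version (1 octet), message length (4 octets), message type (1 octet).
COMMON_HEADER = struct.Struct("!BIB")


class BmpHeader(NamedTuple):
    version: int
    length: int      # total message length, including this header
    msg_type: str


def parse_common_header(data: bytes) -> BmpHeader:
    """Decode the first 6 bytes of a BMP message."""
    if len(data) < COMMON_HEADER.size:
        raise ValueError("need at least 6 bytes for the BMP common header")
    version, length, type_code = COMMON_HEADER.unpack_from(data)
    type_name = BMP_MESSAGE_TYPES.get(type_code, f"Unknown ({type_code})")
    return BmpHeader(version, length, type_name)


# Header fragment only: version 3, declared length 48, type 4.
example = bytes([3, 0, 0, 0, 48, 4])
header = parse_common_header(example)
```

A real station would read `length` bytes from the TCP stream before dispatching on the message type; that framing loop is omitted here.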

4.4. Data Plane Telemetry

4.4.1. Requirements and Challenges

An effective data plane telemetry system relies on the data that the network device can expose. The data's quality, quantity, and timeliness must meet some stringent requirements. This raises some challenges to the network data plane devices where the first-hand data originate.

A data plane device's main function is user traffic processing and forwarding. While supporting network visibility is important, the telemetry is just an auxiliary function, and it should not impede normal traffic processing and forwarding (i.e., the performance is not lowered and the behavior is not altered due to the telemetry functions).
The network operation applications require end-to-end visibility from various sources, which results in a huge volume of data. However, the sheer data quantity should not stress the network bandwidth, regardless of the data delivery approach (i.e., through in-band or out-of-band channels).
The data plane devices must provide timely data with the minimum possible delay. Long processing, transport, storage, and analysis delay can impact the effectiveness of the control loop and even render the data useless.
The data should be structured and labeled, and easy for applications to parse and consume. At the same time, the data types needed by applications can vary significantly. The data plane devices need to provide enough flexibility and programmability to support the precise data provision for applications.
The data plane telemetry should support incremental deployment and work even though some devices are unaware of the system. This challenge is highly relevant to the standards and legacy networks.

It is widely agreed in the industry that data plane programmability is essential to support network telemetry. Many newer data plane chips are equipped with advanced telemetry features and provide flexibility to support customized telemetry functions.

4.4.2. Technique Taxonomy

The data plane telemetry techniques can be classified along multiple dimensions.

Active and Passive:: The active and passive methods (as well as the hybrid types) are well documented in [RFC7799]. The passive methods include TCPDUMP, IPFIX, sFlow, and traffic mirroring. These methods usually have low data coverage, and improving the coverage comes at a very high bandwidth cost. On the other hand, the active methods include Ping, Traceroute, OWAMP, and TWAMP. These methods are intrusive and only provide indirect network measurement results. The hybrid methods, including in-situ OAM, IPFPM, and Multipoint Alternate Marking, provide a well-balanced and more flexible approach. However, these methods are also more complex to implement.
In-Band and Out-of-Band:: The telemetry data, before being exported to some collector, can be carried in user packets. Such methods are considered in-band (e.g., in-situ OAM). If the telemetry data is directly exported to some collector without modifying the user packets, such methods are considered out-of-band (e.g., postcard-based INT). Hybrid methods are also possible, in which, for example, only the telemetry instruction or partial data is carried by user packets (e.g., IPFPM).
E2E and In-Network:: Some E2E methods start from and end at the network end hosts (e.g., Ping). The other methods work in networks and are transparent to end hosts. However, if needed, the in-network methods can be easily extended into end hosts.
Flow, Path, and Node:: Depending on the telemetry objective, the methods can be flow-based (e.g., in-situ OAM), path-based (e.g., Traceroute), or node-based (e.g., IPFIX).
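The four dimensions above can be read as independent attributes of a technique. The sketch below records only the classifications that this section states explicitly; attributes the text does not assign are left as None rather than guessed. The class, field, and function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Technique:
    name: str
    mode: Optional[str] = None       # "active", "passive", or "hybrid"
    channel: Optional[str] = None    # "in-band", "out-of-band", or "hybrid"
    scope: Optional[str] = None      # "e2e" or "in-network"
    objective: Optional[str] = None  # "flow", "path", or "node"


# Only classifications named in the taxonomy text are filled in.
TECHNIQUES = [
    Technique("in-situ OAM", mode="hybrid", channel="in-band", objective="flow"),
    Technique("IPFPM", mode="hybrid", channel="hybrid"),
    Technique("IPFIX", mode="passive", objective="node"),
    Technique("Ping", mode="active", scope="e2e"),
    Technique("Traceroute", mode="active", objective="path"),
    Technique("postcard-based INT", channel="out-of-band"),
]


def select(**criteria):
    """Return the names of the techniques matching every given attribute."""
    return [
        t.name
        for t in TECHNIQUES
        if all(getattr(t, key) == value for key, value in criteria.items())
    ]


print(select(mode="hybrid"))
print(select(mode="active", objective="path"))
```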

4.4.3. The IPFPM technology

The Alternate Marking method efficiently performs packet loss, delay, and jitter measurements in both IP and overlay networks, as presented in IPFPM and [I-D.fioccola-ippm-multipoint-alt-mark].

This technique can be applied to point-to-point and multipoint-to-multipoint flows. Alternate Marking creates batches of packets by alternating the value of 1 bit (or a label) of the packet header. These batches of packets are unambiguously recognized over the network and the comparison of packet counters for each batch allows the packet loss calculation. The same idea can be applied to delay measurement by selecting ad hoc packets with a marking bit dedicated for delay measurements.

The Alternate Marking method needs two counters per marking period for each monitored flow. For instance, with n measurement points and m monitored flows, the order of magnitude of the packet counters for each time interval is n*m*2 (one per color).
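The batching and the counter arithmetic described above can be sketched as follows. This is a simplified model, not the method's specification: packets are represented only by their send times, the marking bit is derived from the marking period, and the function names and sample values are assumptions. A real measurement point would count packets by the marking bit they carry.

```python
from collections import Counter


def counters_per_interval(n_points, m_flows, colors=2):
    """Order of magnitude of packet counters: one per point, flow, and color."""
    return n_points * m_flows * colors


def batch_of(timestamp, period):
    """Return (batch index, marking bit) for a packet sent at `timestamp`."""
    batch = int(timestamp // period)
    return batch, batch % 2  # the marking bit alternates every period


def count_per_batch(send_times, period):
    """Count packets per batch, as a measurement point does for each color."""
    return Counter(batch_of(t, period)[0] for t in send_times)


def loss_per_batch(upstream, downstream):
    """Packet loss of each batch: upstream count minus downstream count."""
    return {batch: upstream[batch] - downstream[batch] for batch in sorted(upstream)}


# Example: six packets sent with a 1-second marking period; one is dropped.
sent = [0.1, 0.4, 0.9, 1.2, 1.5, 2.3]
received = [t for t in sent if t != 0.4]
loss = loss_per_batch(count_per_batch(sent, 1.0), count_per_batch(received, 1.0))
print(loss)
print(counters_per_interval(n_points=4, m_flows=10))
```

Because each batch is delimited by a change of the marking bit, the two measurement points can compare their counters for a batch without tight clock synchronization, as long as the counter of the previous color is read after the batch has fully arrived.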

Since networks offer rich sets of network performance measurement data (e.g., packet counters), traditional approaches run into limitations. One reason is that the bottleneck is the generation and export of the data, together with the amount of data that can reasonably be collected from the network. In addition, management tasks related to determining and configuring which data to generate lead to significant deployment challenges.

The Multipoint Alternate Marking approach, described in [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue and makes the performance monitoring more flexible in case a detailed analysis is not needed.

An application orchestrates network performance measurement tasks across the network to optimize the monitoring. It can calibrate how much monitoring data is obtained from the network by configuring the measurement points coarsely or meticulously.

Using Alternate Marking, it is possible to monitor a multipoint network without examining it in depth by using network clustering (clusters are subnetworks that are portions of the entire network and preserve the same property as the entire network). If there is packet loss or the delay is too high, the filtering criteria can be made more specific in order to perform a detailed analysis, using a different combination of clusters, down to a per-flow measurement as described in IPFPM.

In summary, an application can configure end-to-end network monitoring. If the network does not experience issues, this approximate monitoring is good enough and is very cheap in terms of network resources. However, in case of problems, the application becomes aware of the issues from this approximate monitoring and, in order to localize the portion of the network that has issues, configures the measurement points more exhaustively, so that a new detailed monitoring is performed. After the problem is detected and resolved, the initial approximate monitoring can be used again.
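The coarse-to-fine procedure summarized above can be sketched as a recursive search over a hierarchy of clusters. The cluster tree, the `measure_loss` callback, and the threshold are assumptions made for illustration; the Multipoint Alternate Marking draft defines how clusters are actually derived.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    name: str
    children: List["Cluster"] = field(default_factory=list)


def localize(cluster, measure_loss, threshold=0):
    """Return the smallest clusters whose measured loss exceeds `threshold`.

    A cluster is refined into its sub-clusters only when its own, cheaper
    measurement already shows a problem.
    """
    if measure_loss(cluster) <= threshold:
        return []  # approximate monitoring is good enough here
    found = []
    for child in cluster.children:
        found.extend(localize(child, measure_loss, threshold))
    # If no sub-cluster explains the loss, report the cluster itself.
    return found or [cluster.name]


# Example hierarchy with hypothetical per-cluster loss measurements.
network = Cluster("net", [Cluster("A", [Cluster("A1"), Cluster("A2")]), Cluster("B")])
observed = {"net": 5, "A": 5, "A1": 0, "A2": 5, "B": 0}
print(localize(network, lambda c: observed[c.name]))
```

In the healthy case only the top-level cluster is measured, which matches the observation that approximate monitoring is cheap in terms of network resources.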

4.4.4. Dynamic Network Probe

Hardware-based Dynamic Network Probe (DNP) provides a programmable means to customize the data that an application collects from the data plane. A direct benefit of DNP is the reduction of the exported data. A full DNP solution covers several components, including data source, data subscription, and data generation. The data subscription defines the custom data, which can be composed and derived from the raw data sources. The data generation takes advantage of moderate in-network computing to produce the desired data.

While DNP can introduce unforeseeable flexibility to the data plane telemetry, it also faces some challenges. It requires a flexible data plane that can be dynamically reprogrammed at run-time. The programming API is yet to be defined.

4.4.5. IP Flow Information Export (IPFIX) protocol

Traffic on a network can be seen as a set of flows passing through network elements. IP Flow Information Export (IPFIX) provides a means of transmitting traffic flow information for administrative or other purposes. A typical IPFIX-enabled system includes a pool of Metering Processes that collect data packets at one or more Observation Points, optionally filter them, and aggregate information about these packets. An Exporter then gathers the Observation Points together into an Observation Domain and sends this information via the IPFIX protocol to a Collector.
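The roles described above can be sketched as two small functions: a metering step that filters and aggregates packets into flow records, and an export step that groups the records of several Observation Points under one Observation Domain. This is a conceptual model only; it does not produce the IPFIX wire format of [RFC7011], and the packet fields, function names, and the "eth0" label are assumptions.

```python
from collections import defaultdict


def meter(packets, keep=lambda pkt: True):
    """Metering Process sketch: filter packets, then aggregate them per flow."""
    flows = defaultdict(lambda: {"packets": 0, "octets": 0})
    for pkt in packets:
        if not keep(pkt):
            continue  # optional filtering step
        key = (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])
        flows[key]["packets"] += 1
        flows[key]["octets"] += pkt["length"]
    return dict(flows)


def export(observation_domain_id, records_by_point):
    """Exporter sketch: group per-point flow records under one Observation Domain."""
    return {
        "observation_domain_id": observation_domain_id,
        "records": [
            {"observation_point": point, "flow": key, **counters}
            for point, flows in records_by_point.items()
            for key, counters in flows.items()
        ],
    }


# Example: three packets seen at one Observation Point, filtered to TCP only.
packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 6, "sport": 1234, "dport": 80, "length": 100},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 6, "sport": 1234, "dport": 80, "length": 60},
    {"src": "10.0.0.3", "dst": "10.0.0.2", "proto": 17, "sport": 53, "dport": 5353, "length": 40},
]
tcp_only = meter(packets, keep=lambda pkt: pkt["proto"] == 6)
message = export(1, {"eth0": tcp_only})
print(message)
```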

4.4.6. In-Situ OAM

Traditional passive and active monitoring and measurement techniques are either inaccurate or resource-consuming. It is preferable to directly acquire data associated with a flow's packets when the packets pass through a network. In-situ OAM (iOAM), a data generation technique, embeds a new instruction header into user packets, and the instruction directs the network nodes to add the requested data to the packets. Thus, at the path end, the packet's experience gained on the entire forwarding path can be collected. Such firsthand data is invaluable to many network OAM applications.

However, iOAM also faces some challenges. Issues of performance impact, security, scalability and overhead limits, encapsulation difficulties in some protocols, and cross-domain deployment need to be addressed.
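The data generation step of iOAM described in this section can be modeled as a packet that carries an instruction and accumulates one record per node. The sketch below is a simulation under stated assumptions, not the iOAM encapsulation: the `Packet` class, the three functions, and the `node_id` and `queue_depth` items are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    instruction: tuple = ()                    # data items requested from each node
    trace: list = field(default_factory=list)  # per-hop data added along the path


def encapsulate(packet, items):
    """Path start: attach the instruction listing the requested data items."""
    packet.instruction = tuple(items)
    return packet


def transit(packet, node_state):
    """Any node on the path: append the requested items to the packet."""
    if packet.instruction:
        packet.trace.append({item: node_state[item] for item in packet.instruction})
    return packet


def decapsulate(packet):
    """Path end: strip the telemetry data and hand it to the collector."""
    trace = packet.trace
    packet.instruction, packet.trace = (), []
    return trace


# Example: a three-node path; only the requested items are recorded.
path = [
    {"node_id": 1, "queue_depth": 3, "cpu_load": 10},
    {"node_id": 2, "queue_depth": 17, "cpu_load": 55},
    {"node_id": 3, "queue_depth": 0, "cpu_load": 5},
]
pkt = encapsulate(Packet(b"user data"), ["node_id", "queue_depth"])
for node in path:
    transit(pkt, node)
trace = decapsulate(pkt)
print(trace)
```

The trace grows by one record per hop, which makes the overhead and scalability concerns listed above concrete: the added bytes scale with both the path length and the number of requested items.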

4.5. External Data and Event Telemetry

Events that occur outside the boundaries of the network system are another important source of telemetry information. Correlating both internal telemetry data and external events with the requirements of network systems, as presented in "Exploiting External Event Detectors to Anticipate Resource Requirements for the Elastic Adaptation of SDN/NFV Systems" [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and functional advantage to management operations.

4.5.1. Requirements and Challenges

As with other sources of telemetry information, the data and events must meet strict requirements, especially in terms of timeliness, which is essential to properly incorporate external event information to management cycles. Thus, the specific challenges are described as follows:

The role of external event detector can be played by multiple elements, including hardware (e.g., physical sensors, such as seismometers) and software (e.g., Big Data sources that analyze streams of information, such as Twitter messages). Thus, the transmitted data must support different shapes but, at the same time, follow a common but extensible ontology.
Since the main function of the external event detectors is to perform the notifications, their timeliness is assumed. However, once messages have been dispatched, they must be quickly collected and inserted into the control plane with variable priority, which will be high for important sources and/or important events and low for secondary ones.
The ontology used by external detectors must be easily adopted by current and future devices and applications. Therefore, it must be easily mapped to current information models, for example those expressed in YANG.

Combining internal and external telemetry information will be key to fully exploiting the management capabilities of current and future network systems, as reflected in the incorporation of cognitive capabilities into new hardware and software (virtual) elements.
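The requirement that dispatched messages be collected and inserted into the control plane with variable priority can be sketched with a priority queue. The priority values, the source names, and the `boost` parameter for important events are assumptions made for this illustration; lower numbers are served first.

```python
import heapq
import itertools


class EventQueue:
    """Collects external events and releases the most urgent one first."""

    def __init__(self, source_priority, default=10):
        self._heap = []
        self._order = itertools.count()  # keeps arrival order among equal priorities
        self._source_priority = source_priority
        self._default = default

    def push(self, source, event, boost=0):
        """Queue an event; `boost` raises the urgency of an important event."""
        priority = self._source_priority.get(source, self._default) - boost
        heapq.heappush(self._heap, (priority, next(self._order), source, event))

    def pop(self):
        """Return the (source, event) pair with the highest urgency."""
        _, _, source, event = heapq.heappop(self._heap)
        return source, event


# Example: a seismometer is an important source; social media is secondary
# unless a particular event is flagged as important.
queue = EventQueue({"seismometer": 0, "twitter": 5})
queue.push("twitter", "topic trending")
queue.push("seismometer", "tremor detected")
queue.push("twitter", "outage reports", boost=5)
drained = [queue.pop() for _ in range(3)]
print(drained)
```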

5. Evolution of Network Telemetry

As networks evolve towards automated operation, network telemetry also undergoes several levels of evolution.

Level 0 - Static Telemetry:: The telemetry data is determined at design time. The network operator can only configure how to use it with limited flexibility.
Level 1 - Dynamic Telemetry:: The telemetry data can be dynamically programmed or configured at runtime, allowing a tradeoff among resource, performance, flexibility, and coverage. DNP is an effort towards this direction.
Level 2 - Interactive Telemetry:: The network operator can continuously customize the telemetry data in real time to reflect the network operation's visibility requirements. At this level, some tasks can be automated, although ultimately human operators will still need to sit in the middle to make decisions.
Level 3 - Closed-loop Telemetry:: Human operators are completely excluded from the control loop. The intelligent network operation engine automatically issues the telemetry data request, analyzes the data, and updates the network operations in closed control loops.

While most of the existing technologies belong to level 0 and level 1, with the help of a clearly defined network telemetry framework, we can assemble the technologies to support level 2 and make solid steps towards level 3.

6. Security Considerations

TBD

7. IANA Considerations

This document includes no request to IANA.

8. Contributors

The other major contributors to this document are listed as follows.

Daniel King
Yunan Gu

9. Acknowledgments

We would like to thank Adrian Farrel, Randy Presuhn, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, and many others who have provided helpful comments and suggestions to improve this document.

10. References

10.1. Normative References

[RFC2119]	Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.
[RFC8174]	Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017.

10.2. Informative References

[I-D.brockners-inband-oam-requirements]	Brockners, F., Bhandari, S., Dara, S., Pignataro, C., Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi, T., Lapukhov, P. and R. Chang, "Requirements for In-situ OAM", Internet-Draft draft-brockners-inband-oam-requirements-03, March 2017.
[I-D.fioccola-ippm-multipoint-alt-mark]	Fioccola, G., Cociglio, M., Sapio, A. and R. Sisto, "Multipoint Alternate Marking method for passive and hybrid performance monitoring", Internet-Draft draft-fioccola-ippm-multipoint-alt-mark-04, June 2018.
[I-D.ietf-grow-bmp-adj-rib-out]	Evens, T., Bayraktar, S., Lucente, P., Mi, K. and S. Zhuang, "Support for Adj-RIB-Out in BGP Monitoring Protocol (BMP)", Internet-Draft draft-ietf-grow-bmp-adj-rib-out-02, September 2018.
[I-D.ietf-grow-bmp-local-rib]	Evens, T., Bayraktar, S., Bhardwaj, M. and P. Lucente, "Support for Local RIB in BGP Monitoring Protocol (BMP)", Internet-Draft draft-ietf-grow-bmp-local-rib-02, September 2018.
[I-D.ietf-netconf-udp-pub-channel]	Zheng, G., Zhou, T. and A. Clemm, "UDP based Publication Channel for Streaming Telemetry", Internet-Draft draft-ietf-netconf-udp-pub-channel-04, October 2018.
[I-D.ietf-netconf-yang-push]	Clemm, A., Voit, E., Prieto, A., Tripathy, A., Nilsen-Nygaard, E., Bierman, A. and B. Lengyel, "YANG Datastore Subscription", Internet-Draft draft-ietf-netconf-yang-push-19, September 2018.
[I-D.kumar-rtgwg-grpc-protocol]	Kumar, A., Kolhe, J., Ghemawat, S. and L. Ryan, "gRPC Protocol", Internet-Draft draft-kumar-rtgwg-grpc-protocol-00, July 2016.
[I-D.openconfig-rtgwg-gnmi-spec]	Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, C. and C. Morrow, "gRPC Network Management Interface (gNMI)", Internet-Draft draft-openconfig-rtgwg-gnmi-spec-01, March 2018.
[I-D.pedro-nmrg-anticipated-adaptation]	Martinez-Julia, P., "Exploiting External Event Detectors to Anticipate Resource Requirements for the Elastic Adaptation of SDN/NFV Systems", Internet-Draft draft-pedro-nmrg-anticipated-adaptation-02, June 2018.
[I-D.song-opsawg-dnp4iq]	Song, H. and J. Gong, "Requirements for Interactive Query with Dynamic Network Probes", Internet-Draft draft-song-opsawg-dnp4iq-01, June 2017.
[I-D.zhou-netconf-multi-stream-originators]	Zhou, T., Zheng, G., Voit, E., Clemm, A. and A. Bierman, "Subscription to Multiple Stream Originators", Internet-Draft draft-zhou-netconf-multi-stream-originators-03, October 2018.
[RFC1157]	Case, J., Fedor, M., Schoffstall, M. and J. Davin, "Simple Network Management Protocol (SNMP)", RFC 1157, DOI 10.17487/RFC1157, May 1990.
[RFC2981]	Kavasseri, R., "Event MIB", RFC 2981, DOI 10.17487/RFC2981, October 2000.
[RFC3416]	Presuhn, R., "Version 2 of the Protocol Operations for the Simple Network Management Protocol (SNMP)", STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.
[RFC3877]	Chisholm, S. and D. Romascanu, "Alarm Management Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, September 2004.
[RFC4656]	Shalunov, S., Teitelbaum, B., Karp, A., Boote, J. and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.
[RFC5357]	Hedayat, K., Krzanowski, R., Morton, A., Yum, K. and J. Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", RFC 5357, DOI 10.17487/RFC5357, October 2008.
[RFC6241]	Enns, R., Bjorklund, M., Schoenwaelder, J. and A. Bierman, "Network Configuration Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.
[RFC7011]	Claise, B., Trammell, B. and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, September 2013.
[RFC7276]	Mizrahi, T., Sprecher, N., Bellagamba, E. and Y. Weingarten, "An Overview of Operations, Administration, and Maintenance (OAM) Tools", RFC 7276, DOI 10.17487/RFC7276, June 2014.
[RFC7540]	Belshe, M., Peon, R. and M. Thomson, "Hypertext Transfer Protocol Version 2 (HTTP/2)", RFC 7540, DOI 10.17487/RFC7540, May 2015.
[RFC7799]	Morton, A., "Active and Passive Metrics and Methods (with Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, May 2016.
[RFC7854]	Scudder, J., Fernando, R. and S. Stuart, "BGP Monitoring Protocol (BMP)", RFC 7854, DOI 10.17487/RFC7854, June 2016.
[RFC8321]	Fioccola, G., Capello, A., Cociglio, M., Castaldelli, L., Chen, M., Zheng, L., Mirsky, G. and T. Mizrahi, "Alternate-Marking Method for Passive and Hybrid Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, January 2018.

Authors' Addresses

Haoyu Song (editor) Huawei 2330 Central Expressway Santa Clara, USA EMail: haoyu.song@huawei.com

Tianran Zhou Huawei 156 Beiqing Road Beijing, 100095, P.R. China EMail: zhoutianran@huawei.com

Zhenbin Li Huawei 156 Beiqing Road Beijing, 100095, P.R. China EMail: lizhenbin@huawei.com

Zhenqiang Li China Mobile No. 32 Xuanwumenxi Ave., Xicheng District Beijing, 100032, P.R. China EMail: lizhenqiang@chinamobile.com

Pedro Martinez-Julia NICT 4-2-1, Nukui-Kitamachi Koganei, Tokyo 184-8795 Japan EMail: pedro@nict.go.jp

Laurent Ciavaglia Nokia Villarceaux, 91460 France EMail: laurent.ciavaglia@nokia.com

Aijun Wang China Telecom Beiqijia Town, Changping District Beijing, 102209, P.R. China EMail: wangaj.bri@chinatelecom.cn