Network Working Group                                              Q. Wu
Internet-Draft                                               J. Strassner
Intended status: Informational                                     Huawei
Expires: September 10, 2016                                     A. Farrel
                                                       Old Dog Consulting
                                                                  L. Zhang
                                                                    Huawei
                                                             March 9, 2016
Network Telemetry and Big Data Analysis
draft-wu-t2trg-network-telemetry-00
This document focuses on network measurement and analysis. It first defines network telemetry, then describes an exemplary network telemetry architecture and explores the characteristics of network telemetry data. It ends by detailing a set of issues encountered when retrieving and processing network telemetry data.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 10, 2016.
Copyright (c) 2016 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Today, billions of devices connect to the Internet and to VPNs, forming a rich ecosystem of connectivity. Daily life has also changed greatly, with a large number of IoT and mobile applications built on top of this connectivity (e.g., smart tags on everyday objects, wearable health-monitoring sensors, smartphones, intelligent cars, and smart home appliances). However, the growing number of connected devices and the proliferation of web and multimedia services also place a heavy burden on the network. Examples include:
Therefore, the network may be subject to more frequent incidents and unregulated changes. Without better network visibility, or a good view of the available network resources and network topology, it is not easy to:
In this document, we first define network telemetry in the context of the network environment, followed by an exemplary architecture for collecting and processing telemetry data. We then explore the characteristics of network telemetry data and end by describing a set of issues encountered when retrieving and processing such data.
Network Telemetry describes how information from various data sources is collected using a set of automated communication processes and transmitted to one or more receiving devices for analysis. Analysis tasks may include event correlation, anomaly detection, performance monitoring, metric calculation, trend analysis, and other related processes.
A Network Telemetry architecture describes how different types of Network Telemetry data are transmitted from different network sources and received by different collection entities. In an ideal network telemetry architecture, the ability to collect data should be independent of any specific application and of vendor limitations. This means that protocol and data format translation are needed, so that a normalized form of the data can be used to simplify the various analysis and processing tasks.
The Network Telemetry architecture is made up of the following three key functional components:
Figure 1 shows an exemplary architecture for network telemetry and analysis.
                        +----------------------+
                        | Policy-based Manager |
                        +----------+-----------+
                                  / \
                                   |
         +-------------------------+-----------------------+
         |                         |                       |
        \ /                       \ /                     \ /
+----------------+        +--------+-----------+      +----+-----+
| Data Analyzer, |/      \| Data Fusion,       |/    \| Decision |
| Normalizer,    +--------+ Analytics,         +------+ Logic    |
| Filter, etc.   |\      /| and other Apps     |\    /| and Apps |
+--------+-------+        +---+------------+---+      +----------+
        / \                  / \          / \
         |                    |            |
        \ /                  \ /          \ /
+--------+-------------+ +----+----+  +----+----+
| Data Abstraction and | |  Other  |  |  Other  |
| Modeling Software    | | OT Data |  | IT Data |
+------+--------+------+ +---------+  +---------+
      / \      / \
       |        |
      \ /      \ /
  +----+--------+-----+
  |  Data Collectors  |
  +----+---------+----+
      / \       / \
       |         |
       |        \ /
       |    +----+------------+        +-----------+
       |    | Edge Software   |/      \| Temporary |
       |    | (analysis &     +--------+ Data      |
       |    | transformation) |\      /| Storage   |
       |    +------------+----+        +-----------+
       |                / \
       |                 |
      \ /               \ /
+------+-------+ +-------+------+
| Data Sources | | Data Sources |
+--------------+ +--------------+
Figure 1: Network Telemetry and Analysis Architecture
This reference architecture assumes that Data Collectors can choose different measurement data formats to gather measurement data, and different protocols to transmit that data; the Data Abstraction and Modeling Software normalizes the collected data into a common form. Both the Data Collector and the Data Analyzer may support data filtering, correlation, and other types of data processing. In the above architecture, bi-directional communication is shown for generality. This may be implemented in a number of different ways, such as a request-response mechanism, a publish-subscribe mechanism, or even a set of uni-directional (e.g., push and pull) requests.
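The following minimal sketch (in Python, with purely illustrative class and field names) contrasts the two interaction styles mentioned above: a collector pushing records to subscribers versus an analyzer pulling the latest record on request. It is an assumption-laden illustration, not part of any defined protocol.

   # Illustrative only: in-process stand-ins for the push (publish-
   # subscribe) and pull (request-response) styles described above.

   class TelemetryBus:
       """Publish-subscribe: records are pushed to subscribers."""
       def __init__(self):
           self.subscribers = {}       # topic -> list of callbacks

       def subscribe(self, topic, callback):
           self.subscribers.setdefault(topic, []).append(callback)

       def publish(self, topic, record):
           for callback in self.subscribers.get(topic, []):
               callback(record)

   class DataCollector:
       """Request-response: the latest record is pulled on demand."""
       def __init__(self):
           self.latest = {}

       def store(self, topic, record):
           self.latest[topic] = record

       def request(self, topic):
           return self.latest.get(topic)

   if __name__ == "__main__":
       bus = TelemetryBus()
       bus.subscribe("ifcounters", lambda rec: print("pushed:", rec))
       bus.publish("ifcounters", {"device": "r1", "in_octets": 1200})

       collector = DataCollector()
       collector.store("ifcounters", {"device": "r1", "in_octets": 1300})
       print("pulled:", collector.request("ifcounters"))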
Measurement data is generated from different data sources, and has varying characteristics, including (but not limited to):
Today, the existing data-fetching methods (see Appendix B) prove insufficient due to the following factors:
In addition, such data fetching adds significant load on the participating networks, devices, and applications.
Quality of Service (QoS) and Quality of Experience (QoE) assessment [RFC7266] of multimedia services has been well studied in ITU-T SG 12. Media quality is commonly expressed in terms of the Mean Opinion Score (MOS) [RFC3611][G107]. MOS is typically rated on a scale from 1 to 5, in which 5 represents excellent and 1 represents unacceptable. When multimedia application quality degrades, it is hard to know whether this is a network problem or an application-specific problem (e.g., codec type, coding bit rate, packetization scheme, loss recovery technique, or the interaction between transport problems and application-layer protocols). To rule out network problems, or to know how serious a network event or interruption is, a network health index, network Key Performance Indicator (KPI), or Key Quality Indicator (KQI) becomes important.
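As a concrete illustration of the MOS scale mentioned above, the following sketch applies the E-model mapping from ITU-T G.107, which converts a transmission rating factor R into an estimated MOS; the sample R values are arbitrary.

   # Illustrative only: the ITU-T G.107 E-model mapping from the
   # rating factor R to an estimated MOS on the 1..5 scale.

   def r_to_mos(r):
       """Map an E-model R factor (0..100) to an estimated MOS."""
       if r <= 0:
           return 1.0
       if r >= 100:
           return 4.5
       return 1.0 + 0.035 * r + 7.0e-6 * r * (r - 60.0) * (100.0 - r)

   if __name__ == "__main__":
       for r in (20, 50, 70, 80, 90):
           print("R = %3d  ->  estimated MOS = %.2f" % (r, r_to_mos(r)))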
However, QoS/QoE assessment of network services, whether or not they depend on the underlying network technology (e.g., MPLS, IP), is not well studied or defined by any standards body or organization. The QoS/QoE of generic network services requires a set of appropriate network performance, reliability, or other metric definitions. These may take the form of key quality and/or performance indicators, ranging from high-level metrics (e.g., dropped calls) to low-level metrics (e.g., packet loss, delay, and jitter). IP service performance parameters are defined in ITU-T Y.1540 [Y1540]; however, these existing network performance metrics are proving insufficient due to several factors:
The data format is typically vendor- and device-specific. This also means that different commands, with different syntax and semantics, carried over different protocols, may have to be issued to retrieve the same type of data from different devices.
The Data Analyzer may need to ingest data in a specific format that is not supported by the Data Collectors that serve it. For example, the ALTO data format used between a data source and a Data Collector carries an abstracted network topology and provides it to network-aware applications (i.e., a Data Analyzer) over a web-service-based API [I-D.wu-alto-te-metrics]. In this case, prefix data in the network topology information needs to be converted into ALTO Network Maps, and TE (topology) data needs to be converted into ALTO Cost Maps. To provide better data format mapping, the ALTO Network Map and Cost Map need to be modeled in the same way as the prefix data and TE data in the network topology information. However, these data use different data formats and do not have a common model structure to represent them in a consistent way.
This is why the architecture shown in Figure 1 has a "Data Abstraction and Modeling Software" component. This component normalizes all received data into a common format for analysis and processing by the Data Analyzer. If this component were not present, the Data Analyzer would have to deal with, at a minimum, m vendor devices times n software versions per device. Furthermore, different protocols have different capabilities and may or may not be able to transmit and receive different types of data. The Data Abstraction and Modeling Software component can provide information that defines the structure of the data that should be received; this is useful for detecting incomplete as well as missing collection data.
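The sketch below illustrates, with hypothetical vendor field names, the kind of normalization such a component performs: vendor- and protocol-specific records are mapped into one common form before they reach the Data Analyzer.

   # Illustrative only: two vendor-specific interface-counter records
   # are normalized into one common representation.

   def normalize_vendor_a(record):
       # e.g., a CLI/XML-derived record (field names assumed)
       return {"device": record["hostname"],
               "interface": record["ifName"],
               "rx_octets": int(record["input-bytes"]),
               "tx_octets": int(record["output-bytes"])}

   def normalize_vendor_b(record):
       # e.g., an SNMP IF-MIB derived record
       return {"device": record["sysName"],
               "interface": record["ifDescr"],
               "rx_octets": int(record["ifHCInOctets"]),
               "tx_octets": int(record["ifHCOutOctets"])}

   NORMALIZERS = {"vendor-a": normalize_vendor_a,
                  "vendor-b": normalize_vendor_b}

   def normalize(source_type, record):
       return NORMALIZERS[source_type](record)

   if __name__ == "__main__":
       a = {"hostname": "r1", "ifName": "ge-0/0/0",
            "input-bytes": "1000", "output-bytes": "2000"}
       b = {"sysName": "r2", "ifDescr": "GigabitEthernet0/0",
            "ifHCInOctets": "3000", "ifHCOutOctets": "4000"}
       print(normalize("vendor-a", a))
       print(normalize("vendor-b", b))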
To provide consistent configuration, reporting, and representation of OAM information, the LIME YANG model [I-D.draft-ietf-lime-yang-oam-model-01] is proposed to correlate defects, faults, and network failures across the different layers, regardless of network technology. This helps improve the efficiency of fault detection and localization and provides better OAM visibility.
Today we see large amounts of data collected from different data sources. These data can be network log data, network event data, network performance data, network fault data, network statistics state, or network operation state. However, these data are only meaningful if they are correlated in time and space. In particular, useful trend analysis and anomaly detection depend on proper correlation of the data collected from the different Data Sources. In addition, correlating different types of data from different Data Sources in time or space can provide better network visibility, but such correlation is still a challenging issue.
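A minimal sketch of such correlation, assuming a fixed correlation window and per-device grouping, is shown below; real deployments would use far richer keys (location, layer, service) and much larger data volumes.

   # Illustrative only: group events from different Data Sources into
   # (time window, device) buckets so that, e.g., a fault and a
   # performance degradation on the same device can be examined together.

   from collections import defaultdict

   WINDOW = 60  # seconds; an assumed correlation window

   def correlate(events):
       buckets = defaultdict(list)
       for ev in events:
           key = (int(ev["time"]) // WINDOW, ev["device"])
           buckets[key].append(ev)
       return buckets

   if __name__ == "__main__":
       events = [
           {"time": 1000, "device": "r1", "type": "fault",
            "msg": "link down"},
           {"time": 1030, "device": "r1", "type": "perf", "loss": 0.12},
           {"time": 1700, "device": "r2", "type": "log",
            "msg": "config change"},
       ]
       for (window, device), evs in sorted(correlate(events).items()):
           print(window, device, [e["type"] for e in evs])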
When retrieving data from Data Sources or Data Collectors, synchronizing the same type of data between a Data Source and a Data Collector, or between a Data Collector and a Data Analyzer, is a complicated task.
The reference architecture of Figure 1 defines a "Policy-based Manager" to manage how, when, where, and by which devices the set of data is collected. This component provides mechanisms that help ensure that the needed information is collected by the appropriate components of the Network Telemetry Architecture. It also facilitates the synchronization of the different components that make up the Network Telemetry Architecture, since these are likely distributed throughout one or more networks.
It also provides a mechanism for the Data Analyzer, or other applications (e.g., the "Data Fusion, Analytics, and other Apps", as well as the "Decision Logic and Apps" components in Figure 1) to provide information to the Policy-based Manager in the form of feedback (e.g., see [I-D.draft-strassner-anima-control-loops-01]).
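The following sketch (entirely hypothetical policy logic) illustrates the shape of that feedback path: the analyzer reports anomalies, and the Policy-based Manager tightens or relaxes the collection policy accordingly.

   # Illustrative only: feedback from analysis adjusts the collection
   # policy (here, simply the polling interval).

   class PolicyBasedManager:
       def __init__(self):
           self.polling_interval = 300   # seconds, assumed default policy

       def feedback(self, report):
           # Tighten collection when anomalies are reported, relax it
           # again when the network looks healthy.
           if report.get("anomaly"):
               self.polling_interval = max(30, self.polling_interval // 2)
           else:
               self.polling_interval = min(300, self.polling_interval * 2)

   if __name__ == "__main__":
       mgr = PolicyBasedManager()
       for report in ({"anomaly": True}, {"anomaly": True},
                      {"anomaly": False}):
           mgr.feedback(report)
           print("polling interval now", mgr.polling_interval, "seconds")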
+-----------------------------+-----------------------------+
| Data Source Category        | Information                 |
+-----------------------------+-----------------------------+
| Network Data                | Usage records               |
|                             | Performance Monitoring Data |
|                             | Fault Monitoring Data       |
|                             | Real Time Traffic Data      |
|                             | Real Time Statistics Data   |
|                             | Network Configuration Data  |
|                             | Provision Data              |
+-----------------------------+-----------------------------+
| Subscriber Data             | Profile Data                |
|                             | Network Registry            |
|                             | Operation Data              |
|                             | Billing Data                |
+-----------------------------+-----------------------------+
| Application Data derived    | Traffic Analysis            |
| from interfaces, channels,  | Web, Search, SMS, Email     |
| software, etc.              | Social Media Data           |
|                             | Mobile apps                 |
+-----------------------------+-----------------------------+
There are three typical log data collection methods:
Text-based log data is designed for low-speed networks, so the amount of IoT data cannot be too large. It can only be parsed by network personnel with the experience to define this kind of log. The log data can be transferred either by email or via FTP; the differences between using email and using FTP are:
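Separately from the email/FTP comparison above, the following sketch illustrates the FTP transfer option using only the Python standard library; the host, credentials, and file names are placeholders.

   # Illustrative only: fetch one text log file over FTP.

   from ftplib import FTP

   def fetch_log(host, user, password, remote_name, local_name):
       ftp = FTP(host)
       try:
           ftp.login(user, password)
           with open(local_name, "wb") as out:
               ftp.retrbinary("RETR " + remote_name, out.write)
       finally:
           ftp.quit()

   if __name__ == "__main__":
       # Placeholder values; replace with a real log server.
       fetch_log("ftp.example.com", "operator", "secret",
                 "device1-20160309.log", "device1-20160309.log")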
SNMP Trap is a notification mechanism that enables an agent to notify the management system of significant events by way of an unsolicited SNMP message. When there are a large number of devices and each device has a large number of objects, SNMP Traps are a more efficient way to get the data than polling every object on every device.
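The sketch below shows only the transport side of a trap receiver: it listens on the standard SNMP trap port and reports the size and source of each incoming datagram. Decoding the BER-encoded trap PDUs themselves requires an SNMP library and is out of scope here.

   # Illustrative only: receive raw trap datagrams on UDP port 162.

   import socket

   def receive_traps(bind_addr="0.0.0.0", port=162):
       sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
       sock.bind((bind_addr, port))   # usually needs root privileges
       while True:
           datagram, source = sock.recvfrom(65535)
           print("trap datagram of %d bytes from %s"
                 % (len(datagram), source[0]))

   if __name__ == "__main__":
       receive_traps()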
The syslog protocol is used to convey event notification messages and allows the use of any number of transport protocols for the transmission of syslog messages. It is widely used in network devices (e.g., switches and routers).
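A minimal sketch of a syslog receiver over UDP is shown below; it extracts the facility and severity from the leading PRI field (PRI = facility * 8 + severity) and prints the raw message.

   # Illustrative only: minimal syslog-over-UDP receiver.

   import socket

   def parse_pri(message):
       if message.startswith("<") and ">" in message[:5]:
           pri = int(message[1:message.index(">")])
           return pri // 8, pri % 8          # (facility, severity)
       return None, None

   def receive_syslog(bind_addr="0.0.0.0", port=514):
       sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
       sock.bind((bind_addr, port))          # usually needs root privileges
       while True:
           data, source = sock.recvfrom(65535)
           text = data.decode("utf-8", errors="replace")
           facility, severity = parse_pri(text)
           print(source[0], "facility", facility,
                 "severity", severity, text)

   if __name__ == "__main__":
       receive_syslog()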
Network traffic collection is a process of exporting network traffic flow information from routers, probes, and other devices. It is not concerned with the operational state of the network device, but with the traffic flow characteristics on the links between any two adjacent network devices. Take IPFIX as an example: it is widely adopted in routers and switches to provide IP traffic flow information to the network management system.
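As an illustration, the sketch below decodes just the fixed 16-octet IPFIX message header defined in RFC 7011; decoding the sets and records that follow requires the template mechanism and is omitted.

   # Illustrative only: parse the fixed IPFIX message header (RFC 7011).

   import struct

   def parse_ipfix_header(message):
       version, length, export_time, sequence, domain_id = struct.unpack(
           "!HHIII", message[:16])
       return {"version": version, "length": length,
               "export_time": export_time, "sequence": sequence,
               "observation_domain_id": domain_id}

   if __name__ == "__main__":
       # A fabricated 16-octet header purely for illustration.
       sample = struct.pack("!HHIII", 10, 16, 1457481600, 42, 1)
       print(parse_ipfix_header(sample))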
Network performance collection is a process of exporting network performance information from routers, probes, and other devices. The network performance information can be used to assess the quality, performance, and reliability of data delivery services and applications running over the network. It can also be used to verify the traffic contract agreed between the user and the network service provider. Measurement mechanisms defined in the IPPM WG, as well as OAM technology and OAM tools, can be used to perform performance measurement.
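The sketch below derives two simple performance metrics from a list of per-packet one-way delay samples: the mean delay and the packet-to-packet delay variation (in the spirit of the IPPM IPDV metric); the sample values are invented.

   # Illustrative only: basic delay and delay-variation calculations.

   def mean_delay(delays):
       return sum(delays) / len(delays)

   def ipdv(delays):
       # Difference between consecutive one-way delay samples.
       return [b - a for a, b in zip(delays, delays[1:])]

   if __name__ == "__main__":
       delays_ms = [20.1, 20.4, 19.8, 25.0, 20.2]   # example samples
       print("mean delay (ms):", round(mean_delay(delays_ms), 2))
       print("delay variation (ms):",
             [round(v, 2) for v in ipdv(delays_ms)])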
Network fault collection is a process of exporting network faults, failures, warnings, and defects from routers, probes, and other devices. It usually adopts OAM technology, OAM tools, and OAM models (e.g., SNMP MIBs or NETCONF YANG models) to localize faults and pinpoint fault locations. However, the OAM YANG model is mainly focused on configuring OAM functionality on the network element; how to use the OAM YANG model to collect more data (e.g., warnings, failures, defects) and how to use these data need to be further standardized.
For network topology data collection, routing protocols are an important collection method, since every router needs to propagate its information throughout the whole network. In addition, we can use an NMS/OSS to get network topology data if it has access to the network topology database or to the routing protocols.
Network topology data comprises node information and link information. It can be collected in two typical ways. If the network topology data is within one IGP area or one AS, we can use the IS-IS or OSPF protocol to gather it and write it into the RIB or a topology datastore, and then use the I2RS protocol to read the network topology data. If the network topology data spans several IGP areas or domains, we can use BGP-LS [I-D.ietf-idr-ls-distribution] [I-D.ietf-idr-te-pm-bgp] to collect the network topology data in the different domains and aggregate it in a central network topology database.
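The sketch below illustrates the aggregation step with simplified, hypothetical node and link records: per-domain topologies (as they might be learned via an IGP or BGP-LS) are merged into one central topology database.

   # Illustrative only: merge per-domain topologies into a central DB.

   def merge_topologies(domains):
       central = {"nodes": {}, "links": []}
       for domain, topo in domains.items():
           for node_id, attrs in topo["nodes"].items():
               central["nodes"][node_id] = dict(attrs, domain=domain)
           central["links"].extend(topo["links"])
       return central

   if __name__ == "__main__":
       as64500 = {"nodes": {"r1": {"area": "0"}, "r2": {"area": "0"}},
                  "links": [("r1", "r2", {"te-metric": 10})]}
       as64501 = {"nodes": {"r3": {"area": "0"}},
                  "links": [("r2", "r3", {"te-metric": 20})]}
       db = merge_topologies({"AS64500": as64500, "AS64501": as64501})
       print(len(db["nodes"]), "nodes,", len(db["links"]), "links")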
To collect and process large volumes of data in real time or near real time, in order to detect subtle events and aid failure diagnosis, we can choose other, more efficient data-fetching tools, e.g., Facebook's Scribe, or Chukwa, which is built on top of the Hadoop file system, to parse structured data out of some of the logs and load it into a datastore.
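The sketch below shows the kind of parsing such tools perform, using an assumed log line format: free-form lines are turned into structured records that can then be loaded into a datastore.

   # Illustrative only: parse structured fields out of log lines; the
   # log format and field names are assumed for illustration.

   import re

   LINE = re.compile(
       r"(?P<time>\S+) (?P<device>\S+) (?P<severity>\w+): (?P<msg>.*)")

   def parse(lines):
       records = []
       for line in lines:
           m = LINE.match(line)
           if m:
               records.append(m.groupdict())
       return records

   if __name__ == "__main__":
       sample = [
           "2016-03-09T10:00:00Z r1 WARNING: interface ge-0/0/0 down",
           "2016-03-09T10:00:05Z r2 INFO: BGP session established",
       ]
       for record in parse(sample):
           print(record)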