Internet Research Task Force (IRTF)                         R. Krishnan
Internet Draft                                                  Brocade
Category: Informational                              Dilip Krishnaswamy
                                                           IBM Research
                                                            D. R. Lopez
                                                         Telefonica I+D
                                                             Asif Qamar
                                                          Steven Wright
                                                       Norival Figueira

Expires: April 2015                                   November 11, 2014

         NFV Real-time Analytics and Orchestration: Use Cases and
                         Architectural Framework



   One of the key goals of NFV is to optimize the infrastructure
   resource usage while driving operational simplicity. Real-time
   analytics providing insight into various components such as compute
   (e.g. dynamic CPU utilization), storage (e.g. dynamic capacity
   usage), network (e.g. dynamic bandwidth utilization), energy (e.g.
   dynamic power consumption) is key to not only providing visibility
   into the NFV infrastructure and thus driving operational simplicity
   but also optimizing resource usage for the purposes of
   orchestration. This draft focuses on use cases and an architectural
   framework for real-time analytics and orchestration, including Big
   Data predictive analytics, for addressing the aforementioned goals.

1. Introduction

   Operator Network Function Virtualization Infrastructure Point-of-
   Presence (NFVI-PoP) locations [ETSI-NFV-TERM] often have capacity,
   energy and other constraints. Thus, optimizing overall resource
   usage is an important requirement [ETSI-NFV-REQ]. The general case
   must consider a distributed (elastic) VNF NFVI platform
   implementation where VMs running for different VNFs (with different
   characteristics) can co-exist in the same physical server. This case
   must address the goal of optimizing overall resource usage through
   mechanisms like bin-packing [BIN-PACK].
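
   As a rough illustration of the bin-packing idea, the sketch below
   applies the first-fit decreasing heuristic to a single resource
   dimension. The function name, the VM names, and the demand and
   capacity figures are hypothetical and do not come from
   [BIN-PACK]; a real placement would have to consider several
   resources and constraints at once.

```python
from typing import Dict, List


def first_fit_decreasing(vm_demands: Dict[str, float],
                         server_capacity: float) -> List[List[str]]:
    """Place VMs on servers using the first-fit decreasing heuristic.

    vm_demands maps a (hypothetical) VM name to its demand for one
    resource, for example vCPUs; every server has the same capacity.
    """
    servers: List[List[str]] = []   # VM names placed on each server
    free: List[float] = []          # remaining capacity of each server

    # Consider the largest VMs first, then use the first server that fits.
    for name, demand in sorted(vm_demands.items(),
                               key=lambda item: item[1], reverse=True):
        if demand > server_capacity:
            raise ValueError(f"{name} does not fit on any server")
        for i, remaining in enumerate(free):
            if demand <= remaining:
                servers[i].append(name)
                free[i] -= demand
                break
        else:
            # No server in use has room, so bring another one into use.
            servers.append([name])
            free.append(server_capacity - demand)
    return servers


placement = first_fit_decreasing(
    {"fw-1": 6, "lb-1": 5, "fw-2": 4, "dpi-1": 3, "lb-2": 2}, 10)
print(placement)
```

   In this example the five VMs end up on two servers, so any
   further servers can stay unused.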

   In this context, some of the important challenges faced are:

     .  Performance issues due to noisy neighbor effect, where a VM
        running for a VNF can affect the VM(s) running for another VNF.

     .  Security issues, especially due to inconsistent configuration
        in a dynamic environment where one VNF could affect others.

     .  Energy Efficiency given that servers have substantial idle
        power usage.

     .  Resources (compute, bandwidth, storage) consumed by the
        real-time analytics itself, relative to the VNF payload
        resource usage.

   The purpose of this document is two-fold. First, it intends to
   discuss various use cases to describe the above challenges. Second,
   it will depict an architectural framework for real-time analytics
   and orchestration, applicable to the above use cases in a multi-
   vendor environment.

   For the purposes of real-time analytics for orchestration, various
   metrics need to be collected, stored and analyzed.

   Metrics collection: Metric collection may occur at different periods
   during the lifecycle of the VNF.  Metric collection during an
   onboarding process in a controlled load configuration may provide a
   baseline for characterization of "normal" operational performance.
   Such baseline characterizations may be useful for detecting "out of
   normal" performance at a later point in the VNF lifecycle.
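
   One possible way to use such a baseline is sketched below: the
   onboarding samples are summarized by their mean and standard
   deviation, and a later sample is flagged when it lies too many
   standard deviations away. The three-sigma default, the function
   names and the sample values are assumptions made for this
   illustration only.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize onboarding samples as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_out_of_normal(value: float, baseline: Tuple[float, float],
                     max_sigma: float = 3.0) -> bool:
    """Flag a sample lying more than max_sigma deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat baseline: any different value is unusual.
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Hypothetical CPU utilization (%) measured under a controlled load.
onboarding_cpu = [40.0, 42.0, 38.0, 41.0, 39.0]
baseline = build_baseline(onboarding_cpu)

print(is_out_of_normal(41.0, baseline))
print(is_out_of_normal(55.0, baseline))
```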

   Metrics storage: It is recommended to store and analyze metrics
   locally to minimize the costs of backhaul to remote locations.

   Metrics analysis: The assumption here is that the metrics to be
   collected for analysis would be VNF independent in the sense that
   they would apply regardless of the type of VNF.  Metrics that are
   specific to particular types of VNF are more appropriate for service
   specific diagnostics.

2. Real-time Analytics and Orchestration Use Cases

   A real-time analytics application periodically collects metrics
   (also called information in this document) from individual VMs,
   VNFs, physical servers, network elements etc. regarding various sub-
   systems such as compute (e.g. dynamic CPU utilization), storage
   (e.g. dynamic capacity usage), network (e.g. dynamic bandwidth
   utilization), energy (e.g. dynamic power consumption) through
   polling. The real-time analytics application computes the average
   utilization for VMs, VNFs, physical servers, networks etc. regarding
   the various sub-systems such as compute (e.g. average CPU
   utilization), storage (e.g. average capacity usage), network (e.g.
   average bandwidth utilization), energy (e.g. average power
   consumption).

   Using the average utilization information, the real-time analytics
   application provides real-time visibility into the operating point
   of the VNF in the NFV Node thus driving operational efficiency.

   The NFV orchestrator uses the average utilization information from
   the real-time analytics application to determine the appropriate
   time to scale up/down the running software instances. Typically the
   thresholds for scale up/down are manually programmed into the
   system; this may not be optimal for performance, since workloads and
   deployment scenarios can vary substantially.
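
   The interaction between averaged utilization and manually
   programmed thresholds can be sketched as follows. The class name,
   the window length and the 80%/30% thresholds are hypothetical; the
   sketch only shows a moving average compared against fixed limits,
   not any particular orchestrator's behavior.

```python
from collections import deque


class ScalingAdvisor:
    """Compare a moving average of utilization with fixed thresholds."""

    def __init__(self, window: int, scale_up_at: float,
                 scale_down_at: float) -> None:
        if scale_down_at >= scale_up_at:
            raise ValueError("scale_down_at must be below scale_up_at")
        self.samples = deque(maxlen=window)   # most recent polled samples
        self.scale_up_at = scale_up_at
        self.scale_down_at = scale_down_at

    def observe(self, utilization: float) -> str:
        """Record one polled sample and return a scaling decision."""
        self.samples.append(utilization)
        if len(self.samples) < self.samples.maxlen:
            return "hold"                     # not enough history yet
        average = sum(self.samples) / len(self.samples)
        if average > self.scale_up_at:
            return "scale-up"
        if average < self.scale_down_at:
            return "scale-down"
        return "hold"


advisor = ScalingAdvisor(window=3, scale_up_at=80.0, scale_down_at=30.0)
decisions = [advisor.observe(u) for u in [70, 85, 95, 90, 20, 10, 5]]
print(decisions)
```

   Keeping the two thresholds apart gives some hysteresis, so a
   workload hovering near one limit does not cause repeated scaling
   in both directions.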

   In addition, predictive analytics based on machine learning
   techniques [MACHINE-LEARNING-BOOK] can be used by the real-time
   analytics application to automatically determine the appropriate
   thresholds for scaling up/down the running software instances for
   differing workloads including events related to social behavior
   (think of a YouTube video going viral) and deployment scenarios.
   This information can be used by the orchestrator for optimizing
   overall performance and maximizing energy efficiency. The energy
   savings arise because, with appropriate thresholds for scale
   up/down, the workloads can be consolidated onto a minimum set of
   physical resources, so that the remaining unused physical resources
   can be powered off completely to avoid any idle power consumption.
   [SPEC-BENCHMARK] analyzes the power profile of physical servers from
   various vendors; the active idle power consumption of physical
   servers could be as much as 30%.
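
   The effect of consolidation can be illustrated with simple
   arithmetic. The sketch below assumes a linear power model and a
   hypothetical server that draws 300 W at full load and 90 W when
   idle (reading the 30% figure as a fraction of peak power, which is
   an assumption). None of these numbers are taken from
   [SPEC-BENCHMARK].

```python
from typing import Optional, Sequence


def server_power(utilization: float, idle_watts: float,
                 peak_watts: float) -> float:
    """Linear model (an assumption): idle draw plus a load-proportional part."""
    return idle_watts + (peak_watts - idle_watts) * utilization


def fleet_power(utilizations: Sequence[Optional[float]], idle_watts: float,
                peak_watts: float) -> float:
    """Total draw in watts; None marks a server that is powered off."""
    return sum(server_power(u, idle_watts, peak_watts)
               for u in utilizations if u is not None)


PEAK_WATTS = 300.0   # hypothetical server at full load
IDLE_WATTS = 90.0    # assumed to be 30% of peak

# The same total load, spread over four servers or packed onto one.
spread = fleet_power([0.25, 0.25, 0.25, 0.25], IDLE_WATTS, PEAK_WATTS)
consolidated = fleet_power([1.0, None, None, None], IDLE_WATTS, PEAK_WATTS)
saving = spread - consolidated
print(spread, consolidated, saving)
```

   Under these assumptions the spread-out placement draws 570 W and
   the consolidated one 300 W; the 270 W difference is exactly the
   idle power of the three servers that were powered off.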

2.1. Enhancements to Real-time Analytics Application

2.1.1. Distributed Predictive Analytics

   A real-time analytics application could be notified of significant
   events by individual running software instances of VMs, VNFs etc. or
   by infrastructure elements such as physical servers, hypervisors
   etc. This helps reduce the rate of polling by the real-time
   analytics application and also helps in reacting to significant
   events such as overload much faster. The challenge in this case is
   to determine the appropriate thresholds (e.g. average power
   consumption has been higher than x Watts for t seconds) for event
   notification.
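
   A threshold of the form "higher than x Watts for t seconds" could
   be evaluated locally by the monitored element, roughly as sketched
   below. The class and parameter names and the sample values are
   hypothetical; choosing x and t well is the open problem described
   above and is not addressed by this sketch.

```python
from typing import Optional


class SustainedThresholdNotifier:
    """Signal once when a metric stays above a threshold long enough."""

    def __init__(self, threshold_watts: float, hold_seconds: float) -> None:
        self.threshold_watts = threshold_watts        # "x" in the text
        self.hold_seconds = hold_seconds              # "t" in the text
        self.above_since: Optional[float] = None      # start of current excursion
        self.notified = False                         # already signalled?

    def sample(self, timestamp: float, avg_watts: float) -> bool:
        """Return True when an event notification should be sent."""
        if avg_watts <= self.threshold_watts:
            # Back at or below the threshold: reset the excursion state.
            self.above_since = None
            self.notified = False
            return False
        if self.above_since is None:
            self.above_since = timestamp
        if (not self.notified
                and timestamp - self.above_since >= self.hold_seconds):
            self.notified = True
            return True
        return False


notifier = SustainedThresholdNotifier(threshold_watts=250.0, hold_seconds=10.0)
readings = [(0, 260), (5, 270), (10, 265), (15, 240), (20, 255), (25, 260)]
events = [notifier.sample(t, watts) for t, watts in readings]
print(events)
```

   Only the third reading triggers a notification, because it is the
   first point at which the value has stayed above 250 W for 10
   seconds; the dip at t=15 resets the timer.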

   Predictive analytics engines which use machine learning techniques
   [MACHINE-LEARNING-BOOK] can be used to determine the appropriate
   thresholds per running software instance and infrastructure element
   for different workloads and deployment scenarios. These predictive
   analytics engines can run in various nodes in the infrastructure in
   a distributed predictive analytics architectural framework.

2.1.2. Detecting Noisy Neighbors

   In the context of multiple VNFs, "Noisy Neighbor Effect" could be
   defined as follows: the VM running for one VNF can affect the
   performance of a VM running for another VNF in the case where they
   are using the same physical resources (physical servers, physical
   network elements). A real-time analytics application could help in
   detecting and mitigating the noisy neighbor effect. A good example
   is the case where the VMs running for two VNFs share the same
   physical server, are memory access intensive (load balancers,
   firewalls etc.) and have correlated memory access patterns for the
   given workload and deployment scenario.

   Real-time big data analytics techniques [RT-ANALYTICS-BOOK] can be
   used by the analytics application to determine such correlation
   patterns which can affect performance in real-time. Additionally,
   predictive analytics based on machine learning techniques [MACHINE-
   LEARNING-BOOK] can be used to predict the frequency and duration of
   such correlation patterns. This information can be used to create
   dynamic anti-affinity rules for VM placement and migration including
   redundancy considerations - e.g. VMs of VNF "A" cannot co-exist with
   VMs of VNF "B".
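
   As a minimal sketch of this idea, the example below computes the
   Pearson correlation between per-VNF memory access time series and
   proposes an anti-affinity pair when the correlation is high. The
   VNF names, the sample values and the 0.8 cut-off are hypothetical,
   and plain correlation stands in here for the richer techniques
   described in [RT-ANALYTICS-BOOK].

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0   # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def anti_affinity_pairs(series_by_vnf: Dict[str, Sequence[float]],
                        min_correlation: float = 0.8
                        ) -> List[Tuple[str, str]]:
    """Pairs of VNFs whose access patterns are strongly correlated."""
    return [(a, b)
            for a, b in combinations(sorted(series_by_vnf), 2)
            if pearson(series_by_vnf[a], series_by_vnf[b]) >= min_correlation]


# Hypothetical memory accesses per interval for three VNFs.
memory_access = {
    "vnf-a": [10, 50, 20, 60, 15],
    "vnf-b": [12, 55, 18, 65, 20],
    "vnf-c": [40, 10, 45, 12, 38],
}
rules = anti_affinity_pairs(memory_access)
print(rules)
```

   Here only "vnf-a" and "vnf-b" peak at the same times, so they are
   the only pair proposed for separate placement.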

2.1.3. Addressing security issues due to inconsistent configuration

   NFV configuration is expected to be dynamic, especially in the edge
   NFV PoPs where capacity is limited; a very good example is handling
   a viral event such as a mobile gaming application. While autonomic
   networking techniques could be used to automate the configuration
   process including modular updates, it is important to take into
   account that incomplete and/or inconsistent configuration may lead
   to security issues. Distributed VNF implementations (e.g. VMs of a
   single VNF that span different physical servers) typically use an
   eventually consistent configuration model [CAP-THEOREM] for
   scalability reasons -- this poses additional security challenges.

   Real-time analytics techniques [RT-ANALYTICS-BOOK] can be used by
   the analytics application to determine communication pattern
   anomalies due to incomplete and/or inconsistent configuration in
   real-time by analyzing event logs. Additionally, predictive
   analytics based on machine learning techniques [MACHINE-LEARNING-
   BOOK] can be used to predict the frequency and duration of such
   communication pattern anomalies. A simple example is a flow-specific
   firewall rule which never got installed due to reasons such as
   control plane messaging issues, data plane table full condition etc.
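
   The firewall example can be sketched as a scan over an event log
   for rules that were requested but never confirmed as installed.
   The event types and field names below are hypothetical and do not
   correspond to any particular controller or log format.

```python
from typing import Dict, Iterable, List


def rules_never_installed(events: Iterable[Dict[str, str]]) -> List[str]:
    """Return IDs of rules that were requested but never confirmed."""
    pending = set()
    for event in events:
        if event["type"] == "rule_requested":
            pending.add(event["rule_id"])
        elif event["type"] == "rule_installed":
            pending.discard(event["rule_id"])
        # Other event types (e.g. a table-full report) confirm nothing.
    return sorted(pending)


# Hypothetical log: flow-18 hits a full data plane table and is lost.
event_log = [
    {"type": "rule_requested", "rule_id": "flow-17"},
    {"type": "rule_requested", "rule_id": "flow-18"},
    {"type": "rule_installed", "rule_id": "flow-17"},
    {"type": "table_full", "rule_id": "flow-18"},
]
missing = rules_never_installed(event_log)
print(missing)
```

   A rule that stays in the pending set is a candidate for the kind
   of inconsistency described above and could be reported to the
   orchestrator for repair.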

   [ETSI-NFV-WHITE] "ETSI NFV White Paper,"

   [ETSI-NFV-REQ] "ETSI NFV Virtualization Requirements,"

   [ETSI-NFV-ARCH] "ETSI NFV Architectural Framework,"

   [ETSI-NFV-TERM] "Terminology for Main Concepts in NFV,"

   [OPENSTACK] "OpenStack Open Source Software,"

   [OPENSTACK-CONGRESS-POLICY-ENGINE] "A policy as a service open
   source project in OpenStack,"

   [NFV-MANO-SPEC] "NFV Management and Orchestration Framework

   [BIN-PACK] Coffman, Jr., E., M. Garey, and D. Johnson. Approximation
   Algorithms for Bin-Packing -- An Updated Survey. In Algorithm Design
   for Computer System Design, ed. by Ausiello, Lucertini, and
   Serafini. Springer-Verlag, 1984.

   [SPEC-BENCHMARK] "SPEC Benchmark Results: HP Proliant DL380p Rack

   [CAP-THEOREM] Eric Brewer, "CAP Twelve Years Later: How the 'Rules'
   Have Changed," IEEE Computer, Volume 45, Issue 2 (2012), pp. 23-29.

   [MACHINE-LEARNING-BOOK] Ian H. Witten et al., "Data Mining:
   Practical Machine Learning Tools and Techniques, Third Edition,"
   Morgan Kaufmann, 2011

   [RT-ANALYTICS-BOOK] Byron Ellis, "Real-Time Analytics: Techniques to
   Analyze and Visualize Streaming Data," Wiley, 2014

Authors' Addresses

   Ram (Ramki) Krishnan
   Brocade Communications

   Dilip Krishnaswamy
   IBM Research

   Diego Lopez
   Telefonica I+D
   Don Ramon de la Cruz, 82
   Madrid, 28006, Spain
   +34 913 129 041

   Asif Qamar

