Network Working Group Y. Wang
Internet-Draft Huawei
Intended status: Standards Track December 30, 2013
Expires: July 3, 2014
NFV High-Availability Technologies Gap Analysis
draft-wang-nfv-high-availability-gap-analysis-00
Abstract
High availability (HA) has been an important requirement throughout
the history of carrier networks, and many technologies have emerged
to provide it. With the trend toward Network Function Virtualization
(NFV), network functions are migrated from dedicated hardware to
software running on COTS servers, yet the same HA SLA should still be
provided, depending on the network service itself. NFV also brings
new challenges. One example is the Virtualized Network Function
(VNF) cluster, which is needed because of the HA and performance
limitations of an individual VNF instance. For the VNF cluster, gaps
exist between its needs and the available HA technologies on the
issues of multi-homing, state synchronization, share-risk prevention,
and HA role election, especially in networks with a large-scale NFV
deployment.
This document first identifies the challenges that emerge in
NFV-deployed networks. It then reviews the available HA technologies
and analyzes in depth the gaps between them and the new challenges.
Finally, it summarizes these gaps.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on July 3, 2014.
Copyright Notice
Wang Expires July 3, 2014 [Page 1]
Internet-Draft NFV HA Technologies Gap Analysis December 2013
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Terminology
3. New challenges and requirements to NFV HA solution
3.1. New features of NFV
3.2. VNF cluster
3.3. Failure detection and handling
3.4. Consideration of central management server
3.5. Policy enforcement
3.6. NFV HA solution requirements
4. Gaps between available HA solutions and NFV's challenges
4.1. VRRP
4.2. BFD
4.3. APS
4.4. NSR, NSF, SSO and GR
4.5. STP
4.6. FRR
4.7. OAM
5. Summary
6. IANA Considerations
7. Security Considerations
8. Informative References
Author's Address
1. Introduction
For the benefits of reduced operational and capital costs, automated
deployment, and enhanced elasticity, the virtualization of network
functions (e.g. firewall, IPS, load balancer) is expected to be
widely supported in DC and carrier networks. One key issue, however,
is that the HA requirements of NFV need to be reconsidered so that
the same HA SLA is provided, depending on the network service itself.
Considering the reliability requirements, an NFV HA architecture
should support the key functions listed below, each shown with
examples of existing technologies:
o Redundancy mechanism: LACP, VRRP, ECMP, etc;
o Failure detection: IEEE 802.1ag, ITU-T Y.1731, BFD, MPLS(-TP) OAM,
etc;
o Failure notification: APS, etc;
o State synchronization: NSR, NSF, SSO, GR, etc;
o Failure handling (switchover and failover): STP, EAPS, MPLS FRR,
etc.
[VNFP-PS] provides the problem statement and scope analysis for VNF
reliability and high availability issues. A companion draft,
[VNFP-UC], provides an overview of VNF HA use cases.
This document reviews the challenges of NFV and the traditional HA
solutions, and summarizes the gaps between them.
2. Terminology
NFV Network Function Virtualization
VRRP Virtual Router Redundancy Protocol
STP Spanning Tree Protocol
FRR Fast Reroute
BFD Bidirectional Forwarding Detection
APS Automatic Protection Switching
HA High Availability
3. New challenges and requirements to NFV HA solution
As described in [VNFP-PS] and [VNFP-UC], NFV brings new features and
requirements to current networks, which in turn create new
challenges for an NFV HA solution in the following aspects:
3.1. New features of NFV
An NFV network should support high scalability, cost reduction, high
flexibility, and high automation.
o High scalability: An NFV HA solution should support large-scale
networks. To provide HA for a large number of VNFs in a virtual
network, the HA solution needs to be highly efficient and to have
minimal impact on the network. It should also support an automatic
and fast mechanism to discover VNFs and to add them to, or remove
them from, a VNF cluster;
o Cost reduction: One of the intentions of deploying NFV is to
reduce cost. Thus, whether for a VNF cluster or for a
working/protection group deployment, making full use of the members
in active/active mode is the preferred choice, and a 1:N redundancy
mechanism should be the basic requirement rather than 1:1, 1+1 or
1+N. However, multiple active VNFs may cause a performance
problem, because every active VNF needs to synchronize its state
with every other VNF in the same VNF cluster. Full-mesh
connectivity for state synchronization is very complex, and it
consumes enough bandwidth and system resources to introduce
additional delay;
o High flexibility: NFV makes it easy for an operator to scale the
virtualized network up or down and to migrate VNF instances to
other locations on demand. Thus, VNF instances can be highly
distributed across DC networks, carrier networks and even customer
premises, and they can be migrated dynamically. It is hard to pin
VNF VMs to fixed hosts. If the VNF instances of the same VNF
cluster are located on the same host, or if VNF chains cross each
other, the HA SLA is reduced by the share-risk problem. A related
problem concerns the service chain: for HA, the working and backup
service chains should not cross at the same location (e.g. host,
hypervisor, VM, VNF); otherwise, that location becomes a single
point of failure;
o High automation: An NFV-deployed network can be a large-scale,
highly distributed, and highly dynamic network, which means that a
large number of operations and functions must be performed
automatically. Otherwise, the amount of manual configuration and
intervention would be unaffordable. For example, the deployment and
configuration of redundancy mechanisms, failure detection and
switchover, and adaptation to dynamic changes of VNFs should all
be performed automatically and quickly.
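The cost of full-mesh state synchronization mentioned under "Cost
reduction" can be illustrated with simple arithmetic. The sketch
below is not part of any protocol; it only counts synchronization
sessions under two assumed topologies: a full mesh, and a
hypothetical hub arrangement in which each VNF synchronizes with a
single central coordinator. The function names are illustrative.

```python
def full_mesh_sessions(n: int) -> int:
    """Pairwise sync sessions when every VNF syncs with every other VNF."""
    return n * (n - 1) // 2


def hub_sessions(n: int) -> int:
    """Sync sessions when each VNF syncs only with one central coordinator.

    The central coordinator is an assumption made for comparison.
    """
    return n


if __name__ == "__main__":
    # Full mesh grows quadratically, the hub arrangement linearly.
    for n in (2, 8, 64):
        print(n, full_mesh_sessions(n), hub_sessions(n))
```

For a cluster of 64 VNFs, the full mesh needs 2016 sessions while the
hub arrangement needs 64, which is why the growth of the cluster
makes full-mesh synchronization increasingly expensive.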
3.2. VNF cluster
NFV moves running network functions from physical platforms to
virtualized platforms, which usually limits the performance and
reliability of an individual VNF instance. NFV therefore normally
uses a cluster with multiple members to meet the performance and
reliability requirements. As the number of VNFs in a VNF cluster
increases, the following aspects should be considered:
o State synchronization: State synchronization is an essential
function for NFV, especially for the clustering mechanism. Without
it, a VNF cluster cannot work in several scenarios. One example is
when the forward flow and the return flow hit different VNFs.
Another example is switchover. As the number of VNFs in a cluster
increases, full-mesh state synchronization between them becomes
more complex and resource consuming. A general and efficient state
synchronization technology for various kinds of VNFs is needed;
o Dynamic change of VNF members' role: A VNF member's role (active
or, in particular, standby) in a VNF cluster can change dynamically
at run time due to switchover; the NFV HA solution must adapt to it.
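The scenario in which the forward flow and the return flow hit
different VNFs can be shown with a toy model. The following is a
minimal sketch under stated assumptions: the Vnf class, its session
table keyed by (source, destination), and the synchronize() helper
are all hypothetical and do not correspond to any existing VNF
implementation.

```python
class Vnf:
    """Toy stateful VNF that only accepts return traffic for known sessions."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()  # (src, dst) pairs seen in the forward direction

    def forward(self, src, dst):
        # Record session state when the forward flow passes through.
        self.sessions.add((src, dst))

    def accepts_return(self, src, dst):
        # A return packet src -> dst belongs to the session dst -> src.
        return (dst, src) in self.sessions


def synchronize(source, target):
    """Copy session state from one cluster member to another (assumed helper)."""
    target.sessions |= source.sessions


if __name__ == "__main__":
    vnf_a, vnf_b = Vnf("a"), Vnf("b")
    vnf_a.forward("10.0.0.1", "192.0.2.1")
    print(vnf_b.accepts_return("192.0.2.1", "10.0.0.1"))  # no state on b yet
    synchronize(vnf_a, vnf_b)
    print(vnf_b.accepts_return("192.0.2.1", "10.0.0.1"))  # state is now shared
```

Before synchronization the second member rejects the return packet;
after synchronization it accepts it. The same shared state is what a
standby member needs in order to take over during switchover.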
3.3. Failure detection and handling
o Failure detection: In an NFV infrastructure, a failure can happen
at multiple layers: hardware, hypervisor, VM, or VNF instance. The
new failure detection mechanism should therefore be able to detect
failures at all of these layers. The mechanism should be simple,
uniform, protocol independent, and software driven. One candidate
is BFD;
o Failure handling approaches: NFV brings new approaches to failure
handling that traditional networks lack. For instance, scaling up
can resolve overload, and restarting or dynamically relocating a
VNF instance can be used when a VNF fails. New HA solutions should
make full use of these NFV features.
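One way to picture uniform failure detection across layers is a
single timeout check applied to a heartbeat from each layer. The
sketch below is only an illustration: the per-layer heartbeat, the
layer names used as dictionary keys, and the rule that the lowest
silent layer is the probable root cause are all assumptions, not
mechanisms defined by this document.

```python
# Layers ordered from the bottom of the NFV infrastructure to the top.
LAYERS = ("hardware", "hypervisor", "vm", "vnf")


def failed_layers(last_seen, now, timeout):
    """Return the layers whose last heartbeat is older than `timeout` seconds.

    `last_seen` maps a layer name to the time of its last heartbeat.
    A layer that never reported is treated as failed.
    """
    return [
        layer
        for layer in LAYERS
        if now - last_seen.get(layer, float("-inf")) > timeout
    ]


def root_cause(last_seen, now, timeout):
    """Pick the lowest failed layer as the probable root cause (assumption)."""
    failed = failed_layers(last_seen, now, timeout)
    return failed[0] if failed else None


if __name__ == "__main__":
    seen = {"hardware": 10.0, "hypervisor": 10.0, "vm": 7.0, "vnf": 7.0}
    print(failed_layers(seen, now=10.5, timeout=1.0))
    print(root_cause(seen, now=10.5, timeout=1.0))
```

In the usage example the VM and the VNF instance have both gone
silent, so the VM is reported as the probable root cause and the
handling action (for example restart or relocation) can target it.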
3.4. Consideration of central management server
The north bound interfaces of a VNF can be connected to the central
management server to report the VNF's running status. In this way,
the central management server can decide the roles of the VNFs in a
VNF cluster and take part in the switchover mechanism. It can also
be used for state synchronization.
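A central management server that decides roles from reported status
could work along the following lines. This is a hedged sketch: the
report format ("healthy" and "load" fields), the least-loaded-first
selection rule, and the function name assign_roles are hypothetical
choices made for illustration only.

```python
def assign_roles(reports, active_count):
    """Assign 'active', 'standby' or 'failed' to each VNF in a cluster.

    `reports` maps a VNF name to its last status report, here assumed to
    be a dict with a boolean 'healthy' and a numeric 'load'. The
    `active_count` least-loaded healthy VNFs become active; the remaining
    healthy VNFs become standby.
    """
    healthy = sorted(
        (name for name, report in reports.items() if report["healthy"]),
        key=lambda name: (reports[name]["load"], name),
    )
    roles = {name: "failed" for name in reports}
    for index, name in enumerate(healthy):
        roles[name] = "active" if index < active_count else "standby"
    return roles


if __name__ == "__main__":
    reports = {
        "vnf1": {"healthy": True, "load": 0.7},
        "vnf2": {"healthy": True, "load": 0.2},
        "vnf3": {"healthy": False, "load": 0.0},
        "vnf4": {"healthy": True, "load": 0.5},
    }
    print(assign_roles(reports, active_count=2))
```

Re-running the assignment whenever a new report arrives gives a
simple form of centrally driven switchover: a VNF that stops
reporting healthy loses its active role and a standby is promoted.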
3.5. Policy enforcement
Some policies may reflect the different reliability classes of
services and hence affect the selection of VNF instances. Examples
include isolation policies requiring that VNF instances be placed on
separate physical servers or at separate DC sites. Another example is
a policy to place some VNF instances in topologically close
locations. Other policies and related HA issues are TBD.
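An isolation policy of this kind can be checked mechanically. The
sketch below, under the assumption that each VNF instance's placement
is known as a host and a DC site, reports the locations shared by
more than one instance of the same cluster. The placement format and
the function name shared_risk_groups are illustrative only.

```python
from collections import defaultdict


def shared_risk_groups(placement, level):
    """Find locations at `level` ('host' or 'site') hosting several instances.

    `placement` maps an instance name to a dict such as
    {"host": "h1", "site": "dc1"}. The result maps each shared location
    to the sorted list of instances placed there; an empty result means
    the isolation policy at that level is satisfied.
    """
    by_location = defaultdict(list)
    for instance, location in placement.items():
        by_location[location[level]].append(instance)
    return {
        location: sorted(instances)
        for location, instances in by_location.items()
        if len(instances) > 1
    }


if __name__ == "__main__":
    placement = {
        "fw-1": {"host": "h1", "site": "dc1"},
        "fw-2": {"host": "h1", "site": "dc1"},
        "fw-3": {"host": "h2", "site": "dc2"},
    }
    print(shared_risk_groups(placement, "host"))
    print(shared_risk_groups(placement, "site"))
```

Here fw-1 and fw-2 share host h1 and site dc1, so a single host or
site failure would take out two members of the cluster at once.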
3.6. NFV HA solution requirements
To cope with the above challenges, the NFV HA solution should
provide the following functionalities:
o Redundancy mechanism
* The function to support VNF cluster with active/active mode;
* The function to support 1:N redundancy of VNF cluster;
* The function to support VNFs to join/leave or scale up/down
efficiently and automatically;
* The function of share-risk prevention.
o Failure detection
* The simple, uniform, protocol independent, and software driven
function to detect the failure of multiple layers of NFV
infrastructure.
o Failure notification
* The function to notify central management server about VNF
failure;
* The function to notify other interested or influenced VNF about
VNF failure.
o State synchronization
* The function to support dynamic changes of a VNF member's role
in a VNF cluster due to switchover, by tracking the binding
relationship between active VNFs and standby VNFs;
* The function to support VNF keep-alive monitoring and efficient
state synchronization that avoids full-mesh connectivity by
using centrally controlled technology.
o Failure handling (switchover and failover)
* The function to support quick convergence for large VNF
cluster;
* The function to support central management server;
* The function to utilize new NFV features, for example scaling
up/down, restarting and dynamic relocation of VNF to overcome
VNF failure or over-load.
o Compatibility
* Support for VNFs from different providers;
* Compatibility with existing hardware network elements.
4. Gaps between available HA solutions and NFV's challenges
The available HA solutions are combinations of several technologies
deployed at different points of the network. An overview is as
follows:
o VRRP is a common redundancy mechanism at Layer 3 and supports
failure handling;
o BFD is a common technology for failure detection;
o The APS protocol is mainly used for failure notification, which
naturally leads to the failure handling process;
o NSR, NSF, SSO and GR support state synchronization between
neighboring devices, or with a standby RP, while a device restarts;
o STP and MPLS FRR are deployed as failure handling mechanisms at
Layer 2 or Layer 3.
4.1. VRRP
The Virtual Router Redundancy Protocol (VRRP) is designed to
eliminate the single point of failure inherent in a statically
configured default-route environment at Layer 3. The nodes or ports
in the same group share the same virtual IP address while having
different MAC addresses of their own. The elements in a VRRP group
elect the active element using pre-set priorities. The priority and
role change only when a switchover or fallback occurs. The active
VRRP element notifies the hosts in the subnet using gratuitous ARP.
If the active VRRP element fails, the standby elements in the VRRP
group select a new active element. The new active element sends
gratuitous ARP to notify the corresponding hosts to update their MAC
tables. Consequently, flows whose destination IP is the virtual IP
are directed to the new active element.
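The priority-based election described above can be sketched as
follows. This is a simplified illustration, not a VRRP
implementation: advertisement timers, preemption settings and the
address-owner case are not modeled, and all elements are assumed to
use the same IP address family. The tie-break on the higher IP
address follows the VRRP specification; the tuple format and the
function name elect_active are assumptions.

```python
import ipaddress


def elect_active(elements):
    """Elect the active element of a VRRP-like group.

    `elements` is a list of (ip, priority, alive) tuples. Among the
    elements that are alive, the highest priority wins; a tie is broken
    in favor of the higher IP address. Returns the winner's IP as a
    string, or None if no element is alive.
    """
    candidates = [
        (priority, ipaddress.ip_address(ip))
        for ip, priority, alive in elements
        if alive
    ]
    if not candidates:
        return None
    _, winner = max(candidates)
    return str(winner)


if __name__ == "__main__":
    group = [
        ("192.0.2.1", 200, True),
        ("192.0.2.2", 100, True),
        ("192.0.2.3", 100, True),
    ]
    print(elect_active(group))
    # The active element fails; the standby elements elect a new one.
    group[0] = ("192.0.2.1", 200, False)
    print(elect_active(group))
```

In the usage example, 192.0.2.1 is active first; after it fails, the
two remaining elements tie on priority and 192.0.2.3 takes over. Note
that only one element is active at any time.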
VRRP works as an integrated solution for both redundancy and failure
handling.
For requirements of redundancy mechanism:
o VRRP supports clusters. However, there is only one active element
per VRRP group; multiple active elements in the same group are not
supported. Having multiple active elements requires multiple VRRP
groups, distinguished by different virtual IPs, and the gateway
configuration of the related hosts must be carefully planned
across these IPs, which is very inflexible;
o VRRP supports at most 255 elements in one group, which may be a
drawback for NFV deployment in cross-DC-site scenarios in the
future;
o Only the active element can send VRRP ADVERTISEMENT messages to
notify the standby elements in the same group; the standby elements
cannot announce their presence. Thus, the automatic discovery of
new elements is a problem for state synchronization;
o VRRP does not support share-risk prevention.
For requirements of failure handling:
o A VRRP group operates autonomously, without the support of a
central management server. Thus, it cannot rely on such a server
to choose flexibly and optimally among the VRRP elements according
to their respective performance;
o The standby elements take over the load of the active element when
it fails. However, if the active element failed because of
overload, this takeover is not effective. A VNF can scale up to
solve this problem, while VRRP cannot.
4.2. BFD
BFD is a lightweight hello protocol designed to run over multiple
transport protocols (e.g. IPv4, IPv6, MPLS) and used for Layer 3
failure detection. Any interested client (e.g. OSPF, BGP, HSRP)
registers with BFD and is notified as soon as BFD detects a
neighbor loss. BFD establishes monitoring sessions between two
neighbors and declares a link or node failure if no BFD packet is
received within the detection time.
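The period after which BFD declares a failure can be computed from
the negotiated session parameters. The sketch below follows the
asynchronous-mode rule of the BFD specification: the local detection
time is the remote system's detect multiplier times the larger of the
local required minimum receive interval and the remote desired
minimum transmit interval. The function and parameter names are
illustrative, and the values in the usage example are arbitrary.

```python
def detection_time_us(remote_detect_mult,
                      local_required_min_rx_us,
                      remote_desired_min_tx_us):
    """Detection time of a BFD session in asynchronous mode, in microseconds.

    The remote system transmits at the agreed interval, which is the
    larger of what the local system is willing to receive and what the
    remote system wants to send. The session is declared down after
    `remote_detect_mult` such intervals pass without a received packet.
    """
    agreed_interval_us = max(local_required_min_rx_us, remote_desired_min_tx_us)
    return remote_detect_mult * agreed_interval_us


if __name__ == "__main__":
    # Remote wants to send every 100 ms, local accepts down to 50 ms,
    # and the remote detect multiplier is 3.
    print(detection_time_us(3, 50_000, 100_000))
```

With these example values the failure is declared after 300 ms
without a received packet. Shorter intervals detect failures faster
but increase the packet rate, which matters when one element monitors
many VNFs.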
BFD is a good candidate failure detection solution for NFV networks
because it is simple, efficient, uniform, protocol independent, and
software driven. However, BFD can only detect Layer 3 failures and
does not support Layer 2. How it can support failure detection for
VNF clusters and for the multiple layers of the NFV infrastructure
also needs more study.
4.3. APS
The APS (Automatic Protection Switching) protocol is a mature and
proven mechanism specified for bidirectional protection switching,
which needs the coordination of the two endpoints of the transport
entity in SONET/SDH networks. It can now also be used in Ethernet
[ITU-T G.8031] and MPLS [ITU-T Y.1720] networks.
When a change in the transmitted status occurs (e.g. link/node
failure, force switch, signal fail), an endpoint immediately
transmits a new APS packet to inform the far-end endpoint, so that
the two can coordinate the protection switching.
The APS protocol is mainly used for failure notification, which
naturally leads to the failure handling process.
For requirements of redundancy mechanism:
o APS does not support VNF cluster with active/active mode;
o Usually, APS provides 1:1 or 1+1 backup and supports no more than
1:14 backup, which will limit the scale of an NFV network;
o APS does not support dynamic election mechanism of path role. The
role of path is pre-configured;
o APS does not support share-risk prevention.
For requirements of failure notification:
o APS does not provide north bound interface to central management
server.
For requirements of failure handling:
o APS does not support the new NFV failure handling features.
4.4. NSR, NSF, SSO and GR
TBD.
4.5. STP
The Spanning Tree Protocol (STP) is a network protocol that ensures a
loop-free topology for any bridged Ethernet local area network. The
basic function of STP is to prevent bridge loops and the broadcast
storm that results from them. Spanning tree also allows a network
design to include spare (redundant) links to provide automatic backup
paths if an active link fails, without the danger of bridge loops, or
the need for manual enabling/disabling of these backup links.
STP works as an integrated solution for both redundancy and failure
handling.
For requirements of redundancy mechanism:
o STP does not support VNF cluster with active/active mode;
o STP does not provide a mechanism for dynamically electing active
and standby elements in element groups. The roles are determined
by the configured network layout and by pre-assigned priorities,
port IDs, etc. After convergence, the role of a port can only be
changed by manual configuration or at the next convergence, when
some ports fail;
o STP does not provide the share-risk prevention mechanism.
For requirements of failure handling:
o STP does not provide north bound interface to central management
server;
o STP does not support new NFV failure handling features.
4.6. FRR
MPLS Fast Reroute is a local restoration network resiliency
mechanism. It is actually a feature of resource reservation protocol
(RSVP) traffic engineering (RSVP-TE). In MPLS local protection each
label switched path (LSP) passing through a facility is protected by
a backup path which originates at the node immediately upstream to
that facility.
FRR works as an integrated solution for both redundancy and failure
handling.
For requirements of redundancy mechanism:
o FRR does not support VNF cluster with active/active mode;
o FRR paths cannot dynamically elect active or standby paths; the
paths are manually configured or computed by algorithms such as
LFA (Loop-Free Alternate) [RFC 5286];
o FRR does not support share-risk prevention;
o Usually, only two or a few more paths can be protected by FRR; a
large number of VNFs in a cluster would make the network layout
very complex.
For requirements of failure handling:
o FRR does not provide north bound interfaces to central management
server;
o FRR has a problem similar to that of VRRP in making use of the new
NFV failure handling features.
4.7. OAM
TBD
5. Summary
In conclusion, there is a gap between the available HA technologies
and the new challenges of NFV.
+--------------------+---------+---------+---------+---------+
|                    | VRRP    | APS     | STP     | FRR     |
+--------------------+---------+---------+---------+---------+
| Support            |         |         |         |         |
| active/active      | no      | no      | no      | no      |
| cluster            |         |         |         |         |
+--------------------+---------+---------+---------+---------+
| Support 1:N        | no      | no      | no      | no      |
| backup             |         |         |         |         |
+--------------------+---------+---------+---------+---------+
| Automatic          | no      | no      | no      | no      |
| scalability        |         |         |         |         |
+--------------------+---------+---------+---------+---------+
| Share-risk         | no      | no      | no      | no      |
| prevention         |         |         |         |         |
+--------------------+---------+---------+---------+---------+
Figure 1: Gap Analysis Table of Redundancy Mechanism
Note:
1. For NFV, VNF cluster with active/active mode is one of the basic
requirements;
2. 1:N backup means 1 standby element for N active elements in one
VNF cluster.
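The meaning of 1:N backup in note 2 can be illustrated with a small
model. This is a sketch under stated assumptions: the class name
OneToNGroup and its fields are hypothetical, and the model ignores
repair, so the single standby element is consumed by the first
failure it protects.

```python
class OneToNGroup:
    """Toy 1:N protection group: one standby element for N active elements."""

    def __init__(self, actives, standby):
        # Maps each workload (named after its original active element)
        # to the element currently serving it.
        self.serving = {name: name for name in actives}
        self.standby = standby  # becomes None once the standby is in use

    def fail(self, name):
        """Handle the failure of active element `name`.

        Returns True if the standby took over, or False if the failure is
        unprotected because the standby is already in use.
        """
        if name not in self.serving:
            raise KeyError(name)
        if self.standby is None:
            self.serving[name] = None
            return False
        self.serving[name] = self.standby
        self.standby = None
        return True


if __name__ == "__main__":
    group = OneToNGroup(["vnf-a", "vnf-b", "vnf-c"], standby="vnf-s")
    print(group.fail("vnf-b"), group.serving)
    print(group.fail("vnf-c"), group.serving)
```

The first failure is protected by the standby element; a second
failure before repair is not. This trade-off is the cost side of 1:N
compared with 1:1 or 1+1, which dedicate a backup to every active
element.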
+--------------------------------------+--------------------+
|                                      | BFD                |
+--------------------------------------+--------------------+
| Support failure detection of         | TBD                |
| VNF cluster and multiple layers      |                    |
+--------------------------------------+--------------------+
Figure 2: Gap Analysis Table of Failure Detection
+--------------------------------+--------------+
|                                | APS          |
+--------------------------------+--------------+
| Notify central management      | no           |
| server                         |              |
+--------------------------------+--------------+
| Notify related VNF             | yes          |
+--------------------------------+--------------+
Figure 3: Gap Analysis Table of Failure Notification
+-----------------------------------------+---------------------+
|                                         |                     |
+-----------------------------------------+---------------------+
| Dynamic election                        |                     |
+-----------------------------------------+---------------------+
| Full mesh avoidance                     |                     |
+-----------------------------------------+---------------------+
Figure 4: Gap Analysis Table of State Synchronization
+-------------------+---------+--------+----------+----------+
|                   | VRRP    | APS    | STP      | FRR      |
+-------------------+---------+--------+----------+----------+
| North bound API   | no      | no     | no       | no       |
+-------------------+---------+--------+----------+----------+
| Support new NFV   | no      | no     | no       | no       |
| approaches        |         |        |          |          |
+-------------------+---------+--------+----------+----------+
Figure 5: Gap Analysis Table of Failure Handling
6. IANA Considerations
This document has no actions for IANA.
7. Security Considerations
TBD.
8. Informative References
[VNFP-PS] Zong, N., "Problem Statement for Reliable Virtualized
Network Function (VNF) Pool",
draft-zong-vnfpool-problem-statement-01 (work in progress),
September 2013.
[VNFP-UC] Xia, L., "Use cases and Requirements for Virtual Service
Node Pool Management",
draft-xia-vsnpool-management-use-case-01 (work in progress),
October 2013.
Author's Address
Yang Wang
Huawei
101 Software Avenue, Yuhua District
Nanjing, Jiangsu 210012
China
Email: alex.wangyang@huawei.com