NFV RG L. Mo
Internet-Draft B. Khasnabish, Ed.
Intended status: Informational ZTE (TX) Inc.
Expires: April 19, 2016 October 17, 2015
NFV Reliability using COTS Hardware
draft-mlk-nfvrg-nfv-reliability-using-cots-01
Abstract
This draft discusses the results of a recent study on the feasibility
of using Commercial Off-The-Shelf (COTS) hardware for virtualized
network functions in telecom equipment. In particular, it explores
the conditions under which the COTS hardware can be used in the NFV
(Network Function Virtualization) environment. The concept of silent
error probability is introduced in order to take software error or
undetectable hardware failures into account. The silent error
probability is included in both the theoretical work and the
simulation work. It is difficult to analyze the impact of site
maintenance and site failure events theoretically. Therefore,
simulation is used for evaluating the impact of these site
management related events, which constitute the undesirable
features of using COTS hardware in a telecom environment.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on April 19, 2016.
Copyright Notice
Copyright (c) 2015 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Conventions used in this document . . . . . . . . . . . . . . 4
2.1. Abbreviations . . . . . . . . . . . . . . . . . . . . . . 4
3. Network Reliability . . . . . . . . . . . . . . . . . . . . . 5
4. Network Part of the Availability . . . . . . . . . . . . . . . 7
5. Theoretical Analysis of Server Part of System Availability . . 9
6. Simulation Study of Server Part of Availability . . . . . . . 12
6.1. Methodology . . . . . . . . . . . . . . . . . . . . . . . 13
6.2. Validation of the Simulator . . . . . . . . . . . . . . . 16
6.3. Simulation Results . . . . . . . . . . . . . . . . . . . . 17
6.4. Multiple Servers Sharing the Load . . . . . . . . . . . . 19
7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 20
8. Security considerations . . . . . . . . . . . . . . . . . . . 21
9. IANA considerations . . . . . . . . . . . . . . . . . . . . . 21
10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 21
11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22
11.1. Normative References . . . . . . . . . . . . . . . . . . . 22
11.2. Informative References . . . . . . . . . . . . . . . . . . 22
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 22
1. Introduction
Using COTS hardware for network functions (e.g., IMS, EPC) has drawn
considerable attention in recent years. Some operators have
legitimate concerns regarding the reliability of COTS hardware,
given its reduced MTBF (mean time between failures) and the many
undesirable attributes of COTS hardware that are unfamiliar in the
traditional telecom industry.
In previous reliability studies (e.g., GR-77 [1]), the emphasis was
placed on hardware failures only. In this work, besides hardware
failures, which are characterized by the MTBF (mean time between
failures) and MTTR (mean time to repair), the silent error is also
introduced to take into account software errors and hardware
failures that are undetectable by the management system.
The silent error affects the system availability in different ways,
depending on the particular scenario.
In a typical system, a server performing certain network functions
will have another dedicated server as a backup. This is the normal
master-slave or 1+1 redundancy configuration of telecom equipment.
The server performing the network function is called the "master
server" and the dedicated backup is called the "slave server." In
order to differentiate the 1+1 redundancy scheme from the 1:1
redundancy scheme, the slave server is deemed "dedicated" in the 1+1
case. In 1:1 redundancy, both servers perform network functions
while protecting each other at the same time.
In any protection scheme, assuming a single fault for clarity of
discussion, the system availability will not be impacted if the
slave experiences a silent error and that silent error eventually
becomes observable in its behavior. In this case, another slave will
be identified and the master server will continue to serve the
network function. Until the new slave server becomes fully
functional, the system will operate with reduced error correction
capability.
On the other hand, if the master server experiences a silent error,
the data transmitted to the slave server could be corrupted. In this
case, the system availability will be impacted when such an error
becomes observable. On detection of such an error, both the master
server and the slave server need time to recover. The time for such
recovery is fixed in the NFV environment and is deemed to be the NFV
MTTR time. During this time interval, the network function is not
available, and this interval is counted as downtime in the
availability calculations.
Comparing the MTBF of COTS hardware with that of typical
telecom-grade hardware, the COTS hardware may have a lower MTBF due
to its relaxed design criteria.
Comparing the MTTR of COTS hardware with that of typical
telecom-grade hardware, the COTS time to repair is not a random
variable; it is actually fixed. The COTS MTTR is the time required
to bring up a server and make it ready to serve. For traditional
telecom hardware, the time to repair is a random variable and the
MTTR is the mean of that random variable. Because manual
intervention is normally required in the telecom environment, the
NFV COTS MTTR is normally assumed to be less than the traditional
telecom equipment MTTR.
The most obvious difference between these two hardware types (COTS
hardware and telecom-grade hardware) relates to their maintenance
procedures and practices. While telecom equipment takes pains to
minimize the impact of maintenance on system availability, COTS
hardware is normally maintained in a cowboy fashion (e.g., reset
first and ask questions later).
In this study, a closed-form solution is available for one or two
dedicated backup COTS servers in the NFV environment when the site
and maintenance related issues are absent. In order to evaluate the
site and maintenance related issues, a simulator is constructed to
study the system availability with one or two dedicated backup
servers.
It is shown that, with COTS hardware and all its undesirable
features, it is still possible to satisfy the telecom requirements
under reasonable conditions.
2. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC-2119 [RFC2119].
In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC-2119 significance.
2.1. Abbreviations
o A-N: Network Availability
o A-S: Server Availability
o A-Sys: System Availability
o COTS: Commercial Off-The-Shelf
o DC: Data Center
o MTBF: Mean Time Between Failures
o MTTF: Mean Time To Failure
o MTTR: Mean Time To Repair
o NFV: Network Function Virtualization
o PGUP: Protection Group Up Time
o PSTN: Public Switched Telephone Network
o SEPSA: Silent Error Probability, Server Availability
o SDN: Software-Defined Network/Networking
o TET: Total Elapsed Time
o VM: Virtual Machine
o WDT: Weighted Down Time
3. Network Reliability
In the NFV environment, the reliability analysis can be divided into
two distinct parts: the server part and the network part, where the
network part connects all the servers via the vSwitch and the server
part provides the actual network functions. This is illustrated in
the diagram shown in Figure-1.
+--------------------+
| Availability: A-S | Availability: A-N
| |
| |
| | +---------------+
| (VM) | | |
| COTS.............................| vSwitch 1 |
| Server............... ......| |
| | | \ / |(X) (X) .. (X) |
| | | \ / +---------------+
| | | \ /
| | | \ +---------------+
| | | / \ | vSwitch 2 |
| (VM) | / \ | |
| COTS................/ \......|(X) (X) .. (X) |
| Server............................| |
| | | |
+--------------------+ +---------------+
Figure 1: System Availability - Network Part and Server Part
If the overall system availability is denoted by the symbol (A-Sys),
the overall system availability is the product of the server part of
the system availability (A-S) and the network part of the system
availability (A-N).
EQ(1) ... ... ... A-Sys = [A-S x A-N]
Given the fact that both A-S and A-N are "less than" 1 (one), we have
A-Sys "less than" A-S and A-Sys "less than" A-N. In other words, if
FIVE 9s are required for system availability, both the server part
and the network part of the availability need to be better than FIVE
9s so that their product can reach FIVE 9s.
To improve the network part of the system availability, as
illustrated in Figure 1, the normal 1+1 protection scheme is
utilized. It shall be noted that the vSwitch may span a long-distance
transmission network in order to connect multiple data centers.
The mechanisms in the server part for improving availability are not
specified. In this study, it is assumed that one active server will
be supported by one or two backup servers. Normally, if the active
server is faulty, one of the backup server(s) will take over the
responsibility and hence there will be no loss of availability on the
server part.
There is a significant difference between the NFV environment and
dedicated traditional telecom equipment with respect to the time to
recover from a server fault. In the traditional telecom equipment
case, a manual replacement of some equipment (e.g., a faulty board)
is normally required, and hence the time for restoration after
experiencing a fault, normally denoted as the MTTR (Mean Time To
Repair), is long.
In the NFV environment, the time for restoration is the time required
to boot another virtual machine (VM) with the needed software and to
re-synchronize the data. Hence the MTTR in the NFV environment can be
considered to be shorter than that of traditional telecom equipment.
More importantly, the MTTR in the NFV environment can be considered
to be a fixed constant.
It is also understood that multiple servers will be active to share
the load. Contrary to common belief, this arrangement will neither
increase nor decrease the overall network availability if those
active servers are supported by one or two backup servers. This fact
will be elaborated in a later section from both the theoretical point
of view and simulations.
4. Network Part of the Availability
The traditional analysis can be applied to the network part of the
availability. In fact, the network part of the availability is
impacted by the availability of the switches that make up the
vSwitch and by the maximum number of hops in the vSwitch. The
vSwitch connects the VMs in the NFV environment.
If A-n denotes the availability of a network element, then for a
vSwitch with a maximum of h hops, the availability of the vSwitch
would be "(A-n)^h." Hence, considering the 1+1 configuration of the
vSwitch, A-N can be expressed by
EQ(2) ... ... ... A-N = [1 - (1 - (A-n)^h)^2 ]
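As an illustration only, the short Python sketch below evaluates
EQ(2); the function name and the printed spot checks are ours, not
part of any specification, and the values it produces can be compared
against Table-1 below.

   <CODE BEGINS>
   # Sketch of EQ(2): network part of the availability for two
   # parallel (1+1) vSwitch paths, each a series of "hops" network
   # elements with per-element availability A-n.
   def network_availability(a_n, hops):
       path = a_n ** hops                 # series of "hops" elements
       return 1.0 - (1.0 - path) ** 2     # at least one path survives

   # Spot checks against Table-1, e.g. A-n = 0.999, h = 10 -> ~0.99990
   for a_n in (0.99, 0.999, 0.9999, 0.99999):
       print(a_n, [round(network_availability(a_n, h), 6)
                   for h in (10, 16, 22, 26, 30)])
   <CODE ENDS>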
The network availability, as a function of the number of hops (h) and
the per network element availability (A-n), is illustrated in the
diagram shown in Figure-2.
While this 3-D illustration shows the general trend in network
availability, the following data table is able to give more details
regarding the network availability with different hop counts and
different network element availability, as shown in Table-1.
Table-1: Network Part of System Availability with Various Network
Element Availabilities and Hop Counts
+-------------+---------+----------+----------+----------+----------+
| Network     |   10    |    16    |    22    |    26    |    30    |
| Element     |         |          |          |          |          |
| Availability|         |          |          |          |          |
| / Hop       |         |          |          |          |          |
+-------------+---------+----------+----------+----------+----------+
| 0.99        | 0.99086 | 0.977935 | 0.96065  | 0.94712  | 0.932244 |
| 0.999       | 0.99990 | 0.999748 | 0.999526 | 0.999341 | 0.999126 |
| 0.9999      | 0.99999 | 0.999997 | 0.999995 | 0.999993 | 0.999991 |
| 0.99999     | 1       | 1        | 1        | 1        | 1        |
+-------------+---------+----------+----------+----------+----------+
+------------------------------------------+
/ / |
/ Five 9s / |
/ / |
/ ... ... -/ |
/ ... ... ... ..... / |
+-----------------------------------------+ +..0.99999
| . .... | /
| .... .... ... . . . . . | /0.9999
| . | / Net Element
| .| / Availability
| | / 0.999
| |/
+-----------------------------------------+..0.99
2 8 12 18 24
..............Hop Count..........>
Figure 2: Network Part of the System Availability with Different Hop
Counts and Different Network Element Availability
In order to achieve the FIVE 9s reliability normally demanded by
telecommunication operators, the network element reliability needs to
be at least FOUR 9s if the hop count is more than 10.
In fact, in order to achieve FIVE 9s while the per network element
availability is only THREE 9s, the hop count needs to be less than
two, which is deemed impractical.
5. Theoretical Analysis of Server Part of System Availability
In GR-77 [1], extensive analysis has been performed for systems under
various conditions. In the NFV environment, if the server's
availability is denoted as the symbol (Ax), the server part of the
system availability (As), with a 1+1 master and slave configuration,
can be given by [1] in Part D, Chapter 6.
EQ(3) ... ... ... As = [1-((1-Ax)^2)]
In a more practical environment, there will be silent errors (errors
that cannot be detected by the system under consideration). The
silent error probability will be expressed by the symbol (Pse).
We further assume that the silent error only affects the master of
the system, because the master is the one that has the ability to
corrupt the data. In practical engineering terms, this assumption can
be articulated as follows: "when an error is detected and there is no
obvious cause for it, the master of the master-slave configuration is
assumed to be correct, while the slave goes through an MTTR time to
recover."
The state transition can be illustrated as in the following diagram:
Figure 3: State Transition for System with only one Backup, ...
(Note: a dot-and-dash version of the diagram is being developed)....
With the state transition diagram outlined in Figure 3, the system
availability in a 1+1 master-slave configuration can be expressed as
follows.
EQ(4a)... ... ... ... As = [1 - ((1-Ax)^2 + Pse Ax (1-Ax))]
EQ(4b)... ... ... ... As = [(2-Pse) Ax - (1-Pse) (Ax)^2]
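A minimal Python sketch of EQ(4a), offered for illustration only (the
function name is ours), is shown below; it reproduces the Table-2
entries that follow.

   <CODE BEGINS>
   # Sketch of EQ(4a): server part of the availability for a single
   # backup (1+1) configuration with server availability Ax and
   # silent error probability Pse.
   def single_backup_availability(ax, pse):
       # down if both servers fail, or if the master's silent error
       # forces both master and slave through an MTTR recovery
       return 1.0 - ((1.0 - ax) ** 2 + pse * ax * (1.0 - ax))

   # Spot check against Table-2: Ax = 0.99, Pse = 0.1 -> 0.99891
   print(single_backup_availability(0.99, 0.1))
   <CODE ENDS>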
The following diagram (Figure 4) illustrates the server part of the
availability with different per server availability and different
silent error probability.
Figure 4: Server Part of the System Availability with Various Server
Availability and Silent Error Probability ... (Note: a dot-and-dash
version of the diagram is being developed)....
While the graphics illustrate the trends, the following data table
will give precise information on a single backup (1+1) configuration.
Table-2: Server Part of availability for different silent error
probability and different server availability for single backup
configuration
+-------+---------+-----------+-------------+----------+
| SEPSA | 0.99000 | 0.99900 | 0.99990 | 0.99999 |
+-------+---------+-----------+-------------+----------+
| 0.0 | 0.9999 | 0.999999 | 0.99999999 | 1.0 |
| 0.1 | 0.99891 | 0.9998991 | 0.999989991 | 0.99999 |
| 0.2 | 0.99792 | 0.9997992 | 0.999979992 | 0.999998 |
| 0.3 | 0.99693 | 0.9996993 | 0.999969993 | 0.999997 |
| 0.4 | 0.99594 | 0.9995994 | 0.999959994 | 0.999996 |
| 0.5 | 0.99495 | 0.9994995 | 0.999949995 | 0.999995 |
| 0.6 | 0.99396 | 0.9993996 | 0.999939996 | 0.999994 |
| 0.7 | 0.99297 | 0.9992997 | 0.999929997 | 0.999993 |
| 0.8 | 0.99198 | 0.9991998 | 0.999919998 | 0.999992 |
| 0.9 | 0.99099 | 0.9990999 | 0.999909999 | 0.999991 |
| 1 | 0.99 | 0.999 | 0.9999 | 0.99999 |
+-------+---------+-----------+-------------+----------+
In the above table, the entries in the right-most (5th) column, the
top entry in the 4th column, and the top-most entry in the 3rd column
outline the region in which five 9s availability is possible. As
evidenced in the table, the server part of the availability
deteriorates rapidly with the silent error probability. While it is
possible to achieve five 9s of availability with a server
availability of only three 9s when there is no silent error, five 9s
of server availability is required as soon as the silent error
probability is 10%.
While the 1+1 configuration illustrated above seems reasonable for
the server part of the system availability (As), there may be cases
demanding more than a 1+1 configuration for reliability. For a system
with two backups, the availability, without consideration of the
silent error, can be expressed as [1] (Part D, Chapter 6)
EQ(5) ... ... ... As = [1-((1-Ax)^3)]
With the introduction of the silent error probability, the error
transition can be expressed in the following diagram:
Figure 5: Error State Transition for System with only two Backups ...
(Note: a dot-and-dash version of the diagram is being developed)....
With the introduction of the silent error, and observing the error
transition above, assuming the silent error event and the server
fault event are independent (e.g., a software error as the cause of
the silent error and a hardware failure as the cause of the server
fault), the server part of the availability for the dual backup case
can be given by
EQ(6a)... ... As = 1-((1-Ax)^3 + Pse(1 - Ax)((Ax)^2 + 2Ax(1-Ax)))
EQ(6b)... ... As = (3-2Pse)Ax - 3(1-Pse)(Ax)^2 + (1-Pse)(Ax)^3
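The corresponding sketch for the dual backup case simply evaluates
EQ(6a) as written above; again, the function name is ours, and the
spot checks refer to Table-3 below and to the Pse = 1 observation
that follows.

   <CODE BEGINS>
   # Sketch of EQ(6a): server part of the availability for the dual
   # backup configuration, evaluated exactly as written in the draft.
   def dual_backup_availability(ax, pse):
       return 1.0 - ((1.0 - ax) ** 3
                     + pse * (1.0 - ax)
                       * (ax ** 2 + 2.0 * ax * (1.0 - ax)))

   print(dual_backup_availability(0.99, 0.2))  # Table-3: 0.9979992
   print(dual_backup_availability(0.99, 1.0))  # collapses to Ax = 0.99
   <CODE ENDS>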
It should be noted that, when "Pse = 1" in both EQ(4) and EQ(6), the
server part of the system availability (As) and the server
availability (Ax) are the same. This relationship is to be expected
since, if the master always experiences the silent error, the backups
are useless and will be corrupted all the time.
The system availability with dual backup can be illustrated as
follows for different server availabilities and different silent
error probabilities, including errors caused by software
malfunctions.
Figure 6: Server Part of the System Availability with Various Silent
Error Probability and Server Availability for a dual Backup System
... (Note: a dot-and-dash version of the diagram is being
developed)....
As with the previous case, the diagram only illustrates the trend,
while the following table provides precise data for the system
availability under different silent error probabilities and server
availabilities for the dual backup case.
Table-3: System Availability with different silent error probability
and server availability (SEPSA) for dual backup configuration
+-------+-----------+-------------+---------+----------+
| SEPSA | 0.99000 | 0.99900 | 0.99990 | 0.99999 |
+-------+-----------+-------------+---------+----------+
| 0.0 | 0.999999 | 0.999999999 | 1.0 | 1.0 |
| 0.1 | 0.9989991 | 0.999899999 | 0.99999 | 0.999999 |
| 0.2 | 0.9979992 | 0.999799999 | 0.99998 | 0.999998 |
| 0.3 | 0.9969993 | 0.999699999 | 0.99997 | 0.999997 |
| 0.4 | 0.9959994 | 0.999599999 | 0.99996 | 0.999996 |
| 0.5 | 0.9949995 | 0.9995 | 0.99995 | 0.999995 |
| 0.6 | 0.9939996 | 0.9994 | 0.99994 | 0.999994 |
| 0.7 | 0.9929997 | 0.9993 | 0.99993 | 0.999993 |
| 0.8 | 0.9919998 | 0.9992 | 0.99992 | 0.999992 |
| 0.9 | 0.9909999 | 0.9991 | 0.99991 | 0.999991 |
| 1.0 | 0.99 | 0.999 | 0.9999 | 0.999990 |
+-------+-----------+-------------+---------+----------+
As shown in Table-3, the entries in the right-most (5th) column, the
top two entries in the 4th column, and the top-most entries in the
2nd and 3rd columns represent the five 9s region. Comparing the two
tables, dual backup is of only marginal advantage over single backup,
except for the case in which there is no silent error. In that case,
the five 9s server part of the system availability can be achieved
with a server availability of only two 9s.
From the data above, we can conclude that the silent error,
introduced by a software error or by a hardware error that is not
detectable by software, plays an important role in the server part of
the system availability and hence in the final system availability.
In fact, it becomes the dominant factor when Pse is more than 10%, in
which case the difference between single backup and dual backup is
not significant.
Some operators are of the opinion that a new approach to the
availability requirements is needed. COTS hardware is assumed to have
lower availability than traditional telecom hardware. But, in the NFV
environment, since each server (or VM) will only affect a small
number of users, the traditional five 9s requirement could be relaxed
while keeping the same user experience in terms of downtime. In other
words, the weighted downtime, in proportion to the number of users,
might be reduced in the NFV environment because each server affects
only a small number of users for a given server reliability.
Unfortunately, from the theoretical point of view, this is not true.
It is true that each server's downtime will only affect a small
number of users, but multiple active servers will experience more
server fault opportunities (this is similar to the famous reliability
argument for the twin-engine Boeing 777). As long as the protection
scheme, or more importantly the number of backups, is the same, the
eventual system availability will be the same, regardless of what
portion of the users each server serves.
6. Simulation Study of Server Part of Availability
In the above theoretical analysis of the server part of the
availability, the following factors are not considered: (A) site
maintenance (e.g., software upgrades, patching, etc.) affecting the
whole site, and (B) site failure (e.g., earthquake).
While traditional telecom-grade equipment puts a lot of emphasis and
engineering complexity into ensuring smooth migration, smooth
software upgrades, and smooth patching procedures, COTS hardware and
its related software are notorious for lacking such capabilities.
This is the primary reason for operators to be hesitant about
utilizing COTS hardware, even though COTS hardware in the NFV
environment does have an improved MTTR compared to traditional
telecom hardware.
While it is relatively easy to obtain a closed form of the system
availability for the ideal case without site related issues, it is
extremely difficult to obtain an analytical solution when site issues
are involved. In this case, we resort to numerical simulation under
reasonable assumptions [2, 3, 4].
6.1. Methodology
In this section, the various assumptions and the outline of the
simulation mechanisms will be discussed.
A discrete event simulator is constructed to obtain the availability
of the server part. In the simulator, an active server (the master
server, which processes the network traffic) will be supported by 1
(single backup) or 2 (dual backup) servers in other site(s).
For the failure probability of a server, it is common to assume a
bathtub-shaped failure rate (Weibull distribution). In practice, we
require that the NFV management provide servers that operate on the
flat part of the bathtub curve. In this case, the familiar
exponential distribution can be utilized.
In the discrete event simulator, each server will be scheduled to
work for a certain duration of time. This duration is a random
variable with an exponential distribution, which is commonly used to
model server behavior during its useful life, with the mean given by
the MTBF of the server.
In fact, the flat part of the bathtub curve can be related to the
normal server MTBF (mean time between failures), with the failure
density function expressed as f(x) = (1/MTBF) * e^(-x/MTBF).
After the working duration, the server will be down for a fixed time
duration, which represents the time needed to start another virtual
machine to replace the one in trouble. This part is actually
different from traditional telecom-grade equipment. Here, the
assumption is that there will always be another server available to
replace the one that went down. Hence, regardless of the nature of
the fault, the downtime for a server fault is fixed and represents
the time needed to have another server ready to take over the task.
The following diagram shows this arrangement for a system with only
one backup. It shall be noted that, while the server up time
duration is variable, the server down time will be fixed.
Figure 7: The life of the Servers ... (Note: a dot-and-dash version
of the diagram is being developed)....
The servers will be hosted in "sites," which are considered to be
data centers. In this simulation, during initial setup, the servers
supporting each other for reliability purposes will be hosted in
different sites. This is to minimize the impact of the site failure
and site maintenance.
In order to model the system behavior with one or two backups, the
concept of protection group is introduced.
A protection group consists of a "master" server with one or two
"slave" server(s) in other site(s). There may be multiple protection
groups inside the network, with each protection group serving a
fraction of the users.
A protection group will be considered to be "down" if every server in
this group is dead. During the time the protection group is "down",
the network service will be affected and the network is considered to
be "down" for the group of users this protection group is responsible
for.
The uptime and downtime of the protection group are recorded in the
discrete event simulator. The server part of the availability is
given by the following, where the total elapsed time is the total
simulation time in the discrete event simulator (a simplified
simulation sketch follows the definitions below):
EQ(7) ... ... Availability(server part) = [(PGUP)/(TET)], where
o PGUP is Protection Group Up Time
o TET is the Total Elapsed Time
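For illustration, the following simplified Python sketch applies
EQ(7) to a single protection group with one dedicated backup. It
models only the server behavior described above (exponentially
distributed up times, a fixed MTTR, and silent errors on master
faults that drag the backup down); site maintenance and site failure
are deliberately omitted, all names and parameter values are ours,
and this is not the simulator used in this study.

   <CODE BEGINS>
   import random

   def simulate_protection_group(mtbf=10000.0, mttr=0.1, pse=0.1,
                                 horizon=50000000.0, seed=7):
       """Availability = PGUP/TET (EQ(7)) for one master + one backup.

       Each server alternates an exponentially distributed up time
       (mean mtbf, hours) and a fixed down time (mttr, hours).  A
       master fault is silent with probability pse, in which case the
       backup is dragged down as well and the group is down for a
       full mttr.
       """
       rng = random.Random(seed)
       t_event = [rng.expovariate(1.0 / mtbf),
                  rng.expovariate(1.0 / mtbf)]
       repairing = [False, False]   # True while a server is recovering
       master, downtime, down_since = 0, 0.0, None
       while True:
           i = 0 if t_event[0] <= t_event[1] else 1
           t = t_event[i]
           if t >= horizon:
               break
           if repairing[i]:          # server i finishes its recovery
               repairing[i] = False
               t_event[i] = t + rng.expovariate(1.0 / mtbf)
               if repairing[master]:
                   master = i        # recovered server takes over
               if down_since is not None:
                   downtime += t - down_since   # outage ends
                   down_since = None
           else:                     # server i fails at time t
               silent = (i == master) and rng.random() < pse
               repairing[i] = True
               t_event[i] = t + mttr
               if silent:            # silent error also hits the backup
                   repairing[1 - i] = True
                   t_event[1 - i] = t + mttr
               if all(repairing) and down_since is None:
                   down_since = t    # protection group goes down
               if i == master and not repairing[1 - i]:
                   master = 1 - i    # backup takes over immediately
       if down_since is not None:
           downtime += horizon - down_since
       return 1.0 - downtime / horizon

   # With mtbf = 10000 h and mttr = 6 minutes (the "R6" case), the
   # result should sit close to EQ(4) at Ax = mtbf / (mtbf + mttr).
   print(simulate_protection_group())
   <CODE ENDS>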
The concepts of protection group, site, and server can be illustrated
as follows (Figure 8) for a system with two backups. It shall be
noted that the protection group is an abstract concept, and that the
corresponding portion of the network function is unavailable if and
only if all the servers in the protection group are not functioning.
Figure 8: Servers, Sites, and Protection Group ... (Note: a dot-and-
dash version of the diagram is being developed)....
Even though the simulator allows each site to have a configurable
number of servers, there is little use for this arrangement. The
system availability will not change regardless of how many servers
per site are used to support the system, as long as there is no
change in the number of servers in the protection group. Increasing
the number of servers per site essentially increases the number of
protection groups. Over a long time duration, each protection group
will experience similar downtime for the same uptime (i.e., will have
the same availability).
As in the theoretical analysis, the silent error, due to software or
subtle hardware failure, will only affect the active (or master)
server. When the master server fails with a silent error, both the
master and "slave" servers will go through an MTTR time to recover
(e.g., the time to instantiate two VMs simultaneously). In this case,
this part of the system (or this protection group) is considered to
be under fault.
In this reliability study, the focus is the number of backups for
each protection group, where the 1+1 configuration is a typical
configuration for the single backup mechanism. A load sharing
arrangement such as 1:1 can be viewed as two protection groups.
In general, the load sharing scheme will have lower availability
because, in the 1:1 case, any server fault will result in faults in
two different protection groups. This can be extended to the 1:2
case, where three protection groups are involved and any server fault
will introduce faults in three different protection groups. In this
study, the load sharing mechanisms will not be elaborated further.
A site will also go through its maintenance work. Traditional
telecom-grade equipment and COTS hardware differ mainly on this
point. For telecom-grade equipment, minimal impact on system
performance or system availability is maintained during the
maintenance window. For COTS hardware, the maintenance work may be
more frequent and more destructive.
In order to simulate the maintenance aspect of COTS hardware, the
simulator will put a site "under maintenance" at random times. The
interval during which the site is working is also assumed to be an
exponentially distributed random variable, with a mean that is
configurable in the simulator. The duration of the maintenance is a
uniformly distributed random variable with a configured mean,
minimum, and maximum.
In order to put a site "under maintenance," there shall be no fault
inside the network. All the servers on the site to be put "under
maintenance" will be moved to other sites. Hence, no traffic will be
impacted during the process of putting the site under maintenance.
Of course, the resilience against site failure while some site is
under maintenance will be reduced.
When a site comes back from maintenance, it will attempt to reclaim
all the server responsibilities that were transferred due to the site
maintenance.
o For each protection group, if every server is working, the
protection group will re-arrange the protection relationship so that
each site has only one server in the protection group. The new server
on the site coming back from maintenance will need an MTTR time to be
ready for backup. In this case, there is no loss of service in the
system.
o For each protection group, if there is at least one working server
and at least one server in a fault condition, one working server will
be added to the protection group. The new server on the site coming
back from maintenance will need an MTTR time to be ready for backup.
In this case, there is no loss of service in the system.
o For each protection group, if there are no servers working, the
protection group will gain a working server from the site coming back
from maintenance. The new server will need an MTTR time to be ready
for service. In this case, the system will provide service after the
new server is ready.
A site can also be under fault (e.g., loss of power, operation under
reduced capability due to thermal issues, or an earthquake). The
simulator can also simulate the effect of such events, with the site
up duration being an exponentially distributed random variable with a
configurable mean. The site failure duration is expressed as a
uniformly distributed random variable with a configurable mean,
minimum, and maximum.
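Purely as an illustration of how such site events could be drawn
(this is our own sketch, not the draft's simulator; the numeric
values are only the examples quoted later in Section 6.3, and a plain
uniform draw between the configured bounds is an assumption), the
next maintenance and failure windows might be sampled as follows:

   <CODE BEGINS>
   import random

   rng = random.Random(42)

   # Time until the site next enters maintenance or fails:
   # exponentially distributed with a configurable mean (hours).
   next_maintenance_in = rng.expovariate(1.0 / 1000.0)   # mean 1000 h
   next_failure_in = rng.expovariate(1.0 / 20000.0)      # mean 20000 h

   # Duration of the event: drawn uniformly between the configured
   # minimum and maximum (hours).
   maintenance_duration = rng.uniform(4.0, 48.0)
   failure_duration = rng.uniform(4.0, 24.0)

   print(next_maintenance_in, maintenance_duration,
         next_failure_in, failure_duration)
   <CODE ENDS>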
6.2. Validation of the Simulator
In order to verify the correctness of the simulator (e.g., the random
number generator, the overall program structure, etc.), the
simulation is performed with various server availabilities and
various silent error probabilities.
For the single backup case, the error between the theoretical data
and the simulation data for the system availability on the server
part is illustrated by the following diagram (Figure 9).
Figure 9: Verification of Simulator for Single Backup Case ...
(Note: a dot-and-dash version of the diagram is being developed)....
As we can see, the magnitude of the errors is within 10^(-5), which
is very small considering that the nominal value of the system
availability for the server part is close to 1.0. For the dual backup
case, the error between the simulated and theoretical system
availability for different silent error probabilities and server
availabilities is illustrated as follows (Figure 10).
Figure 10: Verification of Simulator for Dual Backup Case ... (Note:
a dot-and-dash version of the diagram is being developed)....
This is similar to the single backup case, where the errors are
within the same range. This error information gives us the needed
confidence in the simulation results for the complicated cases where
analytical solutions are elusive.
6.3. Simulation Results
The effect of the MTTR in the NFV environment is studied first. In
this study, the combined effect of the MTTR and the silent error
probability is shown below:
Figure 11: Availability with Various Silent for different MTTRs...
(Note: a dot-and-dash version of the diagram is being developed)....
In the diagram (Figure 11), R6 represents an MTTR of 6 minutes while
R60 represents an MTTR of 60 minutes. The x-axis is the silent error
probability. As shown, the MTTR (the time to recover from a fault, or
the time for VM rebirth) affects the slope of the system
availability, which declines as the silent error probability
increases. In the above example, the server MTBF is assumed to be
10000 hours, which represents a server availability of 0.9994 for the
R6 case and 0.994 for the R60 case.
The two curves starting at approximately 1.0 are the system
availability with dual backups, while the other two are the system
availability with a single backup. It should be noted that, for the
dual backup case, there is little difference in availability for
different MTTRs when there is no silent error. Intuitively, this is
expected due to the added number of backup servers.
In this simulation, both site failures (with a mean time between
failures of 20000 hours) and site maintenance (with a mean time
between site maintenance events of 1000 hours) are considered. The
mean site failure duration is assumed to be 12 hours (uniformly
distributed between 4 hours and 24 hours) and the mean site
maintenance duration is 24 hours (uniformly distributed between 4
hours and 48 hours).
The next step is to evaluate the impact of the site issues (site
failure and site maintenance). Consider the very bad site outlined
above, which has a mean time between site failures of 2 times the
server MTBF and a mean time between site maintenance events of 0.1
times the server MTBF. The availability of the server part can be
illustrated for different silent error probabilities and server
availabilities in the single backup configuration.
Figure 12: Availability for the Server Part in Single Backup
Configuration... (Note: a dot-and-dash version of the diagram is
being developed)....
As the data illustrate, in order to achieve high availability, the
server availability needs to be very high. In fact, the server
availability needs to be in the range of FIVE 9s in order to achieve
a system availability of FIVE 9s under these site related issues. For
the dual backup system with exactly the same configuration, the
result is better and can be illustrated as follows:
Figure 13: Availability for the Server Part in Dual Backup
Configuration... (Note: a dot-and-dash version of the diagram is
being developed)....
With a server availability of FOUR 9s and low silent error
probabilities, the server part of the availability can achieve FIVE
9s. Next, consider a site with fewer issues, such as one whose mean
time between site failures is 100 times the server MTBF and whose
mean time between site maintenance events is 0.1 times the server
MTBF. The mean site failure duration is again assumed to be 12 hours
(uniformly distributed between 4 hours and 24 hours) and the mean
site maintenance duration is 24 hours (uniformly distributed between
4 hours and 48 hours). The result for the single backup system is
shown as follows:
Figure 14: Server Part of Availability for a Good Site on Single
Backup... (Note: a dot-and-dash version of the diagram is being
developed)....
The following data table (Table-4) gives precise information
regarding these simulation results.
Table-4: Details Regarding Availability on Server Part for Single
Backup on a Good Site
+-------------------+----------+----------+------------+------------+
| Silent | 0.990099 | 0.999001 | 0.99990001 | 0.99999 |
| Error/Server | | | | |
| Availability | | | | |
+-------------------+----------+----------+------------+------------+
| 0.0 | 0.998971 | 0.999959 | 0.9999992 | 1.0 |
| 0.1 | 0.997918 | 0.999857 | 0.99998959 | 0.99999901 |
| 0.2 | 0.996908 | 0.999771 | 0.99997957 | 0.99999804 |
| 0.3 | 0.995999 | 0.999674 | 0.99996935 | 0.99999695 |
+-------------------+----------+----------+------------+------------+
As evidenced in the table above, the server part of the system
availability is impacted by the silent error, and a single redundant
server provides only a marginal improvement when the silent error
probability is small.
Figure 15: Server Part of Availability for a Good Site on Dual
Backup... (Note: a dot-and-dash version of the diagram is being
developed)....
The diagram above gives the general trend in system availability, and
the following data table provides the precise values.
Table 5: Details Regarding Availability on Server Part for Dual
Backup on a Good Site
+---------------+------------+------------+------------+------------+
| Silent | 0.99009901 | 0.999001 | 0.99990001 | 0.99999 |
| Error/Server | | | | |
| Availability | | | | |
+---------------+------------+------------+------------+------------+
| 0.0 | 0.9999939 | 0.99999998 | 1.0 | 1.0 |
| 0.2 | 0.9981346 | 0.99980209 | 0.99998048 | 0.99999792 |
| 0.4 | 0.99615083 | 0.99960136 | 0.99996002 | 0.99999594 |
| 0.5 | 0.99522474 | 0.9995184 | 0.99995225 | 0.99999503 |
+---------------+------------+------------+------------+------------+
From the tables for single and dual backup, we can see that dual
backup only provides marginal benefit in the face of site issues.
Given the fact that site issues are inevitable in practice, a
geographically distributed single backup system is recommended for
simplicity.
6.4. Multiple Servers Sharing the Load
In this section, we outline the simulation results for the case where
multiple servers take care of the active workload. In this case, a
protection group failure will affect a smaller number of users.
In the simulation, each site will have N servers to serve the
workload. A weighted uptime and a weighted downtime are introduced.
The system availability is the weighted uptime divided by the total
of the weighted uptime and weighted downtime.
EQ(8)... ... Weighted-Availability[Server-Part] = [(TET - WDT)/TET],
where
o TET is the Total Elapsed Time
o WDT is the Weighted Down Time
If any protection group (i) is down, the WDT will be updated as
follows:
EQ(9)... ... WDT = WDT + [Protection Group (i) Down Time]/N
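A small Python sketch of EQ(8) and EQ(9), using our own names and
purely illustrative numbers, makes the bookkeeping concrete; note
that the result reduces to the average of the per-group
availabilities, which is the point made below.

   <CODE BEGINS>
   # Sketch of EQ(8)/EQ(9): N protection groups, each serving 1/N of
   # the users; group_downtimes holds each group's total downtime
   # (hours) observed over the same total elapsed time TET.
   def weighted_availability(group_downtimes, tet):
       n = len(group_downtimes)
       wdt = sum(d / n for d in group_downtimes)  # EQ(9), per group
       return (tet - wdt) / tet                   # EQ(8)

   print(weighted_availability([2.0, 1.5, 2.5], 1000000.0))
   <CODE ENDS>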
For a system with three protection groups (i.e., with the servers
sharing the workload), the availability of each protection group, as
well as the weighted availability, is obtained as follows (Table-6):
Table-6: Availability of Protection Groups and the Weighted
Availability (Dual Backup)
+--------------+--------------+--------------+--------------+--------------+--------------+
| Availability | Availability | Availability | Availability | Measured     | Protection   |
| / Silent     | of           | of           | of           | Weighted     | Group        |
| Error        | Protection   | Protection   | Protection   | Availability | Average -    |
| Probability  | Group 1      | Group 2      | Group 3      |              | Weighted     |
|              |              |              |              |              | Availability |
+--------------+--------------+--------------+--------------+--------------+--------------+
| 0.0          | 1.0          | 1.0          | 1.0          | 1.0          | 0.0          |
| 0.2          | 0.999998015  | 0.999998005  | 0.999997985  | 0.999998001  | 6.66668E-11  |
| 0.4          | 0.999996027  | 0.999996018  | 0.999995988  | 0.999996011  | -3.33333E-11 |
+--------------+--------------+--------------+--------------+--------------+--------------+
In this case, there is little difference between the different
protection groups. The weighted availability is effectively the
average of the availabilities of all the protection groups. This also
illustrates the fact that, regardless of how many servers share the
active load, the system availability will be the same as long as (A)
the number of backups is the same, and (B) each server's availability
is the same.
7. Conclusions
The system availability can be divided into two parts: the
availability of the network part and the availability of the server
part. The final system availability is the product of those two
parts.
The system availability from the network is determined by the maximum
number of hops and the individual network element availability, with
the fault-tolerant setup assumed to be 1+1. The system availability
from the server is mainly determined by the following parameters.
o Availability of each individual server
o Silent error probability
o Site related issues (maintenance, fault)
o Protection Scheme (one or two dedicated backups)
The silent error is introduced to take into account software errors
and hardware errors that are not detectable by software. The system
availability on the server part will be dominated by such silent
errors if the silent error probability is more than 10%. This is
shown in both the theoretical work and the simulations.
It is interesting to note that the dual backup scheme provides only
marginal benefits, and the added complexity may not warrant such
practice in a real network.
It is possible for COTS hardware to provide availability as high as
that of traditional telecom hardware if the server itself is of
reasonably high availability. The undesirable attributes of COTS
hardware have been modelled as the site related issues, such as site
maintenance and site failure, which are not applicable to traditional
telecom hardware. Hence, in calculating the server availability, the
site related issues are to be excluded.
It is critical for the virtualization infrastructure management to
provide as much hardware failure information as possible to improve
the availability of the application. As seen in both theoretical
work and simulation, the silent error probability becomes a dominant
factor in the final availability. The silent error probability can
be reduced if the virtualization infrastructure management is capable
of fault isolation.
8. Security considerations
To be determined.
9. IANA considerations
This Internet Draft includes no request to IANA.
10. Acknowledgements
The authors would like to thank the NFV RG chairs (Diego and Ramki)
for encouraging discussions and guidance.
11. References
11.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/
RFC2119, March 1997,
<http://www.rfc-editor.org/info/rfc2119>.
[I-D.irtf-nfvrg-nfv-policy-arch]
Figueira, N., Krishnan, R., Lopez, D., Wright, S., and D.
Krishnaswamy, "Policy Architecture and Framework for NFV
Infrastructures", draft-irtf-nfvrg-nfv-policy-arch-01
(work in progress), August 2015.
[1] GR-77, "Applied R&M Manual for Defense Systems", 2012.
11.2. Informative References
[2] Papoulis, A., "Probability, Random Variables, and
Stochastic Processes", 2002.
[3] Bremaud, P., "An Introduction to Probabilistic Modeling",
1994.
[4] Press, et al, W., "Numerical Recipes in C/C++", 2007.
Authors' Addresses
Li Mo
ZTE (TX) Inc.
2425 N. Central Expressway
Richardson, TX 75080
USA
Phone: +1-972-454-9661
Email: li.mo@ztetx.com
Bhumip Khasnabish (editor)
ZTE (TX) Inc.
55 Madison Avenue, Suite 160
Morristown, New Jersey 07960
USA
Phone: +001-781-752-8003
Email: vumip1@gmail.com, bhumip.khasnabish@ztetx.com
URI: http://tinyurl.com/bhumip/