Network Working Group A. Retana
Internet-Draft Cisco Systems
Intended status: Standards Track R. White
Expires: September 15, 2013 Verisign
March 14, 2013
A Framework for Measuring Network Complexity
draft-retana-network-complexity-framework-00.txt
Abstract
Network architecture revolves around the concept of fitting the
design of a network to its purpose; of asking the question, "what
network will best fit these needs?" A part of fitting network design
to requirements is the problem of complexity, an idea often measured
by "seat of pants" methods. When would adding a particular protocol,
policy, or configuration be "too complex?" This document suggests a
series of continuums along which network complexity might be
measured. No suggestions on how to measure complexity along each of
these continuums are provided; this is left for future documents.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 15, 2013.
Copyright Notice
Copyright (c) 2013 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
   1.  Introduction
   2.  Requirements notation
   3.  Control Plane State versus Optimal Forwarding Paths (Stretch)
   4.  Configuration State versus Failure Domain Separation
   5.  Policy Centralization versus Optimal Policy Application
   6.  Configuration State versus Per Hop Forwarding Optimization
   7.  Reactivity versus Stability
   8.  Conclusion
   9.  Security Considerations
   10. Normative References
   Authors' Addresses
1. Introduction
Network complexity is a systemic, rather than component level,
problem; complexity must be measured in terms of the multiple moving
parts of a system, and complexity may be more than the complexity of
the individual pieces, examined individually, might suggest. There
are two basic ways in which systemic level problems might be
addressed: interfaces and continuums. In addressing a systemic
problem through interfaces, we seek to treat each piece of the system
as a "black box," and develop a complete understanding of the
interfaces between these black boxes. In addressing a systemic problem
as a continuum, we seek to understand the impact of a single change
or element to the entire system as a set of tradeoffs. While network
complexity can profitably be approached from either of these
perspectives, the authors of this document have chosen to approach
the systemic impacts of network complexity from the perspective of
continuums of tradeoffs. In theory, modifying the network to resolve
one particular problem (or class of problems) will add complexity
which results in the increased likelihood (or appearance) of another
class of problems. Discovering these continuums of tradeoffs, and
then determining how to measure each one, become the key steps in
understanding and measuring systemic complexity in this view.
This document proposes five such continuums; more may be possible.
Others may be added into this document in future revisions, or
documented in other drafts, as circumstances dictate.
o Control Plane State versus Optimal Forwarding Paths (or its
opposite measure, stretch)
o Configuration State versus Failure Domain Separation
o Policy Centralization versus Optimal Policy Application
o Configuration State versus Per Hop Forwarding Optimization
o Reactivity versus Stability
Each of these continuums is described in a separate section of this
draft.
2. Requirements notation
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
3. Control Plane State versus Optimal Forwarding Paths (Stretch)
Control plane state is the aggregate amount of information carried by
the control plane through the network in order to produce the
forwarding table at each device. Each additional piece of
information added to the control plane --such as more specific
reachability information, policy information, additional control
planes for virtualization and tunneling, or more precise topology
information-- adds to the complexity of the control plane. This
added complexity, in turn, adds to the burden of monitoring,
understanding, troubleshooting, and managing the network. Removing
control plane state, however, is not always a net positive gain for
the network as a system; removing control plane state almost always
results in decreased optimality in the forwarding and handling of
packets travelling through the network. This decreased optimality
can be termed stretch, which is defined as the difference between the
absolute shortest (or best) path traffic could take through the
network and the path the traffic actually takes through the network.
Stretch is expressed as the difference between the optimal and actual
path. The figure below provides an example of this tradeoff.
        R1-------+
        |        |
        R2       R3
        |        |
        R4-------R5
        |
        R6
Assume each link is of equal cost in this figure, and:
o R4 is advertising 192.0.2.1/32 as a reachable destination not
shown on the diagram
o R5 is advertising 192.0.2.2/32 as a reachable destination not
shown on the diagram
o R6 is advertising 192.0.2.3/32 as a reachable destination not
shown on the diagram
For R1, the shortest path to 192.0.2.3/32, advertised by R6, is along
the path [R1,R2,R4,R6]. Assume, however, the network administrator
decides to aggregate reachability information at R2 and R3,
advertising 192.0.2.0/24 towards R1 from both of these points. This
reduces the overall complexity of the control plane by reducing the
amount of information carried past these two routers (at R1 only in
this case). Aggregating reachability information at R2 and R3,
however, has the impact of making both routes towards 192.0.2.3/32
appear as equal cost paths to R1; there is no particular reason R1
should choose the shortest path through R2 over the longer path
through R3. This, in effect, increases the stretch of the network.
The shortest path from R1 to R6 is 3 hops, a path that will always be
chosen before aggregation is configured. Assuming half of the
traffic will be forwarded along the path through R2 (3 hops), and
half through R3 (4 hops), the network is stretched by ((3+4)/2) - 3,
or 0.5, "half a hop."
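
The arithmetic above can be captured in a short sketch. The following
Python fragment is an illustration added for clarity (it is not part
of the framework itself); the hop counts come from the figure, and the
even 50/50 split of traffic across the two equal-cost aggregates is an
assumption.

   # Stretch: average path actually taken minus the shortest possible
   # path.  Hop counts are taken from the example topology; the even
   # traffic split across the two equal-cost paths is an assumption.

   shortest_hops = 3            # R1-R2-R4-R6
   actual_paths = [3, 4]        # via R2 (3 hops) and via R3 (4 hops)

   average_actual = sum(actual_paths) / len(actual_paths)
   stretch = average_actual - shortest_hops
   print(average_actual, stretch)   # 3.5 0.5 -- "half a hop" of stretch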
Traffic engineering through various tunneling mechanisms is, at a
broad level, adding control plane state to provide more optimal
forwarding (or network utilization). Optimizing network utilization
may require detuning stretch (intentionally increasing stretch) to
increase overall network utilization and efficiency; this is simply
an alternate instance of control plane state (and hence complexity)
weighed against optimal forwarding through the network.
4. Configuration State versus Failure Domain Separation
A failure domain, within the context of a network control plane, can
be defined as the set of devices impacted by a change in the network
topology or configuration. A network with larger failure domains is
more prone to cascading failures, so smaller failure domains are
normally preferred over larger ones. The primary means used to limit
the size of a failure domain within a network's control plane is
information hiding; the two primary types of information hidden in a
network control plane are reachability information and topology
information. An example of aggregating reachability information is
summarizing the routes 192.0.2.1/32, 192.0.2.2/32, and 192.0.2.3/32
into the single route 192.0.2.0/24, along with the aggregation of the
metric information associated with each of the component routes.
Note that aggregation is a "natural" part of IP networks, starting
with the aggregation of individual hosts into a subnet at the network
edge. An example of topology aggregation is the summarization of
routes at a link state flooding domain boundary, or the complete
failure to advertise topology information in a distance-vector
protocol.
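
As an illustration of reachability aggregation (a sketch added here,
not text from the framework itself), the following Python fragment
uses the standard ipaddress module to confirm that a candidate
aggregate covers a set of component routes; the prefixes are the
documentation addresses used in the examples above.

   import ipaddress

   def covers(aggregate, components):
       """Return True if every component prefix falls inside the aggregate."""
       agg = ipaddress.ip_network(aggregate)
       return all(ipaddress.ip_network(c).subnet_of(agg) for c in components)

   components = ["192.0.2.1/32", "192.0.2.2/32", "192.0.2.3/32"]
   print(covers("192.0.2.0/24", components))   # True: the /24 hides the host routes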
While limiting the size of failure domains appears to be an absolute
good in terms of network complexity, there is a definite tradeoff in
configuration complexity. The more failure domain edges created in a
network, the more complex configuration will become. This is
particularly true if redistribution of routing information between
multiple control plane processes is used to create failure domain
boundaries; moving between different types of control planes causes a
loss of the consistent metrics most control planes rely on to build
loop free paths. Redistribution, in particular, opens the door to
very destructive positive feedback loops within the control plane.
Examples of control plane complexity caused by the creation of
failure domain boundaries include route filters, routing aggregation
configuration, and metric modifications to engineer traffic across
failure domain boundaries.
Returning to the network described in the previous section,
aggregating routing information at R2 and R3 will divide the network
into two failure domains: (R1,R2,R3), and (R2,R3,R4,R5). A failure
at R5 should have no impact on the forwarding information at R1. A
false failure domain separation occurs, however, when the metric of
the aggregate route advertised by R2 and R3 is dependent on one of
the routes within the aggregate. For instance, if the metric of the
192.0.2.0/24 aggregate is taken from the metric of the component
192.0.2.1/32, then a failure of this one component will cause changes
in the forwarding table at R1 --in this case, the control plane has
not truly been separated into two distinct failure domains. The
added complexity in the illustration network would be the management
of the configuration required to aggregate the control plane
information, and the management of the metrics to ensure the control
plane is truly separated into two distinct failure domains.
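
The dependency described above can be made concrete with a small toy
model. The Python sketch below is illustrative only; the metric values
and the "track-component" and "fixed" policies are assumptions, not
the behavior of any particular protocol. It shows that when the
aggregate's metric tracks one component, a failure of that component
is visible outside the failure domain, while a fixed metric insulates
R1.

   # Toy model: does the failure of one component change the aggregate
   # advertised by R2/R3 toward R1?  Metric values are assumptions.

   components = {"192.0.2.1/32": 10, "192.0.2.2/32": 20, "192.0.2.3/32": 30}

   def aggregate_metric(component_metrics, policy):
       if policy == "track-component":
           # metric copied from one specific component (192.0.2.1/32 here);
           # None models the aggregate changing when that component is gone
           return component_metrics.get("192.0.2.1/32")
       if policy == "fixed":
           return 100               # administratively chosen constant

   after_failure = {p: m for p, m in components.items()
                    if p != "192.0.2.1/32"}

   for policy in ("track-component", "fixed"):
       changed = (aggregate_metric(components, policy) !=
                  aggregate_metric(after_failure, policy))
       print(policy, "leaks the failure:", changed)
   # track-component leaks the failure: True
   # fixed leaks the failure: False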
Replacing aggregation with redistribution adds the complexity of
managing the feedback of routing information redistributed between
the failure domains. For instance, if R1, R2, and R3 were configured
to run one routing protocol, while R2, R3, R4, R5, and R6 were
configured to run another protocol, R2 and R3 could be configured to
redistribute reachability information between these two control
planes. This can split the control plane into multiple failure
domains (depending on how, specifically, redistribution is
configured), but at the cost of creating and managing the
redistribution configuration. Further, R3 must be configured to block
routing information redistributed at R2 towards R1 from being
redistributed (again) towards R4 and R5.
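
One common way to implement the blocking described above is to tag
routes as they are redistributed at one boundary and filter on that
tag at the other. The sketch below is a generic Python toy model of
that idea; the tag value and data structures are assumptions, not the
syntax of any routing protocol or vendor.

   # Toy model: tag routes redistributed at R2 so R3 can refuse to
   # redistribute them back, preventing a routing information loop.

   REDIST_TAG = "from-domain-A"   # assumed tag value for illustration

   def redistribute_at_r2(routes):
       """R2 copies routes from domain A into domain B, tagging each one."""
       return [dict(route, tag=REDIST_TAG) for route in routes]

   def redistribute_at_r3(routes):
       """R3 blocks anything already tagged at R2 from being re-injected."""
       return [route for route in routes if route.get("tag") != REDIST_TAG]

   domain_a_routes = [{"prefix": "192.0.2.0/24"}]
   in_domain_b = redistribute_at_r2(domain_a_routes)
   print(redistribute_at_r3(in_domain_b))   # [] -- nothing loops back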
5. Policy Centralization versus Optimal Policy Application
Another broad area where control plane complexity interacts with
optimal network utilization is Quality of Service (QoS). Two
specific actions are required to optimize the flow of traffic through
a network: marking and Per Hop Behaviors (PHBs). Rather than
examining each packet at each forwarding device in a network, packets
are often marked, or classified, in some way (typically through Type
of Service bits) so they can be handled consistently at all
forwarding devices. Packet marking policies must be configured on
specific forwarding devices throughout the network. Distributing
marking closer to the edge of the network necessarily means
configuring and managing more devices, but produces optimal
forwarding at a larger number of network devices. Moving marking
towards the network core means packets are marked for proper handling
across a smaller number of devices. In the same way, each device
through which a packet passes with the correct PHBs configured
represents an increase in the consistency in packet handling through
the network as well as an increase in the number of devices which
must be configured and managed for the correct PHBs. The network
below is used for an illustration of this concept.
           +----R1----+
           |          |
        +--R2--+  +--R3--+
        |      |  |      |
        R4     R5 R6     R7
In this network, marking and PHB configuration may be configured on
any device, R1 through R7. Assume marking is configured at the
network edge; in this case, four devices, (R4,R5,R6,R7), must be
configured, including ongoing configuration management, to mark
packets. Moving packet marking to R2 and R3 will halve the number of
devices on which packet marking configuration must be managed, but at
the cost of consistent packet handling at the inbound interfaces of
R2 and R3 themselves. Thus, reducing the number of devices on which
packet marking configuration must be managed also reduces the
optimality of packet flow through the network. Assuming packet
marking is actually configured along the edge of this network,
configuring PHBs on different devices presents this same tradeoff of
managed configuration versus optimal traffic flow. If the correct
PHBs are configured on
R1, R2, and R3, then packets passing through the network will be
handled correctly at each hop. The cost involved will be the
management of PHB configuration on three devices. Configuring a
single device for the correct PHBs (R1, for instance), will decrease
the amount of configuration management required, at the cost of less
than optimal packet handling along the entire path.
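
The tradeoff in this section can be summarized with a back-of-the-
envelope model. The sketch below is an illustration added for clarity;
the chosen path and the simple counting rule are assumptions. It
counts how many devices must carry marking configuration against how
many hops handle a packet according to its marking.

   # Toy model of marking placement for a packet crossing the example
   # network from R4 to R7.  Consistent handling is assumed to begin at
   # the first device on the path that applies the marking.

   path = ["R4", "R2", "R1", "R3", "R7"]

   def tradeoff(marking_devices):
       """Return (devices to manage, hops with consistent handling)."""
       first = min(path.index(d) for d in marking_devices if d in path)
       return len(marking_devices), len(path) - first

   print(tradeoff(["R4", "R5", "R6", "R7"]))   # (4, 5): more config, full path
   print(tradeoff(["R2", "R3"]))               # (2, 4): less config, R2 ingress missed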
6. Configuration State versus Per Hop Forwarding Optimization
The number of PHBs configured along a forwarding path exhibits the
same complexity versus optimality tradeoff described in the section
above. The more types of service (or queues) traffic is divided
into, the more optimally traffic will be managed as it passes through
the network. At the same time, each class of service must be
managed, both in terms of configuration and in its interaction with
other classes of service configured in the network.
7. Reactivity versus Stability
The speed at which the network's control plane can react to a change
in configuration or topology is an area of widespread study. Control
plane convergence can be broken down into four essential parts:
o Detecting the change
o Propagating information about the change
o Determining the best path(s) through the network after the change
o Changing the forwarding path at each network element along the
modified paths
Each of these areas can be addressed in an effort to improve network
convergence speeds; some of these improvements come at the cost of
increased complexity.
Changes in network topology can be detected much more quickly through
faster echo (or hello) mechanisms, lower layer physical detection,
and other methods. Each of these mechanisms, however, can only be
used at the cost of evaluating and managing false positives and high
rates of topology change. If a change in the state of a link can be
detected in 10ms, for instance, the link could theoretically change
state 50 times in a second --it would be impossible to tune a network
control plane to react to topology changes at this rate. Injecting
topology change information into the control plane at this rate can
destabilize the control plane, and hence the network itself. To
counter this, most fast down detection techniques include some form
of dampening mechanism; configuring and managing these dampening
mechanisms represents an additional source of complexity in the
network.
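
A minimal sketch of the kind of dampening described here is shown
below; it is a generic hold-down model, not the mechanism of any
specific protocol, and the timer values are assumptions. A link state
change is passed to the control plane only if a configured hold-down
interval has passed since the last reported change.

   # Generic hold-down dampener: suppress state changes reported faster
   # than the configured hold-down interval.

   class HoldDownDampener:
       def __init__(self, hold_down_seconds):
           self.hold_down = hold_down_seconds
           self.last_report = None

       def report(self, event_time):
           """Return True if this change should reach the control plane."""
           if (self.last_report is None or
                   event_time - self.last_report >= self.hold_down):
               self.last_report = event_time
               return True
           return False                     # change suppressed (dampened)

   d = HoldDownDampener(hold_down_seconds=1.0)
   flaps = [0.00, 0.01, 0.02, 1.50]         # a link flapping every 10ms
   print([d.report(t) for t in flaps])      # [True, False, False, True]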
Changes in network topology must also be propagated throughout the
network, so each device along the path can compute new forwarding
tables. In high speed network environments, propagation of routing
information changes can take place in tens of milliseconds, opening
the possibility of multiple changes being propagated per second.
Injecting information at this rate into the control plane creates the
risk of overloading the processes and devices participating in the
control plane, as well as creating destructive positive feedback
loops in the network. To avoid these consequences, most control
plane protocols regulate the speed at which information about network
changes can be transmitted by any individual device. A recent
innovation in this area is using exponential backoff techniques to
manage the rate at which information is injected into the control
plane; the first change is transmitted quickly, while subsequent
changes are transmitted more slowly. These techniques all control
the destabilizing effects of rapid information flows through the
control plane through the added complexity of configuring and
managing the rate at which the control plane can propagate
information about network changes.
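
The exponential backoff technique mentioned above can be sketched
generically; the following Python fragment is an illustration (the
initial delay, multiplier, and cap are assumed values, not defaults of
any implementation). The first change in a burst is advertised after a
short delay, and each subsequent change waits progressively longer, up
to a maximum.

   # Generic exponential backoff: the delay before advertising doubles
   # for each change in a burst, capped at max_delay.  (A real
   # implementation would also reset after a quiet period.)

   def backoff_delays(num_events, initial=0.05, multiplier=2.0, max_delay=5.0):
       """Delay, in seconds, applied to each of num_events changes."""
       delays, delay = [], initial
       for _ in range(num_events):
           delays.append(delay)
           delay = min(delay * multiplier, max_delay)
       return delays

   print(backoff_delays(6))   # [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]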
All control planes require some form of algorithmic calculation to
find the best path through the network to any given destination.
These algorithms are often lightweight, but they still require some
amount of memory and computational power to execute. Rapid changes
in the network can overwhelm the devices on which these algorithms
run, particularly if changes are presented more quickly than the
algorithm can run. Once a device running these algorithms becomes
processor or memory bound, it could experience a computational
failure altogether, causing a more general network outage. To
prevent computational overloading, control plane protocols are
designed with timers limiting how often they can compute the best
path through a network; often these timers are exponential in nature,
allowing the first computation to run quickly, while delaying
subsequent computations. Configuring and managing these timers is
another source of complexity within the network.
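
These computation timers can be modeled with the same backoff idea,
extended with the reset behavior most such timers include. The sketch
below is a toy model; the timer values and the reset-after-quiet
behavior are assumptions about a typical design, not any specific
protocol's specification.

   # Toy SPF throttle: the first computation after a quiet period runs
   # almost immediately; each further trigger in the same burst doubles
   # the wait, up to a maximum, and the timer resets after quiet time.

   class SpfThrottle:
       def __init__(self, initial=0.05, maximum=10.0, quiet=30.0):
           self.initial, self.maximum, self.quiet = initial, maximum, quiet
           self.current = initial
           self.last_trigger = None

       def delay_for(self, now):
           """Seconds to wait before running SPF for a trigger at `now`."""
           if (self.last_trigger is not None and
                   now - self.last_trigger > self.quiet):
               self.current = self.initial        # quiet period: reset
           delay, self.last_trigger = self.current, now
           self.current = min(self.current * 2, self.maximum)
           return delay

   t = SpfThrottle()
   print([t.delay_for(now) for now in (0, 1, 2, 3, 60)])
   # [0.05, 0.1, 0.2, 0.4, 0.05] -- grows in a burst, resets when quiet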
Another option to improve the speed at which the control plane reacts
to changes in the network is to precompute alternate paths at each
device, and possibly preinstall forwarding information into local
forwarding tables. Additional state is often needed to precompute
alternate paths, and additional algorithms and techniques are often
configured and deployed. This additional state, and these additional
algorithms, add some amount of complexity to the configuration and
management of the network. In some situations (for some topologies),
a tunnel is required to pass traffic around a network failure or
topology change. These tunnels, while not manually configured,
represent additional complexity at the forwarding and control planes.
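
As one example of the precomputation described above, a commonly used
check is the basic loop-free alternate condition: a neighbor can be
preinstalled as a backup next hop if its own shortest path to the
destination does not pass back through the computing router. The
sketch below is illustrative; the distance values are assumptions.

   # Basic loop-free alternate check: neighbor N of router S is a safe
   # precomputed alternate toward destination D if
   #     dist(N, D) < dist(N, S) + dist(S, D)
   # i.e. N's best path to D does not loop back through S.

   def is_loop_free_alternate(dist_n_d, dist_n_s, dist_s_d):
       """Return True if neighbor N can be preinstalled as a backup."""
       return dist_n_d < dist_n_s + dist_s_d

   print(is_loop_free_alternate(dist_n_d=20, dist_n_s=10, dist_s_d=15))  # True
   print(is_loop_free_alternate(dist_n_d=30, dist_n_s=10, dist_s_d=15))  # False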
8. Conclusion
This document describes various areas of network design where
complexity is traded off against some optimization in the operation
of the network. This is (by its nature) not an exhaustive list, but
it can serve to guide the measurement of network complexity and the
search for other areas where these tradeoffs exist.
9. Security Considerations
None.
10. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
Authors' Addresses
Alvaro Retana
Cisco Systems
2610 Wycliff Road
Raleigh, NC 27607
USA
Email: aretana@cisco.com
Russ White
Verisign
12061 Bluemont Way
Reston, VA 20190
USA
Email: russw@riw.us