Internet Engineering Task Force                                   M. Toy
Internet-Draft                                                   Comcast
Intended status: Informational                             June 29, 2015
Expires: December 31, 2015


Architectural Framework for Self-Managed Networks with Fault Management
                               Hierarchy
            draft-mtoy-anima-self-faultmang-framework-00.txt

Abstract

   This document describes a self-managed network identifying network
   problems during failures and repairing them.  Self-managed Network
   Element (sNE) architectures and Network Management System (sNMS)
   architectures for centrally and distributedly managed networks are
   described.  A hierarchy among repairing entities is defined.  An in-
   band message format for Metro Ethernet networks is proposed for the
   fault management communication.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 31, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must



Toy                     Expires December 31, 2015               [Page 1]

Internet-Draft          self-faultmang-framework               June 2015


   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  sNE Architecture . . . . . . . . . . . . . . . . . . . . . . .  3
   3.  Self-Managing Network Management System (sNMS) Architecture  .  6
   4.  Intelligent Agent Architecture . . . . . . . . . . . . . . . .  9
   5.  Self and Centrally Managed Networks  . . . . . . . . . . . . .  9
   6.  Self and Distributedly Managed Networks  . . . . . . . . . . . 10
   7.  In-band Communications of Failure types, Estimated Fix
       Time and Fix . . . . . . . . . . . . . . . . . . . . . . . . . 11
   8.  Failure Fixing Hierarchy in Centrally Managed Networks . . . . 13
   9.  Failure Fixing Hierarchy in Distributedly Managed Networks . . 14
   10. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 15
   11. Security Considerations  . . . . . . . . . . . . . . . . . . . 16
   12. IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 16
   13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16
     13.1.  Normative References  . . . . . . . . . . . . . . . . . . 16
   Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 16


1.  Introduction

   The industry is focused on auto-configuration [GANA], [SUSEREQ],
   [SELFMAN] and monitoring of network resources and services, isolating
   problems when there are failures, and fixing them, most of the time
   by sending technicians to the sites or, for configuration-related
   problems, by downloading configuration files remotely.  The concept
   of a network identifying problems by itself, fixing them, and sending
   technicians to the failure site only when there is a single point of
   hardware failure (i.e., there is no hardware redundancy) is not
   practiced [SMN,SMCEN].  Tools for self-managed networks have not been
   developed either.  On the other hand, auto-configuration of network
   elements (NEs) such as cable modems (CMs) and cable modem termination
   systems (CMTSs) is being practiced by Multiple System Operators
   (MSOs) using Data Over Cable Service Interface Specification (DOCSIS)
   back-office systems.  Similar procedures are also used by DPoE
   networks [DPoE] for auto-configuration of NEs and services.  This
   draft does not discuss auto-configuration, but focuses on the fault
   management aspects of self and centrally or distributedly managed
   networks.

   This draft describes a self-managed network where each self-managed
   NE (sNE) in a network monitors its hardware and software resources
   periodically, runs diagnostics tests during failures in a
   hierarchical fashion, identifies problems if they are local to the
   sNE and fixable by the sNE, and reports failures and fixes to a
   centralized network management system (sNMS) to be accessed by
   network operators, field technicians, customers, and other sNEs in
   the network.  If the problem is not locally fixable by the sNE, the
   Self-Managing Regional NMS (sNMRn) or sNMS runs its own rule-based
   logic to determine if the problem is fixable remotely by the sNMRn or
   sNMS.  If it is not, a message (i.e., a notification) is sent to a
   network operator or field technician to fix the problem.

   The failure type, whether the problem is fixable locally by the sNE,
   remotely by the sNMRn or sNMS, or by a technician, and the estimated
   fix time are communicated with a newly defined message format.  The
   hierarchy of fixing failures depends on the network architecture, as
   discussed in Sections 8 and 9.


2.  sNE Architecture

   An sNE (Figure 1) consists of an intelligent NE (iNE) and intelligent
   agents.

   The intelligent NE (iNE) is built to have redundant hardware and
   software components as depicted in Figure 1, where each hardware or
   software component is intelligent enough to run its own diagnostics
   and identify faulty subcomponents.  Self-managing agents (i.e.,
   intelligent agents) may take over after the internal diagnostics of
   each component are completed.  Furthermore, the iNE keeps a redundant
   copy of its current or default configuration.

   The intelligent subcomponents can be the smallest replaceable units,
   such as chips, operating systems, and protocol software, that are
   capable of periodic self-checking, declaring a failure when they are
   unable to perform their functions, running diagnostics and
   identifying whether the faulty entity is within the subcomponent or
   not, and escalating the diagnostics to the next level in the
   hierarchy when the diagnostics are inconclusive.

   When there is a failure, if the failed entity remains unidentified
   after the diagnostic tests run by the intelligent subcomponents, the
   iNE is able to run diagnostics for a pre-defined set of subcomponents
   that collectively perform a specific function.  A pre-defined set of
   subcomponents can be a collection of components that contribute to
   the realization of a main function such as packet forwarding, deep
   packet inspection, event forwarding, etc.

   If the diagnostic tests run for a pre-defined set of subcomponents
   cannot identify the failed entity, the iNE is able to run diagnostics
   at the NE level to determine the failure.  After the failure is
   isolated to the smallest replaceable hardware entity (e.g., chips,
   wires connecting chips, backplane, etc.) and/or software entity
   (e.g., kernel, log, protocol software, event forwarding
   discriminator, etc.), the responsible intelligent agents determine if
   the failure is fixable and initiate a message to related parties with
   the estimated fix time.  If the iNE diagnostics are inconclusive,
   that is communicated as well.
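
   As an illustration only, the escalation from subcomponent level to
   function-group level to NE level can be sketched in Python.  The
   function name, the level names, and the example entity identifiers
   are assumptions of this sketch and are not defined by this draft.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each diagnostic level is a (name, test) pair.  The test returns the
# identifier of the failed entity, or None when it is inconclusive.
DiagLevel = Tuple[str, Callable[[], Optional[str]]]


def run_hierarchical_diagnostics(
    levels: Sequence[DiagLevel],
) -> Tuple[Optional[str], str]:
    """Run diagnostics level by level and stop at the first conclusive one.

    Returns (failed_entity, level_name).  When no level isolates the
    fault, failed_entity is None and level_name is "inconclusive".
    """
    for name, test in levels:
        failed = test()
        if failed is not None:
            return failed, name
    return None, "inconclusive"


# Hypothetical example: the subcomponent self-test is inconclusive and
# the function-group test isolates the fault, so NE level is never run.
levels = [
    ("subcomponent", lambda: None),
    ("function-group", lambda: "packet-forwarding/chip-3"),
    ("ne", lambda: "backplane"),
]
result = run_hierarchical_diagnostics(levels)
```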

   Each self-managing agent (i.e., intelligent agent) monitors the
   entity that it belongs to, may run additional diagnostic tests to
   identify problems during failures, initiates a failure message, fixes
   problems, and initiates a fix notification to the central or regional
   self-managing systems and other related entities.  The message (i.e.,
   notification) indicating that the fixing entity is the sNE is
   communicated to other sNEs, regional and central network management
   systems, field technicians, and customers (if desired).  If the
   problem is determined to be not fixable locally after two or three
   tries or without a try, depending on the problem, a message is sent
   to the regional or central network management systems by the sNE
   indicating that the fixing entity is unidentified.
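
   A minimal Python sketch of the bounded local fix attempts follows.
   The function and constant names are hypothetical, and the default of
   three tries is only one of the values the text allows.

```python
from typing import Callable

FIXER_SNE = "sNE"
FIXER_UNIDENTIFIED = "unidentified"


def attempt_local_fix(fix: Callable[[], bool], max_tries: int = 3) -> str:
    """Try a local fix up to max_tries times and name the fixing entity.

    max_tries=0 models a problem that is escalated without a try.
    """
    for _ in range(max_tries):
        if fix():
            return FIXER_SNE
    return FIXER_UNIDENTIFIED


# Hypothetical fix routine that succeeds on its second invocation.
attempts = []


def flaky_fix() -> bool:
    attempts.append(1)
    return len(attempts) >= 2


outcome = attempt_local_fix(flaky_fix)
```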

   The intelligent agents are one or more intelligent Hardware
   Maintenance Agent(s) (iHMA(s)), intelligent Operating System
   Maintenance Agent(s) (iOMA(s)), intelligent Application Maintenance
   Agent(s) (iAMA(s)), and intelligent Capacity Management Agent(s)
   (iCMA(s)), depending on the implementation.

   The iHMA is capable of periodically monitoring hardware entities such
   as CPU, memory, physical ports, communication channels, buffers,
   backplane, power supplies, etc., and initiating pre-defined
   maintenance actions during hardware failures.  The iOMA is capable of
   periodically monitoring the operating system and initiating
   pre-defined maintenance actions during operating system failures.
   The iAMA is capable of periodically monitoring application software
   and protocol software, and initiating pre-defined maintenance actions
   during application and protocol software failures.  The iCMA is
   capable of periodically monitoring system capacity, load, and
   performance, and collecting measurements.  When capacity thresholds
   are exceeded, the iCMA initiates pre-defined maintenance actions.
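
   The threshold check performed by an iCMA could look like the
   following Python sketch.  The metric names, units, and threshold
   values are assumptions chosen for illustration; the draft does not
   define them.

```python
def exceeded_thresholds(measurements: dict, thresholds: dict) -> list:
    """Return the sorted names of metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, value in measurements.items()
        if name in thresholds and value > thresholds[name]
    )


# Hypothetical metrics (percent utilization) and limits.
thresholds = {"cpu": 90.0, "memory": 85.0, "buffer": 80.0}
sample = {"cpu": 95.5, "memory": 60.0, "buffer": 80.0}

# Only metrics strictly above their limit would trigger a maintenance
# action; "buffer" sits exactly at its limit and is not reported.
alarms = exceeded_thresholds(sample, thresholds)
```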


      +----------------------------------------------------------------+
      |           +--------------------------------------------------+ |
      |           |                    iNE                           | |
      | +-------+ |  +------------------+   +---------------------+  | |
      | | iCMA  | |  | intelligent and  |   |  intelligent and    |  | |
      | |       | |  | redundant HW     |   |  redundant SW       |  | |
      | +-------+ |  | subcomponent 1   |   |  subcomponent 1+    |  | |
      | +-------+ |  +---------+--------+   +----------+----------+  | |
      | |  iOMA | |            |                       |             | |
      | |       | |                                                  | |
      | +-------+ |            |                       |             | |
      | +-------+ |  +---------+---------+  +----------+-----------+ | |
      | |  iAMA | |  | intelligent and   |  | intelligent and      | | |
      | |       | |  | redundant HW      |  | redundant SW         | | |
      | +-------+ |  | subcomponent N    |  | subcomponent N       | | |
      | +-------+ |  +-------------------+  +----------------------+ | |
      | |iHMA   | +--------------------------------------------------+ |
      | |       |            +------------------------+                |
      | +-------+            |backup copy of current  |                |
      |                      |configuration files     |                |
      |                      +------------------------+                |
      +----------------------------------------------------------------+

                  Figure 1: Self-Managed NE Architecture


+----------------------------------------------------------------------+
|  +--------------------------------------------------------+          |
|  | Hierarchical Diagnostic and Trouble Isolation Logic    |          |
|  +--------------------------------------------------------+          |
| |---------------------------------------------------------+ +--+ +--+|
| |------------------------------------------------------+  | |  | |  ||
| || +---------------------+    +---------------------+  |  | |S | |S ||
| || |intelligent smallest |    |intelligent smallest |  |  | |u | |u ||
| || |replaceable unit     | ++ |replaceable unit     |  |  | |b | |b ||
| || |(ISRU)               |    |(ISRU)               |  |  | +  | +  ||
| || +---------------------+    +---------------------+  |  | |C | |C ||
| ||   a group of ISRUs providing functions collectively |  | |o | |o ||
| +-----------------------------+------------------------+  | |m | |m ||
| |                             |                           | |p | |p ||
| |                             |                           | |o | |o ||
| +------------------------------------------------------+  | |n | |n ||
| || +--------------------+     +----------------------+ |  | |e | |e ||
| || |intelligent smallest|     |intelligent smallest  | |  | |n | |n ||
| || |replaceable unit    | ++  |replaceable unit      | |  | |t | |t ||
| || |(ISRU)              |     |(ISRU)                | |  | |  | |  ||
| || +--------------------+     +----------------------+ |  | |2 | |N ||
| ||   a group of ISRUs providing functions collectively |  | |  | |  ||
| +------------------------------------------------------+  | |  | |  ||
| |                   SubComponent 1                        | |  | |  ||
| +---------------------------------------------------------+ +--+ +--+|
++ --------------------------------------------------------- +---------+

                   Figure 2: Intelligent NE Architecture


3.  Self-Managing Network Management System (sNMS) Architecture

   A Central or Regional sNMS consists of an intelligent NMS (iNMS) that
   mainly deals with remote fixes; a Task Manager (TM) to manage tasks
   to be executed; copies of software modules for each type of sNE; a
   Traffic Manager (TrfMgr) to deal with network level traffic
   management issues such as routing policies, load balancing,
   connection admission control, and congestion control; an Event
   Forwarding Discriminator (EFD) to forward failures and fixes to
   network operators and customers; a database (DB) to store data; and a
   user interface such as a Graphical User Interface (GUI) (Figure 3).
   The sNMS is redundant: the active sNMS is protected by a stand-by
   sNMS.  The iNMSs in the active and stand-by units perform periodic
   self-checking.  When the active sNMS fails, the stand-by sNMS takes
   over its responsibilities.
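
   The active/stand-by behavior can be sketched in Python as follows.
   The class name, unit names, and the health flags are hypothetical;
   how a unit actually determines its own health is outside the scope
   of this sketch.

```python
class RedundantSNMS:
    """Active/stand-by sNMS pair driven by periodic self-check results."""

    def __init__(self, active: str, standby: str):
        self.active = active
        self.standby = standby
        # Health as reported by each unit's periodic self-checking.
        self.healthy = {active: True, standby: True}

    def self_check(self) -> str:
        """Swap roles if the active unit failed and the stand-by is healthy."""
        if not self.healthy[self.active] and self.healthy[self.standby]:
            self.active, self.standby = self.standby, self.active
        return self.active


pair = RedundantSNMS("snms-a", "snms-b")
before = pair.self_check()

# Simulate a failure of the active unit; the stand-by takes over.
pair.healthy["snms-a"] = False
after = pair.self_check()
```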

   The user interface provides human and machine interfaces.  The
   database (DB) stores user interface events and data collected from
   the network.  The Task Manager prioritizes and schedules execution of
   tasks, including repair and configuration activities that can be
   performed remotely using a Rule Based Logic module.  A Data Handler
   collects end-to-end connection level measurements and sNE level
   capacity measurements, and stores them in the DB to support the
   TrfMgr.

   The Task Manager (TM) of the sNMS manages tasks to be executed by the
   sNMS.  The Rule-Based Logic determines if the problem is remotely
   fixable by the iNMS.

   The iNMS is expected to include a Fix Manager (FixMgr) for each sNE
   type to fix sNE problems remotely, to store software modules specific
   to the sNE, and to be capable of running network level traffic
   management algorithms such as routing policies, load balancing,
   connection admission control, and congestion control.  Furthermore,
   the iNMS holds a copy of each sNE agent and loads it into sNEs
   remotely when needed.
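
   One possible way to organize a Fix Manager per sNE type, together
   with the rule that decides whether a failure is remotely fixable, is
   sketched below in Python.  The registry layout, the sNE type names,
   and the failure names are assumptions of this sketch.

```python
from typing import Callable, Dict, Optional

# Registry of remote fix handlers: sNE type -> failure name -> handler.
# A handler takes the sNE identifier and returns True when the fix works.
FIX_MANAGERS: Dict[str, Dict[str, Callable[[str], bool]]] = {}


def register_fix(sne_type: str, failure: str,
                 handler: Callable[[str], bool]) -> None:
    """Add a remote fix handler to the Fix Manager of one sNE type."""
    FIX_MANAGERS.setdefault(sne_type, {})[failure] = handler


def remotely_fixable(sne_type: str, failure: str) -> bool:
    """Rule: a failure is remotely fixable if a handler is registered."""
    return failure in FIX_MANAGERS.get(sne_type, {})


def remote_fix(sne_type: str, failure: str, sne_id: str) -> Optional[bool]:
    """Run the handler; None means the failure must be escalated."""
    handler = FIX_MANAGERS.get(sne_type, {}).get(failure)
    return None if handler is None else handler(sne_id)


# Hypothetical handler that reloads a configuration file remotely.
register_fix("type-1", "corrupt-config", lambda sne_id: True)
fixed = remote_fix("type-1", "corrupt-config", "sne-42")
```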

+---------------------------------------------------------------------+
|                           GUI                                       |
+--------------------------------------------------------+--+---------+
|                         iNMS                              |event    |
+--------------------------------------------------------+  |forwarder|
|| hierarchical network level  diagnostics and Trouble   || +--+------+
|| identification logic                                  ||    |
|-------------------+--------------------------+----------| +--+--++--+
||copies of iNE     |    |copies of iNE        | |Fix Mgr|| |rule ||DB|
||Type 1 Software:  | + ++Type N Software:     | |for iNE|--|based|--+|
||Operating System, |    |Operating System,    | |Type 1 || |logic|+--+
||Applications, etc.|    |Applications,etc.    | +---+----+ +-----++---+
|-------------------+    +---------------- ----+     |    | |Task ||Trf|
|------------------+ ++ +----------------------+ +---+----+ | mgr ++Mgr|
||periodic network    | |traffic management    | |Fix Mgr|| |     |+---+
||level monitoring,   | |algorithms and policies |for iNE|| +--+--+
||network level troubl| |, connection admission| |Type N ||    |
||isolation and fixing| |control, load balancing +--------+ +--+----+
||network level       | |, congestion control  |          | |data   |
||troubles            | |                      |          | |handler|
|---------------------+ +----------------------+          | +-------+
|---------------------+  +----------------------+         |
||periodic self checking |rule based logic to   |         |
||and switchover to   |  | verify iNE reporting |         |
||back-up iNMS during |  |of that failure is not          |
|| failures.          |  |local and if it is    |         |
+---------------------+   |fixable by iNMS      |         |
|                        +----------------------+         |
+---------------------------------------------------+-----+

                 Figure 3: Self Managing NMS Architecture

   An intelligent NMS (iNMS) (Figure 3) periodically monitors the
   network that the sNMS is managing, identifies network level failures,
   estimates and communicates the fix time to related parties, and fixes
   the failures.  When the sNE reports that the failure is not local
   (i.e., either the tests are inconclusive or the sNE is not capable of
   fixing it), the Rule Based Logic of the sNMS verifies whether the
   failure is indeed not local to the sNE.

   No changes are introduced to the interfaces between the management
   systems and the network for self-management.  Well-known protocols
   such as SNMP, IPDR (IP Detail Record) for usage information, and
   Network Configuration Protocol (NETCONF) for manipulating
   configuration data and examining state information, together with
   YANG modeling, can be employed.


4.  Intelligent Agent Architecture

   The intelligent agent architecture is depicted in Figure 4.  Its Rule
   Based Logic module determines problems and initiates fixes if the
   problems are local to the sNE, initiates tests for the fixes,
   determines if the fix procedure or some of its steps are to be
   repeated, and initiates a message to all related parties about the
   fix.  If the problem is not local to the sNE, the agent informs all
   related parties, including the sNMS, of its conclusion that the
   fixing entity is unidentified.  If the diagnostics are inconclusive
   (i.e., they cannot identify the failed component), that is conveyed
   as well.

   A Scheduler module determines the priority and order of the tasks for
   each functional entity within the sNE that it belongs to.  An
   Application Programming Interface (API) provides an interface to
   various types of Software and Hardware entities within the sNE.  A
   Data Handler module collects necessary data for the sNE, performs the
   fix, and keeps the data associated with the task.  The Authorization
   (AUTH) module authenticates local user access and remote user access
   from the sNMS interface to sNE agents.  The Utilities module supports
   various file operations.
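
   The Scheduler module can be illustrated with a small priority queue
   in Python.  The class name, the convention that a lower number means
   a higher priority, and the task descriptions are assumptions of this
   sketch.

```python
import heapq
import itertools
from typing import Optional


class Scheduler:
    """Orders agent tasks by priority, then by arrival order."""

    def __init__(self) -> None:
        self._heap = []
        # Monotonic counter keeps equal-priority tasks in arrival order.
        self._counter = itertools.count()

    def add(self, priority: int, task: str) -> None:
        """Queue a task; a lower number means a higher priority."""
        heapq.heappush(self._heap, (priority, next(self._counter), task))

    def next_task(self) -> Optional[str]:
        """Pop the most urgent task, or return None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = Scheduler()
scheduler.add(2, "collect capacity data")
scheduler.add(0, "restart protocol software")
scheduler.add(2, "rotate logs")
order = [scheduler.next_task() for _ in range(3)]
```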

+-----------------------------------------------------------+
|                                                           |
| +------+       +----------------------------+ +-------+   | +--------+
| |Rule  |       | Scheduler                  +-+       |   | |NE      |
| |Based |       +----------------------------+ |       |   | |        |
| +Logic +                                      | API   |   | |SW/HW   |
| -------         +---------------------------+ |       |   +-+Modules |
| +-------------+ |     Data Handler          | |       |   | |        |
| |Authorization| |                           +-+       |   | +--------+
| +-------------+ +---------------------------+ +---+---+   |
|                 | Task|| Fix     ||Data     |     |       |
|                 | Data|| Delivery+|Collector|     |       |
|                 |     || Agent   | ---------+ +---+-------+
|                 +----------------+            |Utilities ||
|                                               +-----------|
+-----------------------------------------------------------+

                 Figure 4: Intelligent Agent Architecture


5.  Self and Centrally Managed Networks

   A self and centrally managed network architecture consisting of self-
   managed NEs and self-managing NMS is depicted in Figure 5.

   sNE related failures are handled locally by the sNE.  If the problem
   is determined to be not fixable by the sNE after two or three tries
   or without a try, depending on the problem, a message is sent to the
   sNMS by the sNE indicating that the fixing entity is unidentified.
   If the problem is locally fixable, the sNE sends a message to the
   sNMS, other sNEs, field technicians, and users, indicating the fixing
   entity and how long the fix is going to take.

   +-----------+                               +-----------+
   |   iNE     |                     SNMP/YANG +  iNMS     |
   |           +-----+XXXXXXXXXXXX +---------- | TrfMgr,   |
   |iHMA, iCMA |    X|X          XX+XXXX       |GUI,TM, EFD|
   |iOMA, iAMA |   XX|                 XXXX    |DB, FixMgr |
   +-----------+  XX +                     XX  +-----------+
       sNE        X                         XX     sNMS
                  X       Network            X
                  XX                         X
                   X                        XX
    +-----------+  X                       XX   +-----------+
    |iNE        |  X+X                   XXX+---+iNE        |
    |           +---+XXXXXXXXXXXX   X XX+       |           |
    |iHMA, iCMA |               XXX             |iHMA, iCMA |
    |iOMA, iAMA |                               |iOMA, iAMA |
    +-----------+                               +-----------+
        sNE                                           sNE

         Figure 5: Self and Centrally Managed Network Architecture


6.  Self and Distributedly Managed Networks

   A self and distributedly managed network architecture is given in
   Figure 6.  The network is divided into multiple regions where each
   region is managed by a regional self-managing NMS (sNMRn).  One of
   the sNMRs in the network acts as the central sNMS.  The regional
   sNMRs and the central sNMS are connected to each other via in-band
   and/or out-of-band communications facilities.

                             +--------------+        +--------------+
                             |Regional      |        | Self-Managing|
         +---+               |Self-Managing +--------+ NMS          +--+
         |sNE|           +---+NMS           |        +-------+------+  |
         +---+   XXXXXXXX+   +--------------+                |         |
             XXXXX       XX                           XXXXXXX++        |
    +---+   XX  Network   +    Links Connecting   XXXXX       XXXXXX+  |
    |sNE|   X   Region 1  +-----------------------+    Network      X  |
    +---+   XXXX XXXXXXXXXX    Regional Networks  X    Region N     X  |
               XXX                  +------------+X               XXX  |
                +                   |             XXXXXXXXXXXXXXXX     |
                |                   |                                  |
  +---+    XXXXX+XXXXXXXXXX+--------+             in-band or out-of    |
  |sNE|   XX    Network   X|                      +--------------------+
  +---+   X     Region 2  X|   +--------------+   | band Connectivity
  +---+   X               X+---+Regional      +---+ among NMSs
  |sNE|   XXXXXXXXXXXXXXXXX    |Self-Managing |
  +---+    X                   |NMS 2         |
                               +--------------+

       Figure 6: Self and Distributedly Managed Network Architecture

   The sNMRn provides all the centralized management functions for its
   own subnet and informs the central sNMS about its activities.
   End-to-end network level activities beyond region boundaries are left
   to the sNMS.  These activities can be Connection Admission Control
   (CAC), load balancing, and congestion control at the end-to-end
   network level.
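
   The division of responsibility between a regional manager and the
   central sNMS can be expressed as a simple rule, sketched here in
   Python.  The function name and the representation of regions as
   integers are assumptions of this sketch.

```python
def responsible_manager(endpoint_regions: set, central: str = "sNMS") -> str:
    """Pick the manager for an activity from the regions it touches.

    An activity confined to one region stays with that region's sNMR;
    anything crossing region boundaries is left to the central sNMS.
    """
    if len(endpoint_regions) == 1:
        (region,) = endpoint_regions
        return f"sNMR{region}"
    return central


intra_region = responsible_manager({2})
cross_region = responsible_manager({1, 2})
```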


7.  In-band Communications of Failure types, Estimated Fix Time and Fix

   In today's networks, failures related to equipment, ports, and
   connections are mostly reported to an NMS via SNMP traps, or to NEs
   via in-band communications such as AIS (Alarm Indication Signal), RDI
   (Remote Defect Indication), and Continuity Check Message (CCM)
   related events such as Loss of Continuity (LoC) [Y.1731].  These
   alarms and traps identify the failed NE, port, or connection, but do
   not identify the component contributing to the failure.
   Furthermore, each has a different format.

   For self-management, it is necessary to identify faulty components,
   estimate the time for the fix, and communicate that to all parties
   involved (i.e., sNEs, sNMRn, sNMS, field technicians, and customers),
   so that working sNEs can store (if desired) data routed to the failed
   sNE(s) for the duration of the fix or re-route traffic around the
   failed sNE(s) or port(s).  For simplicity, all messages should have
   the same format.

   Figure 7 depicts a possible Ethernet frame for Ethernet networks to
   carry all the information described above.  Similar messages are to
   be created for other types of networks such as IP, MPLS and IMS.

+---+--+---+--+--+---+---+-----+----+--------+-----+-----+--------+---+
|IFG|P |SFD|DA|SA|L/T|fNE|fComp|Op  | Failure| Fix | Fix | PAD (25|CRC|
|   |  |   |  |  |   |ID |ID   |Code|  Code  | Code| Time|bytes 0)|   |
+---+--+---+--+--+---+---+-----+----+--------+-----+-----+--------+---+

IFG: Interframe Gap, 12 bytes
P/SFD: Preamble/Start of Frame Delimiter, 8 bytes (P: 7 bytes,
          SFD: 1 byte)
DA: 6 bytes, 01:80:C2:00:00:02 (slow protocol multicast address)
L/T: Length/Type, 2 bytes (0x8808), length of frame or data type
fNE ID: 6 bytes, Failed sNE Identifier
fComp ID: 4 bytes, Failed Component Identifier
Op Code: 2 bytes, 0x0202 for Disabled and 0x0303 for Enabled status
Failure Code: 4 bytes
Fix Code: 1 byte identifying fixing entity, NE (x00), sNMS (x01),
          sRMS (x02),  sNMS-v (x03), RNMS-v (x04), sNMS-s (x05),
          sRNMS-s (x06), field technician (x07),
          unidentified entity or inconclusive diag (x08)
Fix Time: 4 bytes indicating fix time in seconds by NE, NMS,
          or field technician
CRC: 4 bytes

       Figure 7: Self-Managing message frame format for Self-managed
                             Ethernet networks

   For Ethernet networks, the slow protocol multicast address can be
   used to inform sNEs, the sNMS, and field technician devices connected
   to the network.  fNE ID indicates the MAC address of the failed sNE.
   fComp ID indicates the failed component identifier within the sNE.
   Op Code indicates whether the sNE or port is operationally disabled
   or enabled.

   The operational status is disabled during failures and becomes
   enabled after the failure is fixed.  Failure Code indicates the
   failure type.  If the failure type cannot be identified through
   diagnostics, Failure Code indicates that the failure is unidentified,
   that the diagnostics are inconclusive, or that the failure is not
   local to the sNE.  Fix Code identifies the repairing entity, whether
   it is the sNE, sNMRn, sNMS, or a field technician.
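
   To make the field layout of Figure 7 concrete, the following Python
   sketch packs and parses the fields from DA through PAD.  Network byte
   order, the example MAC address, the failure code value 0x10, and all
   function and constant names are assumptions of this sketch; IFG,
   P/SFD, and CRC are assumed to be handled by the Ethernet MAC.

```python
import struct

# DA, SA, L/T, fNE ID, fComp ID, Op Code, Failure Code, Fix Code,
# Fix Time, then 25 zero PAD bytes.  "!" selects network byte order
# with no alignment (the byte order is an assumption of this sketch).
FRAME = struct.Struct("!6s6sH6sIHIBI25x")

SLOW_PROTOCOL_DA = bytes.fromhex("0180c2000002")
L_T = 0x8808
OP_DISABLED, OP_ENABLED = 0x0202, 0x0303
FIX_BY_NE, FIX_UNIDENTIFIED = 0x00, 0x08
FIELDS = ("da", "sa", "l_t", "fne_id", "fcomp_id", "op_code",
          "failure_code", "fix_code", "fix_time")


def build_frame(sa, fne_id, fcomp_id, op_code, failure_code,
                fix_code, fix_time):
    """Serialize DA through PAD; IFG, P/SFD, and CRC are left to the MAC."""
    return FRAME.pack(SLOW_PROTOCOL_DA, sa, L_T, fne_id, fcomp_id,
                      op_code, failure_code, fix_code, fix_time)


def parse_frame(frame):
    """Return the frame fields as a dict keyed by the names in FIELDS."""
    return dict(zip(FIELDS, FRAME.unpack(frame)))


# Hypothetical failed sNE reporting failure code 0x10, fixable by the
# NE itself in an estimated 120 seconds.
mac = bytes.fromhex("020000000001")
frame = build_frame(mac, mac, 7, OP_DISABLED, 0x10, FIX_BY_NE, 120)
fields = parse_frame(frame)
```

   With these field sizes the packed portion is 60 bytes, so adding the
   4-byte CRC yields a 64-byte frame.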

   It is possible to allocate six bytes to the Fix Code field to
   indicate the MAC address of the fixing entity.  It is also possible
   to identify the failure type and not fix it.  In this case, the
   fixing entity is unidentified.  It is also possible that both the
   failure code and the fix code are unidentified.  Fix Time indicates
   the estimated time in seconds for repair, which is set by the
   repairing entity.  In order for the sNE, sNMRn, or sNMS to provide
   the estimated fix time, the fix time for each type of failure needs
   to be stored in the sNE and the sNMRn or sNMS.  If the failure is
   going to be fixed by a field technician, the technician may enter the
   fix time manually into the related management system to communicate
   it to all related parties.
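
   A stored table of fix times per failure type, with an optional
   manual entry by a technician, might be looked up as in the Python
   sketch below.  The failure code values and the durations are
   hypothetical placeholders, not values defined by this draft.

```python
from typing import Optional

# Hypothetical per-failure-type estimates, in seconds.
FIX_TIME_SECONDS = {0x0001: 30, 0x0002: 300}


def estimated_fix_time(failure_code: int,
                       manual_entry: Optional[int] = None) -> Optional[int]:
    """Return the fix time to advertise, in seconds.

    A value entered manually by a field technician takes precedence;
    otherwise the stored table is consulted.  None means that no
    estimate is available for this failure type.
    """
    if manual_entry is not None:
        return manual_entry
    return FIX_TIME_SECONDS.get(failure_code)


stored = estimated_fix_time(0x0002)
manual = estimated_fix_time(0x0002, manual_entry=7200)
unknown = estimated_fix_time(0x00FF)
```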

   Given that the sNMRn and sNMS interfaces use network management
   protocols such as SNMP, the information in the message (Figure 7)
   needs to be conveyed to the sNMS via an SNMP trap.  Similarly, an
   SNMP trap from the sNMRn or sNMS needs to be converted into an
   in-band message to convey the information to self-managing NEs.


8.  Failure Fixing Hierarchy in Centrally Managed Networks


                                   +---------+
                                   | Failure |
                                   | in sNE  |
                                   +----+----+
                                        |
                                        |
                                     XXX+XX
                          No      XXXX     XXXXX     Yes
                      +--------+XXXis it locallyXXX+-------+
                      |        XXX fixable by XXXX         |
                      |          XXXX sNE? XXXX            |
                    XX+XXX          XXXX XXX               |
          No   XXXXXX    XXXXXXX  Yes          +-----------+-----------+
        +----+XX is it remotelyX+------+       |sNMS,field technicians,|
        |     XX fixable by   XX       |       |sNEs and users wait for|
        |      X sNMS?    XXXXX        |       |estimated Fix Time for |
        |      XXX    XXXXX            |       |notification from sNE  |
        |         XXXXX                |       +-----------------------+
        |                              |
        |                              |
        |                              |
  +-----+-------------------+   +------+-----------------+
  |Field technician sets fix|   |sNMS sets Fix Time in   |
  |time in notification and |   |notification and sends  |
  |initiates notification   |   |the notification to a   |
  |to sNEs, sNMS and users  |   |sNE to communicate that |
  +-------------------------+   |to other sNEs, field    |
                                |technicians and users   |
                                +------------------------+

    Figure 8: Fault Management Hierarchy for Self and Centrally Managed
                                 Networks

   In a centrally managed network, when there is a failure, the sNE
   determines whether the failure is local to the sNE.  If the failure
   is local, the sNE informs other sNEs, the sNMS, field technicians,
   and customers about the failure type and fix time.  If the sNE
   decides that the failure is not local, it escalates the problem to
   the sNMS.  The sNMS verifies that the failure is not local to the
   sNE and determines whether it can fix the problem.  If the sNMS can
   fix the problem, it communicates the failure type and fix time to
   sNEs, field technicians, and customers.  If the sNMS determines that
   it cannot fix the failure, it escalates the problem to field
   technicians.  The field technician communicates the fix time to
   sNEs, the sNMS, and customers.  After the fix is completed, the
   fixing entity initiates a self-managed notification with Enabled
   status (i.e., Opcode is set to Enabled) to other sNEs, the sNMS, and
   customers.  Both the sNMS and field technicians use one of the sNEs
   to send notifications to the remaining interested parties.

   The sNMS and field technicians communicate failures and fixes via a
   message from the sNMS.  If there is a node failure (i.e., the sNE
   fails completely, for example due to a power failure), neither the
   sNMS nor field technicians are able to communicate with the sNE.
   Therefore, the sNMS and field technicians would use another sNE to
   communicate the failure.
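
   The decision order of Figure 8 and the choice of the sNE that relays
   notifications can be sketched as follows.  This is a minimal
   illustration: the function names, the boolean inputs, and the sNE
   identifiers are assumptions made for the example and are not part of
   the framework.

```python
from typing import Iterable, Optional


def fixing_entity(sne_can_fix: bool, snms_can_fix: bool) -> str:
    """Return which entity sets the Fix Time, following Figure 8."""
    if sne_can_fix:
        return "sNE"
    if snms_can_fix:
        return "sNMS"
    return "field technician"


def relay_sne(affected_sne: str,
              reachable_snes: Iterable[str],
              node_failure: bool) -> Optional[str]:
    """Pick the sNE used to send notifications to interested parties.

    On a node failure the affected sNE cannot be reached, so another
    reachable sNE is used.  None means no relay is available.
    """
    if not node_failure:
        return affected_sne
    for sne in reachable_snes:
        if sne != affected_sne:
            return sne
    return None
```

   For example, when the sNE cannot fix a failure but the sNMS can,
   fixing_entity(False, True) returns "sNMS"; if the affected node has
   also lost power, relay_sne selects a different reachable sNE to
   carry the notification.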


9.  Failure Fixing Hierarchy in Distributedly Managed Networks

   In a distributed architecture, the network is divided into
   sub-networks (i.e., regional networks), where each sub-network has
   its own sNMRn.

   The sNMRn provides all the centralized management functions for its
   own sub-network and informs the sNMS about its activities.

   End-to-end network-level monitoring and problem fixing beyond
   regional boundaries are left to the sNMS.  These activities can
   include Connection Admission Control (CAC), load balancing, and
   congestion control at the network level.

                                        +---------+
                                        | Failure |
                                        | in sNE  |
                                        +----+----+
                                             |
                                             |
                                          XXX+XX
                               No      XXXX     XXXXX     Yes
                           +--------+XXXis it locallyXXX+-+
                           |        XXX fixable by XXXX   |
                           |          XXXX sNE? XXXX      |
                         XX+XXX          XXXX XXX         |
               No   XXXXXX    XXXXXXX  Yes  +-------------+------------+
              +---+XX is it remotelyX+----+ |sNEs, sNMRs, sNMS, field  |
              |    XX fixable by   XX     | |technicians and users wait|
              |     X sNMR?    XXXXX      | |for estimated Fix Time for|
              |     XXX    XXXXX          | |notification from sNE     |
              |        XXXXX              | +--------------------------+
     No  XXXXXXXXXXXXXXXXXX     Yes     +-+----------------------+
   +---+XXis it remotelyXX+----------+  |sNMR sets Fix Time in   |
   |  XX  fixable by XXX             |  |notification and sends  |
   |   X  sNMS    XXXX               |  |the notification to     |
   |   XXXXXXXXXXXX                  |  | a sNE  to              |
   |                                 |  | communicate that to    |
   |                                 |  |other sNEs, sNMRs, sNMS |
   |                                 |  | and users              |
   |                                 |  +------------------------+
+--+----------------------+      +---+---------------------+
|Field technician sets fix|      |sNMS sets Fix Time in    |
|time in notification and |      |notification to a sNE to |
|initiates notification   |      |communicate it to other  |
|to sNEs, sNMRs, sNMS, and|      |sNEs, sNMRs, field       |
|users                    |      |technicians and users    |
+-------------------------+      +-------------------------+

      Figure 9: Fault Management Hierarchy for Self and Distributedly
                             Managed Networks
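
   The hierarchy of Figure 9 amounts to walking an ordered list of
   repairing entities until one of them can fix the failure.  The
   sketch below assumes a per-failure capability map as input; the tier
   labels mirror the figure, while the function name and the map are
   hypothetical.

```python
from typing import Dict, List

# Escalation order in a distributedly managed network (Figure 9).
TIERS = ("sNE", "sNMR", "sNMS", "field technician")


def escalation_path(can_fix: Dict[str, bool]) -> List[str]:
    """Return the tiers visited, ending at the one that sets Fix Time.

    Tiers missing from can_fix are treated as unable to fix.  The field
    technician is the last resort and always ends the path.
    """
    path: List[str] = []
    for tier in TIERS:
        path.append(tier)
        if tier == TIERS[-1] or can_fix.get(tier, False):
            break
    return path
```

   The last element of the returned path is the entity that sets the
   Fix Time and initiates the notification to the other parties.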


10.  Conclusion

   This document introduces the self-managed network concept for fault
   management, self-managed NE and self-managing NMS architectures, and
   a fault management communication mechanism for centrally and
   distributedly self-managed networks.  A hierarchy for fault
   management in these networks is described.


11.  Security Considerations

   It is expected that all sNEs, sNMS, and sNMRn are authenticated,
   manually or automatically, during network configuration.  If
   security mechanisms are established among sNEs, sNMS, and sNMRn for
   exchanging messages, they would also apply to the fault messages
   described here.  There is no need for additional security
   procedures for the fault management messages described here.


12.  IANA Considerations

   This document does not request any action from IANA.


13.  References

13.1.  Informative References


  [GANA]    ETSI GS AFI 002 V1.1.1 : Autonomic network engineering for
            the self-managing Future Internet; Generic Autonomic Network
            Architecture, 2013-04
  [SUSEREQ] ETSI GS AFI 001 V1.1.1  Group Specification Autonomic
            network engineering for the self-managing Future
            Internet (AFI); Scenarios, Use Cases and Requirements
            for Autonomic/Self-Managing Future Internet, 2011-06
  [SELFMAN] Keller, Alexander; et al. (Eds.), Self-Managed Networks,
            Systems, and Services Second IEEE International
            Workshops, SelfMan 2006, Dublin, Ireland, June 16, 2006,
            Proceedings
  [DPoE]    E. Malette and M. Hajduczenia, Automating provisioning of
            Demarcation Devices in DOCSIS Provisioning of EPON (DPoE),
            IEEE Comm. Magazine, September, 2012
  [SMN]     M. Toy, Self-Managed Networks, Comcast internal document,
            November, 2012.
  [SMCEN]   M. Toy, Self-Managed Carrier Ethernet Networks, April 2014,
            MEF Meeting in Budapest,
            self-managed-networks-comcast-mtoy.pdf,
            https://wiki.metroethernetforum.com/display/OWG/New+Work
  [Y.1731]  ITU-T Y.1731, OAM functions and mechanisms for Ethernet
            based networks, 2008


Author's Address

   Mehmet Toy
   Comcast
   1800 Bishops Gate Blvd.
   Mount Laurel, NJ  08054
   USA

   Email: mehmet_toy@cable.comcast.com