Internet DRAFT - draft-feng-opsawg-incident-management
draft-feng-opsawg-incident-management
OPSAWG C. Feng, Ed.
Internet-Draft
Intended status: Standards Track T. Hu
Expires: 2 August 2024 CMCC
LM. Contreras
Telefonica I+D
T. Graf
Swisscom
Q. Wu
C. Yu
Huawei
N. Davis
Ciena
30 January 2024
Incident Management for Network Services
draft-feng-opsawg-incident-management-04
Abstract
A network incident refers to an unexpected interruption of a network
service, degradation of a network service quality, or sub-health of a
network service. Different data sources including alarms, metrics
and other anomaly information can be aggregated into few amount of
network incidents by data correlation analysis and the service impact
analysis.
This document defines YANG Modules for the network incident lifecycle
management. The YANG modules are meant to provide a standard way to
report, diagnose, and resolve network incidents for the sake of
network service health and root cause analysis.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
Feng, et al. Expires 2 August 2024 [Page 1]
Internet-Draft Network Incident Management January 2024
This Internet-Draft will expire on 2 August 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. Sample Use Cases . . . . . . . . . . . . . . . . . . . . . . 5
3.1. Incident-Based Trouble Tickets dispatching . . . . . . . 5
3.2. Incident Derivation from L3VPN services Unavailability . 6
3.3. Multi-layer Fault Demarcation . . . . . . . . . . . . . . 6
3.4. Security events Automated Noise reduction based on
Situation awareness . . . . . . . . . . . . . . . . . . . 7
4. Incident Management Architecture . . . . . . . . . . . . . . 8
4.1. Interworking with Alarm Management . . . . . . . . . . . 9
4.2. Interworking with SAIN . . . . . . . . . . . . . . . . . 11
4.3. Relationship with RFC8969 . . . . . . . . . . . . . . . . 12
4.4. Relationship with Trace Context . . . . . . . . . . . . . 12
5. Functional Interface Requirements between the Client and the
Server . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.1. Incident Identification . . . . . . . . . . . . . . . . . 12
5.2. Incident Diagnosis . . . . . . . . . . . . . . . . . . . 15
5.3. Incident Resolution . . . . . . . . . . . . . . . . . . . 16
6. Incident Data Model Concepts . . . . . . . . . . . . . . . . 16
6.1. Identifying the Incident Instance . . . . . . . . . . . . 16
6.2. The Incident Lifecycle . . . . . . . . . . . . . . . . . 16
6.2.1. Incident Instance Lifecycle . . . . . . . . . . . . . 16
6.2.2. Operator Incident Lifecycle . . . . . . . . . . . . . 17
7. Incident Data Model Design . . . . . . . . . . . . . . . . . 17
7.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 17
7.2. Incident Notifications . . . . . . . . . . . . . . . . . 18
7.3. Incident Acknowledge . . . . . . . . . . . . . . . . . . 20
7.4. Incident Diagnose . . . . . . . . . . . . . . . . . . . . 20
7.5. Incident Resolution . . . . . . . . . . . . . . . . . . . 20
8. Incident Management YANG Module . . . . . . . . . . . . . . . 21
Feng, et al. Expires 2 August 2024 [Page 2]
Internet-Draft Network Incident Management January 2024
9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 33
9.1. The "IETF XML" Registry . . . . . . . . . . . . . . . . . 33
9.2. The "YANG Module Names" Registry . . . . . . . . . . . . 33
10. Security Considerations . . . . . . . . . . . . . . . . . . . 34
11. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 34
12. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 35
13. References . . . . . . . . . . . . . . . . . . . . . . . . . 35
13.1. Normative References . . . . . . . . . . . . . . . . . . 35
13.2. Informative References . . . . . . . . . . . . . . . . . 36
Appendix A. Changes between revisions . . . . . . . . . . . . . 37
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 39
1. Introduction
[RFC8969] defines a framework for Automating Service and Network
Management with YANG to full life cycle network management. A set of
YANG data models have already been developed in IETF for Network
performance monitoring and fault monitoring,e.g.,A YANG [RFC7950]
data model for alarm management [RFC8632] defines a standard
interface for alarm management. A data model for Network and VPN
Service Performance Monitoring[RFC9375] defines a standard interface
for network performance management. In addition, distributed tracing
mechanism defined in [W3C-Trace-Context] can also be used to analyze
and debug operations, such as configuration transactions, across
multiple distributed systems.
However these YANG data models for network maintenance are based on
specific data source information and manage alarms and performance
metrics data separately by different layers in various different
management systems. In addition, the frequency and quantity of
alarms and performance metrics data reported to Operating Support
System (OSS) are increased dramatically (in many cases multiple
orders of magnitude) with the growth of service types and complexity
and greatly overwhelm OSS platforms; with existing known dependency
relation between fault, alarm and events at each layer (e.g.,packet
layer or optical layer), the traditional solutions, e.g., data
compression are time-consuming and labor-intensive, usually rely on
maintenance engineers' experience for data analysis, which result in
low processing efficiency, inaccurate root cause identification and
duplicated tickets. In addition, it is also difficult to assess the
impact of alarms, performance metrics and other anomaly data on
network services without known relation across layer of the network
topology data or the relation with other network topology data.
To address these challenges, a network wide incident-centric solution
is proposed to establish dependency relation with both network
service and network topology at different layers , which not only can
be used at specific layer in specific domain but also can be used to
Feng, et al. Expires 2 August 2024 [Page 3]
Internet-Draft Network Incident Management January 2024
span across layer for multi-layer network troubleshooting. A network
incident refers to an unexpected interruption of a network service,
degradation of a network service quality, or sub-health of a network
service [TMF724A]. Different data sources including alarms, metrics
and other anomaly information can be aggregated into few amount of
incidents irrespective layer by correlation analysis and the service
impact analysis. For example, the protocols related to the interface
fail to work properly due to Service Level Objective (SLO) violation,
large amount of alarms may be reported to upper layer management
system and aggregated into one or a few incidents when some network
services may be affected by this incident (e.g. L3VPN services
related with the interface will become unavailable
[I-D.ietf-ippm-pam] ). An incident may also be raised through the
analysis of some network performance metrics, for example, as
described in SAIN [RFC9417] , network services can be decomposed to
several sub-services, specific metrics are monitored for each sub-
service, symptoms will occur if services/sub-services are unhealthy
(after analyzing metrics), these symptoms may raise one incident when
it causes degradation of the network services.
In addition, Artificial Intelligence (AI) and Machine Learning (ML)
play a important role in the processing of large amounts of data with
complex data correlations. For example, Neural Network Algorithm or
Hierarchy Aggregation Algorithm can be used to replace manual alarm
data correlation. Through online and offline learning, these
algorithms can be continuously optimized to improve the efficiency of
fault diagnosis.
This document defines the concepts, requirements, and architecture of
incident management. The document also defines a YANG data model for
incident lifecycle management, which improves troubleshooting
efficiency, ensures network service quality, and improves network
automation [RFC8969].
2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
The following terms are defined in [RFC8632] and are not redefined
here:
* alarm
The following terms are defined in this document:
Feng, et al. Expires 2 August 2024 [Page 4]
Internet-Draft Network Incident Management January 2024
Incident: An unexpected interruption of a network service,
degradation of network service quality, or sub-health of a network
service [TMF724A].
Problem: The cause of one or more incidents. The cause is not
usually known when a problem record is created, and the problem
management process is responsible for further investigation
[TMF724A].
Incident management: Lifecycle management of incidents including
incident identification, reporting, acknowledge, diagnosis, and
resolution.
Incident management system: An entity which implements incident
management. It include incident management server and incident
management client.
Incident management server: An entity which provides some functions
of incident management. For example, it can detect an incident,
perform incident diagnosis, resolution and prediction,etc.
Incident management client: An entity which can manage incidents.
For example, it can receive incident notifications, query the
information of incidents, instruct the incident management server
to diagnose, resolve, etc.
3. Sample Use Cases
3.1. Incident-Based Trouble Tickets dispatching
Traditionally, the dispatching of trouble tickets is mostly based on
alarms data analysis and need to involve operators' maintenance
engineers. These operators' maintenance engineers are able to
monitor and detect that alarms at both end devices of specific
network tunnel or at both optical layer and IP layer which are
associated with the same network fault. Therefore, they can
correlate these alarms to the same trouble ticket, which is in the
low automation. If there are more alarms, then the human costs for
network maintenance are increased accordingly.
Some operators preconfigure whitelist and adopt some coarse
granularity data correlation rules for the alarm management. It
seems to improve fault management automation. However, some trouble
tickets could be missed if the filtering conditions are too strict.
If the filtering conditions are not strict, it might end up with
multiple trouble tickets being dispatched to the same network fault.
Feng, et al. Expires 2 August 2024 [Page 5]
Internet-Draft Network Incident Management January 2024
It is hard to achieve a perfect balance between the network
management automation and duplicated trouble tickets under the
traditional working situations. However, with the help of the
incident management, massive alarms can be aggregated into a few
network incidents based on service impact analysis, the number of
trouble tickets will be greatly reduced. At the same time, the
efficiency of network troubleshooting can be largely improved. which
address the pain point of traditional trouble ticket dispatching.
3.2. Incident Derivation from L3VPN services Unavailability
The service attachment points defined in [RFC9408] represent the
network reference points where network services can be delivered to
customers.
SLOs can be used to characterize the ability of a particular set of
nodes to communicate according to certain measurable expectations
[I-D.ietf-ippm-pam]. For example, an SLA might state that any given
SLO applies to at least a certain percentage of packets, allowing for
a certain level of packet loss and exceeding packet delay threshold
to take place. An SLA might establish a multi-tiered SLO of end to
end latency as follows:
* not to exceed 30 ms for any packet;
* not to exceed 25 ms for 99.999% of packets;
* not to exceed 20 ms for 99% of packets.
This SLA information can be bound with two or multiple service
attachment point defined in [RFC9408], so that the service
orchestration layer can use these interfaces to commit the delivery
of a service on specific point to point service topology or point to
multi-point topology. Upon specific levels of a threshold of an SLO
is violated, a specific network incident, associated with,let's say
L3VPN service will be derived.
3.3. Multi-layer Fault Demarcation
When a fault occurs in a network that contains both packet-layer
devices and optical-layer devices, it may cause correlative faults in
both layers, i.e., packet layer and optical layer. Specifically,
fault propagation could be classified into three typical types.
First, fault occurs at a packet-layer device will further cause fault
(e.g.,WDM (wavelength division multiplexing) client fault) at an
optical-layer device. Second, fault occurs at an optical-layer
device will further cause fault (e.g., L3 link down) at a packet-
layer device. Third, fault occurs at the inter-layer link between a
Feng, et al. Expires 2 August 2024 [Page 6]
Internet-Draft Network Incident Management January 2024
packet-layer device and an optical-layer device will further cause
faults at both devices. Traditionally, multiple operation teams are
needed to first analyze huge amount of alarms (triggered by the above
mentioned faults) from single network layer independently, then
cooperate to locate the root cause through manually analyzing multi-
layer topology data and service data, thus fault demarcation becomes
more complex and time-consuming in multi-layer scenario than in
single-layer scenario.
With the help of incident management, the management systems first
automatically analyze root cause of the alarms at each single network
layer and report corresponding incidents to the upper layer, then
multi-layer management system comprehensively analyzes the topology
relationship and service relationship between the root causes of both
layers. The inner relationship among the alarms will be identified
and finally the root cause will be located among multiple layers. By
cooperating with the integrated Optical time-domain reflectometer
(OTDR) within the network device, we can determine the target optical
exchange station before site visits. Therefore, the overall fault
demarcation process is simplified and automated, the analyze result
could be reported and visualized in time. In this case, operation
teams only have to confirm the analyze result and dispatch site
engineers to perform relative maintenance actions (e.g., splice
fiber) based on the root cause.
3.4. Security events Automated Noise reduction based on Situation
awareness
In the continuous data driven monitoring, tools used by the Security
Operation Center (SoC) scan the network 24/7 to flag any
abnormalities or suspicious activities. As the SoC adds more tools
for security events detection, the volume of security events or
alerts grows continually. the overwhelming number of threat alerts
can cause threat fatigue. In addition, many of these alerts do not
provide sufficient intelligence, context to investigate, or are false
positives. False positives not only drain time and resources, but
can also distract teams from real incidents.
Feng, et al. Expires 2 August 2024 [Page 7]
Internet-Draft Network Incident Management January 2024
With the help of the incident management, BERT(Bidirectional Encoder
Representations from Transformers)[BERT] classifier can be adopted to
analyses the suspicious activity and understands the significance of
the gathered data (through both facts and inferences) and help
operation and maintenance engineers focus on handling important
security events and avoid wasting resources on false alerts, e.g.,
automatically determine whether 10,000 network security events are
real incidents and give reasonable explanations. Progressively,
Llama interpreter can be used to explain the reason why such alerts
are picked out and marked significant, what could be the potential
security implications that exist yet remain undiscovered.
4. Incident Management Architecture
+----------------------+-------------------+
| |
| Incident Management Client |
| |
| |
+------------+---------+---------+---------+
^ | | |
|Incident |Incident |Incident |Incident
|Report |Ack |Diagnose |Resolve
| | | |
| V V V
+--+-------------------+---------+----------+
| |
| |
| Incident Management Server |
| |
| |
| |
| |
+-------------------------------+-----------+
^ ^Abnormal ^
|Alarm |Operations |Metrics
|Report |Report |/Telemetry
| | V
+------------+-------+-----------------+----------------+
| |
| Network |
| |
+------------------------------------+------------------+
Figure 1: Incident Management Architecture
Feng, et al. Expires 2 August 2024 [Page 8]
Internet-Draft Network Incident Management January 2024
Figure 1 illustrates the incident management architecture. Two key
components for the incident management are incident management client
and incident management server.
Incident management server can be deployed in network analytics
platform, controllers and provides functionalities such as incident
identification, report, diagnosis, resolution, querying for incident
lifecycle management.
Incident management client can be deployed in the network OSS or
other business systems of operators and invokes the functionalities
provided by incident management server to meet the business
requirements of fault management.
A typical workflow of incident management is as follows:
* Some alarms or abnormal operations, network performance metrics
are reported from the network. Incident management server
receives these alarms/abnormal operations/metrics and try to
analyze the correlation of them, if the incidents are identified,
it will be reported to the client. The impact of network services
will be also analyzed and will update the incident.
* Incident management client receives the incident raised by server,
and acknowledge it. Client may invoke the 'incident diagnose' rpc
to diagnose this incident to find the root causes.
* If the root causes have been found, the client can resolve this
incident by invoking the 'incident resolve' rpc operation,
dispatching a ticket or using other functions (e.g. routing
calculation,configuration)
4.1. Interworking with Alarm Management
Feng, et al. Expires 2 August 2024 [Page 9]
Internet-Draft Network Incident Management January 2024
+-----------------------------+
| OSS |
|+-------+ +-----------+ |
||alarm | | incident | |
||handler| | client | |
|+-------+ +-----------+ |
+---^---------------^---------+
| |
|alarm |incident
+---|---------------|---------+
| | controller | |
| | | |
|+--+---++ +-----------+ |
||alarm | | | |
||process+----->| incident | |
|| |alarm | server | |
|+------++ +-----------+ |
| ^ ^ |
+---+--------------|----------+
|alarm | metrics/trace/etc.
| |
+---+--------------+----------+
| network |
| |
+-----------------------------+
Figure 2: Interworking with alarm management
YANG model for the alarm management[RFC8632] defines a standard
interface to manage the lifecycle of alarms. Alarms represent the
undesirable state of network resources, alarm data model also defines
the root causes and impacted services fields, but there may lack
sufficient information to determine them in lower layer system
(mainly in devices level), so alarms do not always tell the status of
services or the root causes. As described in [RFC8632], alarm
management act as a starting point for high-level fault management.
While incident management often works at the network level, so it's
possible to have enough information to perform correlation and
service impact analysis. Alarms can work as one of data sources of
incident management and may be aggregated into few amount of
incidents by correlation analysis, network service impact and root
causes may be determined during incident process.
Incident also contains some related alarms,if needed users can query
the information of alarms by alarm management interface [RFC8632].
In some cases, e.g. cutover scenario, incident server may use alarm
management interface [RFC8632] to shelve some alarms.
Feng, et al. Expires 2 August 2024 [Page 10]
Internet-Draft Network Incident Management January 2024
Alarm management may keep the original process, alarms are reported
from network to network controller or analytics and then reported to
upper layer system(e.g. OSS). Upper layer system may store these
alarms and provide the information for fault analysis (e.g. deeper
analysis based on incident).
Compared with alarm management, incident management provides not only
incident reporting but also diagnosis and resolution functions, it's
possible to support self-healing and may be helpful for single-domain
closed-loop control.
Incident management is not a substitute for alarm management.
Instead, they can work together to implement fault management.
4.2. Interworking with SAIN
+----------------+
| incident client|
+----------------+
^
|incident
+-------+--------+
|incident server |
+----------------+
^
|symptoms
+-------+--------+
| SAIN |
| |
+----------------+
^
|metrics
+-------------+-------------+
| |
| network |
| |
+---------------------------+
Figure 3: Interworking with SAIN
SAIN [RFC9417] defines the architecture of network service assurance.
A network service can be decomposed into some sub-services, and some
metrics can be monitored for sub-services. For example, a tunnel
service can be decomposed into some peer tunnel interface sub-
services and IP connectivity sub-service. If some metrics are
evaluated to indicate unhealthy for specific sub-service, some
symptoms will be present. Incident server may identify the incident
Feng, et al. Expires 2 August 2024 [Page 11]
Internet-Draft Network Incident Management January 2024
based on symptoms, and then report it to upper layer system. So,
SAIN can be one way to identify incident, services, sub-services and
metrics can be preconfigured via APIs defined by service assurance
YANG model [RFC9418] and incident will be reported if symptoms match
the condition of incident.
4.3. Relationship with RFC8969
[RFC8969] defines a framework for network automation using YANG, this
framework breaks down YANG modules into three layers, service layer,
network layer and device layer, and contains service deployment,
service optimization/assurance, and service diagnosis. Incident
works at the network layer and aggregates alarms, metrics and other
information from device layer, it's helpful to provfide service
assurance. And the incident diagnosis may be one way of service
diagnosis.
4.4. Relationship with Trace Context
W3C defines a common trace context[W3C-Trace-Context] for distributed
system tracing, [I-D.rogaglia-netconf-trace-ctx-extension] defines a
netconf extension for [W3C-Trace-Context] and
[I-D.quilbeuf-opsawg-configuration-tracing] defines a mechanism for
configuration tracing. If some errors occur when services are
deploying, it's very easy to identify these errors by distributed
system tracing, and an incident should be reported.
5. Functional Interface Requirements between the Client and the Server
5.1. Incident Identification
Feng, et al. Expires 2 August 2024 [Page 12]
Internet-Draft Network Incident Management January 2024
+--------------+
+--| Incident1 |
| +--+-----------+
| | +-----------+
| +--+ alarm1 |
| | +-----------+
| |
| | +-----------+
| +--+ alarm2 |
| | +-----------+
| |
| | +-----------+
| +--+ alarm3 |
| +-----------+
| +--------------+
+--| Incident2 |
| +--+-----------+
| | +-----------+
| +--+ metric1 |
| | +-----------+
| | +-----------+
| +--+ metric2 |
| +-----------+
|
| +--------------+
+--| Incident3 |
+--+-----------+
| +-----------+
+--+ alarm1 |
| +-----------+
|
| +-----------+
+--| metric1 |
+-----------+
Figure 4: Incident Identification
As described in Figure 4, multiple alarms, metrics, or hybrid can be
aggregated into an incident after analysis.
The incident management server MUST be capable of identifying
incidents. Multiple alarms, metrics and other information are
reported to incident server, and the server must analyze it and find
out the correlations of them, if the correlation match the incident
rules, incident will be identified and reported to the client.
Service impact analysis will be performed if an indent is identified,
and the content of incident will be updated if impacted network
services are detected.
Feng, et al. Expires 2 August 2024 [Page 13]
Internet-Draft Network Incident Management January 2024
AI/ML may be used to identify the incident. Expert system and online
learning can help AI to identify the correlation of alarms, metrics
and other information by time-base correlation algorithm, topo-based
correlation algorithm, etc. For example, if interface is down, then
many protocol alarms will be reported, AI will think these alarms
have some correlations. These correlations will be put into
knowledge base, and the incident will be identified faster according
to knowledge base next time.
As mentioned above, SAIN is another way to implement incident
identification. Observation timestamp defined in
[I-D.tgraf-yang-push-observation-time] and trace context defined in
[W3C-Trace-Context] may be helpful for incident identification.
+----------------------+
| |
| Orchestrator |
| |
+----+-----------------+
^VPN A Unavailable
|
+---+----------------+
| |
| Controller |
| |
| |
+-+-+-+-----+--+-----+
^ ^ ^
IGP | |Interface |IGP Peer
Down | |Down | Abnormal
| | |
VPN A | | |
+-----------------+-+------------+------------------*
| \ +---+ ++-++ +-+-+ +---+ /|
| \ | | | | | | | | / |
| \|PE1+-------| P1+X--------|P2 +--------|PE2|/ |
| +---+ +---+ +---+ +---+ |
+---------------------------------------------------+
Figure 5: Example 1 of Incident Identification
As described in Figure 5, vpn a is deployed from PE1 to PE2, if a
interface of P1 is going down, many alarms are triggered, such as
interface down, igp down, and igp peer abnormal from P2. These
alarms are aggregated and analyzed by controller, and the incident
'vpn unavailable' is triggered by the controller.
Feng, et al. Expires 2 August 2024 [Page 14]
Internet-Draft Network Incident Management January 2024
+----------------------+
| |
| Orchestrator |
| |
+----+-----------------+
^VPN A Degradation
|
+---+----------------+
| |
| controller |
| |
| |
+-+-+-+-----+--+-----+
^ ^
|Packet |Path Delay
|Loss |
| |
VPN A | |
+-------------------+------------+-------------------+
| \ +---+ ++-++ +-+-+ +---+ / |
| \ | | | | | | | | / |
| \|PE1+-------|P1 +---------|P2 +--------|PE2|/ |
| +---+ +---+ +---+ +---+ |
+----------------------------------------------------+
Figure 6: Example 2 of Incident Identification
As described in Figure 6, controller collect the network metrics from
network elements, it finds the packet loss of P1 and the path delay
of P2 exceed the thresholds, an incident 'VPN A degradation' may be
triggered after service impact analysis.
5.2. Incident Diagnosis
After an incident is reported to the incident management client, the
client MAY diagnose the incident to determine the root cause. Some
diagnosis operations may affect the running network services. The
client can choose not to perform that diagnosis operation after
determining the impact is not trivial. The incident management
server can also perform self-diagnosis. However, the self-diagnosis
MUST not affect the running network services. Possible diagnosis
methods include link reachability detection, link quality detection,
alarm/log analysis, and short-term fine-grained monitoring of network
quality metrics, etc.
Feng, et al. Expires 2 August 2024 [Page 15]
Internet-Draft Network Incident Management January 2024
5.3. Incident Resolution
After the root cause is diagnosed, the client MAY resolve the
incident. The client MAY choose resolve the incident by invoking
other functions, such as routing calculation function, configuration
function, dispatching a ticket or asking the server to resolve it.
Generally, the client would attempt to directly resolve the root
cause. If the root cause cannot be resolved, an alternative solution
SHOULD be required. For example, if an incident caused by a physical
component failure, it cannot be automatically resolved, the standby
link can be used to bypass the faulty component.
Incident server will monitor the status of incident, if the faults
are fixed, the server will update the status of incident to
'cleared', and report the updated incident to the client.
Incident resolution may affect the running network services. The
client can choose not to perform those operations after determining
the impact is not trivial.
6. Incident Data Model Concepts
6.1. Identifying the Incident Instance
An incident ID is used as an identifier of an incident instance, if
an incident instance is identified, a new incident ID is created.
The incident ID MUST be unique in the whole system.
6.2. The Incident Lifecycle
6.2.1. Incident Instance Lifecycle
From an incident instance perspective, an incident can have the
following lifecycle: 'raised', 'updated', 'cleared'. When an
incident is generated, the status is 'raised'. If the status changes
after the incident is generated, (for example, self-diagnosis,
diagnosis command issued by the client, or any other condition causes
the status to change but does not reach the 'cleared' level.) , the
status changes to 'updated'. When an incident is successfully
resolved, the status changes to 'cleared'.
Feng, et al. Expires 2 August 2024 [Page 16]
Internet-Draft Network Incident Management January 2024
6.2.2. Operator Incident Lifecycle
From an operator perspective, the lifecycle of an incident instance
includes 'acknowledged', 'diagnosed', and 'resolved'. When an
incident instance is generated, the operator SHOULD acknowledge the
incident. And then the operator attempts to diagnose the incident
(for example, find out the root cause and affected components).
Diagnosis is not mandatory. If the root cause and affected
components are known when the incident is generated, diagnosis is not
required. After locating the root cause and affected components,
operator can try to resolve the incident.
7. Incident Data Model Design
7.1. Overview
There are two YANG modules in the model. The first module, "ietf-
incident-type", provides common definitions such as incident domain,
incident category, incident priority. The second module, "ietf-
incident", defines technology independent abstraction of network
incident construct for alarm, log, performance metrics, etc. The
information reported in the incident include Root cause,
priority,impact, suggestion, etc. At the top of "ietf-incident"
module is the Network Incident. Network incident is represented as a
list and indexed by "incident-id". Each Network Incident is
associated with a service instance, domain and sources. Under
sources, there is one or more sources. Each source corresponds to
node defined in the network topology model and network resource in
the network device,e.g., interface. In addition, "ietf-incident"
support one general notification to report incident state changes and
three rpcs to manage the network incidents.
Feng, et al. Expires 2 August 2024 [Page 17]
Internet-Draft Network Incident Management January 2024
module: ietf-incident
+--ro incidents
+--ro incident* [incident-id]
+--ro incident-id string
+--ro csn? uint64
+--ro service-instance* string
+--ro name? string
+--ro type? enumeration
+--ro domain? identityref
+--ro priority? int:incident-priority
+--ro status? enumeration
+--ro ack-status? enumeration
+--ro category? identityref
+--ro detail? string
+--ro resolve-advice? string
+--ro sources
...
+--ro root-causes
...
+--ro root-events
...
+--ro events
...
+--ro raise-time? yang:date-and-time
+--ro occur-time? yang:date-and-time
+--ro clear-time? yang:date-and-time
+--ro ack-time? yang:date-and-time
+--ro last-updated? yang:date-and-time
rpcs:
+---x incident-acknowledge
...
+---x incident-diagnose
...
+---x incident-resolve
notifications:
+---n incident-notification
+--ro incident-id?
-> /inc:incidents/inc:incident/inc:incident-id
...
+--ro time? yang:date-and-time
7.2. Incident Notifications
Feng, et al. Expires 2 August 2024 [Page 18]
Internet-Draft Network Incident Management January 2024
notifications:
+---n incident-notification
+--ro incident-id?
-> /inc:incidents/inc:incident/inc:incident-id
+--ro csn? uint64
+--ro service-instance* string
+--ro name? string
+--ro type? enumeration
+--ro domain? identityref
+--ro priority? int:incident-priority
+--ro status? enumeration
+--ro ack-status? enumeration
+--ro category? identityref
+--ro detail? string
+--ro resolve-advice? string
+--ro sources
| +--ro source* [node]
| +--ro node -> /nw:networks/nw:network/nw:node/nw:node-id
| +--ro resource* [name]
| +--ro name al:resource
+--ro root-causes
| +--ro root-cause* [node]
| +--ro node -> /nw:networks/nw:network/nw:node/nw:node-id
| +--ro resource* [name]
| | +--ro name al:resource
| | +--ro cause-name? string
| | +--ro detail? string
| +--ro cause-name? string
| +--ro detail? string
+--ro root-events
| +--ro root-event* [type event-id]
| +--ro type -> ../../../events/event/type
| +--ro event-id leafref
+--ro events
| +--ro event* [type event-id]
| +--ro type enumeration
| +--ro event-id string
| +--ro (event-type-info)?
| +--:(alarm)
| | +--ro alarm
| | +--ro resource? leafref
| | +--ro alarm-type-id? leafref
| | +--ro alarm-type-qualifier? leafref
| +--:(notification)
| +--:(log)
| +--:(KPI)
| +--:(unknown)
+--ro time? yang:date-and-time
Feng, et al. Expires 2 August 2024 [Page 19]
Internet-Draft Network Incident Management January 2024
A general notification, incident-notification, is provided here.
When an incident instance is identified, the notification will be
sent. After a notification is generated, if the incident management
server performs self diagnosis or the client uses the interfaces
provided by the incident management server to deliver diagnosis and
resolution actions, the notification update behavior is triggered,
for example, the root cause objects and affected objects are updated.
When an incident is successfully resolved, the status of the incident
would be set to 'cleared'.
7.3. Incident Acknowledge
+---x incident-acknowledge
| +---w input
| | +---w incident-id*
| | -> /inc:incidents/inc:incident/inc:incident-id
After an incident is generated, updated, or cleared, (In some
scenarios where automatic diagnosis and resolution are supported, the
status of an incident may be updated multiple times or even
automatically resolved.) The operator needs to confirm the incident
to ensure that the client knows the incident.
The incident-acknowledge rpc can confirm multiple incidents at a time
7.4. Incident Diagnose
+---x incident-diagnose
| +---w input
| | +---w incident-id*
| | -> /inc:incidents/inc:incident/inc:incident-id
After an incident is generated, incident diagnose rpc can be used to
diagnose the incident and locate the root causes. Diagnosis can be
performed on some detection tasks, such as BFD detection, flow
detection, telemetry collection, short-term threshold alarm,
configuration error check, or test packet injection.
After the diagnosis is performed, a incident update notification will
be triggered to report the latest status of the incident.
7.5. Incident Resolution
+---x incident-resolve
+---w input
| +---w incident-id*
| -> /inc:incidents/inc:incident/inc:incident-id
Feng, et al. Expires 2 August 2024 [Page 20]
Internet-Draft Network Incident Management January 2024
After the root causes and impacts are determined, incident-resolve
rpc can be used to resolve the incident (if the server can resolve
it). How to resolve an incident instance is out of the scope of this
document.
Incident resolve rpc allows multiple incident instances to be
resolved at a time. If an incident instance is successfully
resolved, a notification will be triggered to update the incident
status to 'cleared'. If the incident content is changed during this
process, a notification update will be triggered.
8. Incident Management YANG Module
<CODE BEGINS>
file="ietf-incident-types@2023-05-16.yang"
module ietf-incident-types {
yang-version "1.1";
namespace "urn:ietf:params:xml:ns:yang:ietf-incident-types";
prefix "int";
import ietf-network {
prefix nw;
reference
"RFC 8345: A YANG Data Model for Network Topologies";
}
organization
"IETF OPSAWG Working Group";
contact
"WG Web: <https://datatracker.ietf.org/wg/opsawg/>;
WG List: <mailto:opsawg@ietf.org>
Author: Chong Feng <mailto:mailto:fengchonglly@gmail.com>
Author: Tong Hu <mailto:hutong@cmhi.chinamobile.com>
Author: Luis Miguel Contreras Murillo <mailto:
luismiguel.contrerasmurillo@telefonica.com>
Author : Thomas Graf <mailto:thomas.graf@swisscom.com>
Author : Qin Wu <mailto:bill.wu@huawei.com>
Author: Chaode Yu <mailto:yuchaode@huawei.com>
Author: Nigel Davis <mailto:ndavis@ciena.com>";
description
"This module defines the identities and typedefs for
incident management.
Copyright (c) 2022 IETF Trust and the persons identified as
authors of the code. All rights reserved.
Redistribution and use in source and binary forms, with or
Feng, et al. Expires 2 August 2024 [Page 21]
Internet-Draft Network Incident Management January 2024
without modification, is permitted pursuant to, and subject
to the license terms contained in, the Revised BSD License
set forth in Section 4.c of the IETF Trust's Legal Provisions
Relating to IETF Documents
(https://trustee.ietf.org/license-info).
This version of this YANG module is part of RFC XXXX; see the
RFC itself for full legal notices. ";
revision 2023-05-16 {
description "initial version";
reference "RFC XXX: Yang module for incident management.";
}
//identities
identity incident-domain {
description "The abstract identity to indicate the domain of
an incident.";
}
identity single-domain {
base incident-domain;
description "single domain.";
}
identity access {
base single-domain;
description "access domain.";
}
identity ran {
base access;
description "radio access network domain.";
}
identity transport {
base single-domain;
description "transport domain.";
}
identity otn {
base transport;
description "optical transport network domain.";
}
identity ip {
base transport;
description "ip transport network domain.";
}
identity ptn {
base ip;
description "packet transport network domain.";
}
identity cross-domain {
base incident-domain;
description "cross domain.";
Feng, et al. Expires 2 August 2024 [Page 22]
Internet-Draft Network Incident Management January 2024
}
identity incident-category {
description "The abstract identity for incident category.";
}
identity device {
base incident-category;
description "device category.";
}
identity power-environment {
base device;
description "power system category.";
}
identity device-hardware {
base device;
description "hardware of device category.";
}
identity device-software {
base device;
description "software of device category";
}
identity line-card {
base device-hardware;
description "line card category.";
}
identity maintenance {
base incident-category;
description "maintenance category.";
}
identity network {
base incident-category;
description "network category.";
}
identity protocol {
base incident-category;
description "protocol category.";
}
identity overlay {
base incident-category;
description "overlay category";
}
identity vm {
base incident-category;
description "vm category.";
}
//typedefs
typedef incident-priority {
type enumeration {
Feng, et al. Expires 2 August 2024 [Page 23]
Internet-Draft Network Incident Management January 2024
enum critical {
description "the incident MUST be handled immediately.";
}
enum high {
description "the incident should be handled as soon as
possible.";
}
enum medium {
description "network services are not affected, or the
services are slightly affected,but corrective
measures need to be taken.";
}
enum low {
description "potential or imminent service-affecting
incidents are detected,but services are
not affected currently.";
}
}
description "define the priority of incident.";
}
typedef node-ref {
type leafref {
path "/nw:networks/nw:network/nw:node/nw:node-id";
}
description "reference a network node.";
}
}
<CODE ENDS>
<CODE BEGINS>
file="ietf-incident@2023-05-16.yang"
module ietf-incident {
yang-version 1.1;
namespace "urn:ietf:params:xml:ns:yang:ietf-incident";
prefix inc;
import ietf-yang-types {
prefix yang;
reference
"RFC 6991: Common YANG Data Types";
}
import ietf-alarms {
prefix al;
reference
"RFC 8632: A YANG Data Model for Alarm Management";
}
import ietf-incident-types {
Feng, et al. Expires 2 August 2024 [Page 24]
Internet-Draft Network Incident Management January 2024
prefix int;
reference
"draft-feng-opsawg-incident-management: Incident
Management for Network Services";
}
organization
"IETF OPSAWG Working Group";
contact
"WG Web: <https://datatracker.ietf.org/wg/opsawg/>;
WG List: <mailto:opsawg@ietf.org>
Author: Chong Feng <mailto:fengchonglly@gmail.com>
Author: Tong Hu <mailto:hutong@cmhi.chinamobile.com>
Author: Luis Miguel Contreras Murillo <mailto:
luismiguel.contrerasmurillo@telefonica.com>
Author: Thomas Graf <mailto:thomas.graf@swisscom.com>
Author : Qin Wu <mailto:bill.wu@huawei.com>
Author: Chaode Yu <mailto:yuchaode@huawei.com>
Author: Nigel Davis <mailto:ndavis@ciena.com>";
description
"This module defines the interfaces for incident management
lifecycle.
This module is intended for the following use cases:
* incident lifecycle management:
- incident report: report incident instance to client
when an incident instance is detected.
- incident acknowledge: acknowledge an incident instance.
- incident diagnose: diagnose an incident instance.
- incident resolve: resolve an incident instance.
Copyright (c) 2022 IETF Trust and the persons identified as
authors of the code. All rights reserved.
Redistribution and use in source and binary forms, with or
without modification, is permitted pursuant to, and subject
to the license terms contained in, the Revised BSD License
set forth in Section 4.c of the IETF Trust's Legal Provisions
Relating to IETF Documents
(https://trustee.ietf.org/license-info).
This version of this YANG module is part of RFC XXXX; see the
RFC itself for full legal notices. ";
revision 2023-05-16 {
description "remove identies and typedefs to independent yang
module. update some definitions of data model.";
reference "RFC XXX: Yang module for incident management.";
}
revision 2023-03-13 {
Feng, et al. Expires 2 August 2024 [Page 25]
Internet-Draft Network Incident Management January 2024
description "initial version";
reference "RFC XXX: Yang module for incident management.";
}
//groupings
grouping resources-info {
description "the grouping which defines the network
resources of a node.";
leaf node {
type int:node-ref;
description "reference to a network node.";
}
list resource {
key name;
description "the resources of a network node.";
leaf name {
type al:resource;
description "network resource name.";
}
}
}
grouping incident-time-info {
description "the grouping defines incident time information.";
leaf raise-time {
type yang:date-and-time;
description "the time when an incident instance is raised.";
}
leaf occur-time {
type yang:date-and-time;
description "the time when an incident instance occurs.
It's the occurrence time of the first event during
incident detection.";
}
leaf clear-time {
type yang:date-and-time;
description "the time when an incident instance is
resolved.";
}
leaf ack-time {
type yang:date-and-time;
description "the time when an incident instance is
acknowledged.";
}
leaf last-updated {
type yang:date-and-time;
description "the latest time when an incident instance is
updated";
Feng, et al. Expires 2 August 2024 [Page 26]
Internet-Draft Network Incident Management January 2024
}
}
grouping incident-info {
description "the grouping defines the information of an
incident.";
leaf csn {
type uint64;
mandatory true;
description "The sequence number of the incident instance.";
}
leaf-list service-instance {
type string;
description "the related network service instances of
the incident instance.";
}
leaf name {
type string;
mandatory true;
description "the name of an incident.";
}
leaf type {
type enumeration {
enum fault {
description "It indicates the type of the incident
is a fault, for example an interface
fails to work.";
}
enum potential-risk {
description "It indicates the type of the incident
is a potential risk, for example high
CPU rate may cause a fault in the
future.";
}
}
mandatory true;
description "The type of an incident.";
}
leaf domain {
type identityref {
base int:incident-domain;
}
mandatory true;
description "the domain of an incident.";
}
leaf priority {
type int:incident-priority;
mandatory true;
Feng, et al. Expires 2 August 2024 [Page 27]
Internet-Draft Network Incident Management January 2024
description "the priority of an incident instance.";
}
leaf status {
type enumeration {
enum raised {
description "an incident instance is raised.";
}
enum updated {
description "the information of an incident instance
is updated.";
}
enum cleared {
description "an incident is cleared.";
}
}
default raised;
description "The status of an incident instance.";
}
leaf ack-status {
type enumeration {
enum acknowledged {
description "The incident has been acknowledged by user.";
}
enum unacknowledged {
description "The incident hasn't been acknowledged.";
}
}
default unacknowledged;
description "the acknowledge status of an incident.";
}
leaf category {
type identityref {
base int:incident-category;
}
mandatory true;
description "The category of an incident.";
}
leaf detail {
type string;
description "detail information of this incident.";
}
leaf resolve-advice {
type string;
description "The advice to resolve this incident.";
}
container sources {
description "The source components.";
Feng, et al. Expires 2 August 2024 [Page 28]
Internet-Draft Network Incident Management January 2024
list source {
key node;
uses resources-info;
min-elements 1;
description "The source components of incident.";
}
}
container root-causes{
description "The root cause objects.";
list root-cause {
key node;
description "the root causes of incident.";
grouping root-cause-info {
description "The information of root cause.";
leaf cause-name {
type string;
description "the name of cause";
}
leaf detail {
type string;
description "the detail information of the cause.";
}
}
uses resources-info {
augment resource {
description "augment root cause information.";
//if root cause object is a resource of a node
uses root-cause-info;
}
}
//if root cause object is a node
uses root-cause-info;
}
}
container root-events {
description "the root events of the incident.";
list root-event {
key "type event-id";
description "the root event of the incident.";
leaf type {
type leafref {
path "../../../events/event/type";
}
description "the event type.";
}
leaf event-id {
type leafref {
Feng, et al. Expires 2 August 2024 [Page 29]
Internet-Draft Network Incident Management January 2024
path "../../../events/event[type = current()/../type]"
+"/event-id";
}
description "the event identifier, such as uuid,
sequence number, etc.";
}
}
}
container events {
description "related events.";
list event {
key "type event-id";
description "related events.";
leaf type {
type enumeration {
enum alarm {
description "alarm type";
}
enum inform {
description "inform type";
}
enum KPI {
description "KPI type";
}
enum unknown {
description "unknown type";
}
}
description "event type.";
}
leaf event-id {
type string;
description "the event identifier, such as uuid,
sequence number, etc.";
}
choice event-type-info {
description "event type information.";
case alarm {
when "type = 'alarm'";
container alarm {
description "alarm type event.";
leaf resource {
type leafref {
path "/al:alarms/al:alarm-list/al:alarm"
+"/al:resource";
}
description "network resource.";
reference "RFC 8632: A YANG Data Model for Alarm
Feng, et al. Expires 2 August 2024 [Page 30]
Internet-Draft Network Incident Management January 2024
Management";
}
leaf alarm-type-id {
type leafref {
path "/al:alarms/al:alarm-list/al:alarm"
+"[al:resource = current()/../resource]"
+"/al:alarm-type-id";
}
description "alarm type id";
reference "RFC 8632: A YANG Data Model for Alarm
Management";
}
leaf alarm-type-qualifier {
type leafref {
path "/al:alarms/al:alarm-list/al:alarm"
+"[al:resource = current()/../resource]"
+"[al:alarm-type-id = current()/.."
+"/alarm-type-id]/al:alarm-type-qualifier";
}
description "alarm type qualitifier";
reference "RFC 8632: A YANG Data Model for Alarm
Management";
}
}
}
case notification {
//TODO
}
case log {
//TODO
}
case KPI {
//TODO
}
case unknown {
//TODO
}
}
}
}
}
//data definitions
container incidents {
config false;
description "the information of incidents.";
list incident {
key incident-id;
Feng, et al. Expires 2 August 2024 [Page 31]
Internet-Draft Network Incident Management January 2024
description "the information of incident.";
leaf incident-id {
type string;
description "the identifier of an incident instance.";
}
uses incident-info;
uses incident-time-info;
}
}
// notifications
notification incident-notification {
description "incident notification. It will be triggered when
the incident is raised, updated or cleared.";
leaf incident-id {
type leafref {
path "/inc:incidents/inc:incident/inc:incident-id";
}
description "the identifier of an incident instance.";
}
uses incident-info;
leaf time {
type yang:date-and-time;
description "occur time of an incident instance.";
}
}
// rpcs
rpc incident-acknowledge {
description "This rpc can be used to acknowledge the specified
incidents.";
input {
leaf-list incident-id {
type leafref {
path "/inc:incidents/inc:incident/inc:incident-id";
}
description "the identifier of an incident instance.";
}
}
}
rpc incident-diagnose {
description "This rpc can be used to diagnose the specified
incidents. The result of diagnosis will be reported
by incident notification.";
input {
leaf-list incident-id {
type leafref {
path "/inc:incidents/inc:incident/inc:incident-id";
}
Feng, et al. Expires 2 August 2024 [Page 32]
Internet-Draft Network Incident Management January 2024
description
"the identifier of an incident instance.";
}
}
}
rpc incident-resolve {
description "This rpc can be used to resolve the specified
incidents. The result of resolution will be reported
by incident notification.";
input {
leaf-list incident-id {
type leafref {
path "/inc:incidents/inc:incident/inc:incident-id";
}
description
"the identifier of an incident instance.";
}
}
}
}
<CODE ENDS>
9. IANA Considerations
9.1. The "IETF XML" Registry
This document registers one XML namespace URN in the 'IETF XML
registry', following the format defined in [RFC3688].
URI: urn:ietf:params:xml:ns:yang:ietf-incident
Registrant Contact: The IESG.
XML: N/A, the requested URIs are XML namespaces.
9.2. The "YANG Module Names" Registry
This document registers one module name in the 'YANG Module Names'
registry, defined in [RFC6020].
name: ietf-incident
prefix: inc
namespace: urn:ietf:params:xml:ns:yang:ietf-incident
RFC: XXXX
// RFC Ed.: replace XXXX and remove this comment
Feng, et al. Expires 2 August 2024 [Page 33]
Internet-Draft Network Incident Management January 2024
10. Security Considerations
The YANG modules specified in this document define a schema for data
that is designed to be accessed via network management protocol such
as NETCONF [RFC6241] or RESTCONF [RFC8040]. The lowest NETCONF layer
is the secure transport layer, and the mandatory-to-implement secure
transport is Secure Shell (SSH) [RFC6242]. The lowest RESTCONF layer
is HTTPS, and the mandatory-to-implement secure transport is TLS
[RFC8446].
The Network Configuration Access Control Model (NACM) [RFC8341]
provides the means to restrict access for particular NETCONF or
RESTCONF users to a preconfigured subset of all available NETCONF or
RESTCONF protocol operations and content.
There are a number of data nodes defined in this YANG module that are
writable/creatable/deletable (i.e., config true, which is the
default). These data nodes may be considered sensitive or vulnerable
in some network environments. Write operations (e.g., edit-config)
to these data nodes without proper protection can have a negative
effect on network operations. These are the subtrees and data nodes
and their sensitivity/vulnerability:
Some of the readable data nodes in this YANG module may be considered
sensitive or vulnerable in some network environments. It is thus
important to control read access (e.g., via get, get-config, or
notification) to these data nodes. These are the subtrees and data
nodes and their sensitivity/vulnerability:
Some of the RPC operations in this YANG module may be considered
sensitive or vulnerable in some network environments. It is thus
important to control access to these operations. These are the
operations and their sensitivity/vulnerability:
11. Contributors
MingShuang Jin
Huawei Technologies
jinmingshuang@huawei.com
Chunchi Liu
Huawei Technologies
liuchunchi@huawei.com
Aihua Guo
Futurewei Technologies
aihuaguo.ietf@gmail.com
Feng, et al. Expires 2 August 2024 [Page 34]
Internet-Draft Network Incident Management January 2024
Zhidong Yin
Huawei
yinzhidong@huawei.com
Guoxiang Liu
Huawei
liuguoxiang@huawei.com
Kaichun Wu
Huawei
wukaichun@huawei.com
Yanlei Zheng
China Unicom
zhengyanlei@chinaunicom.cn
Yunbin Xu
CAICT
xuyunbin@caict.ac.cn
Xing Zhao
CAICT
zhaoxing@caict.ac.cn
12. Acknowledgments
The authors would like to thank Mohamed Boucadair, Robert Wilton,
Benoit Claise, Oscar Gonzalez de Dios, Adrian Farrel, Mahesh
Jethanandani, Balazs Lengyel, Bo Wu, Qiufang Ma, Haomian Zheng,
YuanYao for their valuable comments and great input to this work.
13. References
13.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC3688] Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688,
DOI 10.17487/RFC3688, January 2004,
<https://www.rfc-editor.org/info/rfc3688>.
[RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for
the Network Configuration Protocol (NETCONF)", RFC 6020,
DOI 10.17487/RFC6020, October 2010,
<https://www.rfc-editor.org/info/rfc6020>.
Feng, et al. Expires 2 August 2024 [Page 35]
Internet-Draft Network Incident Management January 2024
[RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
RFC 7950, DOI 10.17487/RFC7950, August 2016,
<https://www.rfc-editor.org/info/rfc7950>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
13.2. Informative References
[BERT] "BERT (language model)",
<https://en.wikipedia.org/wiki/BERT_(language_model)>.
[I-D.ietf-ippm-pam]
Mirsky, G., Halpern, J. M., Min, X., Clemm, A., Strassner,
J., and J. François, "Precision Availability Metrics for
Services Governed by Service Level Objectives (SLOs)",
Work in Progress, Internet-Draft, draft-ietf-ippm-pam-09,
1 December 2023, <https://datatracker.ietf.org/doc/html/
draft-ietf-ippm-pam-09>.
[I-D.quilbeuf-opsawg-configuration-tracing]
Quilbeuf, J., Claise, B., Graf, T., Lopez, D., and S.
Qiong, "External Trace ID for Configuration Tracing", Work
in Progress, Internet-Draft, draft-quilbeuf-opsawg-
configuration-tracing-02, 10 July 2023,
<https://datatracker.ietf.org/doc/html/draft-quilbeuf-
opsawg-configuration-tracing-02>.
[I-D.rogaglia-netconf-trace-ctx-extension]
Gagliano, R., Larsson, K., and J. Lindblad, "NETCONF
Extension to support Trace Context propagation", Work in
Progress, Internet-Draft, draft-rogaglia-netconf-trace-
ctx-extension-03, 6 July 2023,
<https://datatracker.ietf.org/doc/html/draft-rogaglia-
netconf-trace-ctx-extension-03>.
[I-D.tgraf-yang-push-observation-time]
Graf, T., Claise, B., and A. H. Feng, "Support of Network
Observation Timestamping in YANG Notifications", Work in
Progress, Internet-Draft, draft-tgraf-yang-push-
observation-time-00, 4 March 2023,
<https://datatracker.ietf.org/doc/html/draft-tgraf-yang-
push-observation-time-00>.
Feng, et al. Expires 2 August 2024 [Page 36]
Internet-Draft Network Incident Management January 2024
[RFC8345] Clemm, A., Medved, J., Varga, R., Bahadur, N.,
Ananthakrishnan, H., and X. Liu, "A YANG Data Model for
Network Topologies", RFC 8345, DOI 10.17487/RFC8345, March
2018, <https://www.rfc-editor.org/info/rfc8345>.
[RFC8632] Vallin, S. and M. Bjorklund, "A YANG Data Model for Alarm
Management", RFC 8632, DOI 10.17487/RFC8632, September
2019, <https://www.rfc-editor.org/info/rfc8632>.
[RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and
L. Geng, "A Framework for Automating Service and Network
Management with YANG", RFC 8969, DOI 10.17487/RFC8969,
January 2021, <https://www.rfc-editor.org/info/rfc8969>.
[RFC9375] Wu, B., Ed., Wu, Q., Ed., Boucadair, M., Ed., Gonzalez de
Dios, O., and B. Wen, "A YANG Data Model for Network and
VPN Service Performance Monitoring", RFC 9375,
DOI 10.17487/RFC9375, April 2023,
<https://www.rfc-editor.org/info/rfc9375>.
[RFC9417] Claise, B., Quilbeuf, J., Lopez, D., Voyer, D., and T.
Arumugam, "Service Assurance for Intent-Based Networking
Architecture", RFC 9417, DOI 10.17487/RFC9417, July 2023,
<https://www.rfc-editor.org/info/rfc9417>.
[RFC9418] Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T.
Arumugam, "A YANG Data Model for Service Assurance",
RFC 9418, DOI 10.17487/RFC9418, July 2023,
<https://www.rfc-editor.org/info/rfc9418>.
[TMF724A] TMF, "Incident Management API Profile v1.0.0", 3 April
2023, <https://www.tmforum.org/resources/standard/tmf724a-
incident-management-api-profile-v1-0-0/>.
[W3C-Trace-Context]
W3C, "W3C Recommendation on Trace Context", 23 November
2021, <https://www.w3.org/TR/2021/REC-trace-context-
1-20211123/>.
Appendix A. Changes between revisions
[RFC editor: please remove this section before publication.]
v03 - v04
* Update incident defintion based on TMF incident API profile
specification.
Feng, et al. Expires 2 August 2024 [Page 37]
Internet-Draft Network Incident Management January 2024
* Update use case on Multi-layer Fault Demarcation based on side
meeting discussion and IETF 119 session discussion.
* Update section 5.1 to explain how network incident is generated
based on other factors.
* Add one new use cases on Security Events noise reduction based on
Situation Awareness.
* Other Editorial changes.
v02 - v03
* Add one new use cases on Incident Generation.
* Add reference to Precision Availability Metric defined in IPPM PAM
WG document.
v01 - v02
* A few Editorial change to YANG data models in section 8.
* Add some text to the model design overview.
* Revise sample use cases section to focus on two key use cases.
* Motivation and goal clarification in the introduction section.
v00 - v01
* Modify the introduction.
* Rename incident agent to incident server.
* Add the interworking with alarm management.
* Add the interworking with SAIN.
* Add the relationship with RFC8969.
* Add the relationship with observation timestamp and trace context.
* Clarify the incident identification process.
* Modify the work flow of incident diagnosis and resolution.
* Remove identities and typedefs from ietf-incident YANG module, and
create a new YANG module called ietf-incident-types.
Feng, et al. Expires 2 August 2024 [Page 38]
Internet-Draft Network Incident Management January 2024
* Modify ietf-incident YANG module, for example, modify incident-
diagnose rpc and incident-resolve rpc.
Authors' Addresses
Chong Feng (editor)
101 Software Avenue, Yuhua District
Nanjing
Jiangsu, 210012
China
Email: fengchonglly@gmail.com
Tong Hu
China Mobile (Hangzhou) Information Technology Co., Ltd
Building A01, 1600 Yuhangtang Road, Wuchang Street, Yuhang District
Hangzhou
ZheJiang, 311121
China
Email: hutong@cmhi.chinamobile.com
Luis Miguel Contreras Murillo
Telefonica I+D
Madrid
Spain
Email: luismiguel.contrerasmurillo@telefonica.com
Thomas Graf
Swisscom
Binzring 17
CH-8045 Zurich
Switzerland
Email: thomas.graf@swisscom.com
Qin Wu
Huawei
101 Software Avenue, Yuhua District
Nanjing
Jiangsu, 210012
China
Email: bill.wu@huawei.com
Chaode Yu
Huawei
Feng, et al. Expires 2 August 2024 [Page 39]
Internet-Draft Network Incident Management January 2024
Email: yuchaode@huawei.com
Nigel Davis
Ciena
Email: ndavis@ciena.com
Feng, et al. Expires 2 August 2024 [Page 40]