Internet Research Task Force                                      Y. Cui
Internet-Draft                                                    Y. Wei
Intended status: Informational                                     Z. Xu
Expires: 17 October 2023                             Tsinghua University
                                                                  P. Liu
                                                                   Z. Du
                                                            China Mobile
                                                           15 April 2023

      Graph Neural Network Based Modeling for Digital Twin Network


   This draft introduces the scenarios and requirements for performance
   modeling of digital twin networks, and explores the implementation
   methods of network models, proposing a network modeling method based
   on graph neural networks (GNNs).  This method combines GNNs with
   graph sampling techniques to improve the expressiveness and
   granularity of the model.  The model is generated through data
   training and validated with typical scenarios.  The model performs
   well in predicting QoS metrics such as network latency, providing a
   reference option for network performance modeling methods.

Table of Contents

   1.  Introduction
   2.  Definition of Terms
   3.  Scenarios, Requirements and Challenges of Network Modeling for
           DTN
     3.1.  Scenarios
     3.2.  Requirements
     3.3.  Main Challenges
   4.  Modeling Digital Twin Networks
     4.1.  Consideration/Analysis on Network Modeling Methods
     4.2.  Network Modeling Framework
     4.3.  Building a Network Model
       4.3.1.  Networking System as a Relation Graph
       4.3.2.  Message-passing on the Heterogeneous Graph
       4.3.3.  State Transition Learning
       4.3.4.  Model Training
     4.4.  Model Performance in Data Center Networks and Wide Area
           Networks
       4.4.1.  QoS Inference in Data Center Networks
       4.4.2.  Time-Series Prediction in Data Center Networks
       4.4.3.  Steady-State QoS Inference in Wide Area Networks
   5.  Conclusion
   6.  Security Considerations
   7.  IANA Considerations
   8.  Informative References
   Acknowledgements
   Authors' Addresses

1.  Introduction

   Digital twin networks are virtual images (or simulations) of physical
   network infrastructures that can help network designers achieve
   simplified, automated, elastic, and full-lifecycle operations.  The
   task of network modeling is to predict how network performance
   metrics, such as throughput and latency, change in various "what-if"
   scenarios [I-D.irtf-nmrg-network-digital-twin-arch], such as changes
   in traffic conditions and reconfigurations of network devices.  This
   document proposes a network performance modeling framework based on
   graph neural networks, which supports modeling various network
   configurations including topology, routing, and caching, and can make
   time-series predictions of flow-level performance metrics.

2.  Definition of Terms

   This document makes use of the following terms:

   DTN:  Digital twin networks.

   GNN:  Graph neural network.

   NGN:  Networking Graph Networks.

3.  Scenarios, Requirements and Challenges of Network Modeling for DTN

3.1.  Scenarios

   Digital twin networks are digital virtual mappings of physical
   networks, and some of their main applications include network
   technology experiments, network configuration validation, network
   performance optimization, etc.  All of these applications require
   accurate network models in the twin network to enable precise
   simulation and prediction of the functionality and performance
   characteristics of the physical network.

   This document mainly focuses on network performance modeling, while
   network functionality modeling is outside the scope of this document.

3.2.  Requirements

   Physical networks are composed of various network elements and links
   between them, and different network elements have different
   functionalities and performance characteristics.  In the early
   planning stages of the network lifecycle, the physical network does
   not fully exist, but the network owner hopes to predict the network's
   capabilities and effects based on the network model and its
   simulation, to determine whether the network can meet the future
   application requirements running on it, such as network throughput
   capacity and network latency requirements, and to build the network
   at the optimal cost.  During the network operation stage, network
   performance modeling can work in conjunction with the online physical
   network to achieve network changes and optimization, and reduce
   network operation risks and costs.  Therefore, network modeling
   requires the ability to capture the various performance-related
   factors in the physical network as accurately as possible.  This
   leads to requirements in the following two aspects:

   (1) In order to produce accurate predictions, a network model must
   have sufficient expressiveness to include as many influencing factors
   related to network performance indicators as possible.  Otherwise, it
   will inevitably fail to generalize to more general network
   environments.  Among these factors, network configuration can span
   different levels of operation, from end hosts to network devices.
   For example, congestion control at the host level, scheduling
   strategies and active queue management at the queue level, bandwidth
   and propagation delay at the link level, shared buffer management
   strategies at the device level, and topology and routing schemes at
   the network level are all performance-related factors.

   (2) In different network scenarios, the granularity of concern for
   operators may vary greatly.  In wide area network scenarios,
   operators primarily focus on the long-term average performance of
   aggregated traffic, where path-level steady-state modeling is usually
   sufficient to guide the planning process (e.g., traffic engineering).
   In local area networks and cloud data center networks, operators are
   more concerned with meeting performance metrics such as latency and
   throughput, as well as network infrastructure utilization.  However,
   fine-grained network performance observation is a goal that network
   operators and cloud providers continuously strive for, in order to
   provide precise information about when and which traffic is being
   interfered with.  This requires network models to support flow-level
   time series performance prediction.

3.3.  Main Challenges

   (1) Challenges related to the large state space.  Because the model
   must be expressive enough to cover many influencing factors, the
   number of potential scenarios that it faces is large.
   This is because network systems typically consist of dozens to
   hundreds of network nodes, each of which may contain multiple
   configurations, leading to an explosion in the combination of
   potential states.  One simple solution to build a network model is to
   construct a large neural network that takes flat feature vectors
   containing all configuration information as input.  However, the
   input size of such a neural network is fixed, and it cannot be scaled
   to handle information from an arbitrary number of nodes and
   configurations.  The final complexity of the neural network will
   increase with the number of configurations, making it difficult to
   train and generalize.
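
   To make the scaling problem concrete, the following minimal Python
   sketch counts the joint configurations of a small network and shows
   that the width of a flat input vector depends on the node count.  The
   function names and the example numbers are illustrative assumptions,
   not values from this document's experiments.

```python
from math import prod


def joint_state_count(options_per_node):
    """Number of joint configurations when every node independently
    picks one of its candidate configurations."""
    return prod(options_per_node)


def flat_feature_length(num_nodes, features_per_node):
    """Input width of a flat-vector model.  It grows with the node
    count, so a model built for one network size cannot take another."""
    return num_nodes * features_per_node


# Assumed example: 20 nodes, each with 4 candidate configurations.
states = joint_state_count([4] * 20)

# Adding two nodes changes the required input width of the flat model.
width_10 = flat_feature_length(10, 3)
width_12 = flat_feature_length(12, 3)
```

   Even this small assumed example yields 4^20 joint states, and the two
   widths differ, which is why a fixed-size flat input does not scale.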

   (2) Challenges related to modeling granularity.  Unlike aggregated
   end-to-end path-level traffic, the transmission behavior of flows
   undergoes cascading effects, since it is typically governed by a
   control loop (e.g., congestion control).  Once control-related
   configurations (e.g., ECN threshold, queue buffer size) change
   during flow transmission, the resulting flow traffic measurements
   (e.g., throughput and packet loss) change significantly, and the
   traffic state measured at that time will not yet reflect the results
   of these changes.  Therefore, predicting flow-level performance from
   traffic demand may be more difficult than inferring QoS from traffic
   measurements.  In this document, "inference" means using traffic
   measurements as input to estimate the corresponding QoS, while
   "prediction" means using traffic demand as input to output
   flow-level performance (e.g., flow completion time) for a
   hypothetical scenario.

4.  Modeling Digital Twin Networks

4.1.  Consideration/Analysis on Network Modeling Methods

   Traditional network modeling typically uses methods such as queuing
   theory and network calculus, which mainly model from the perspective
   of queues and their forwarding capabilities.  In the construction of
   operator networks, network elements come from different device
   vendors with varying processing capabilities, and these differences
   lack precise quantification.  Therefore, modeling networks built with
   these devices is a very complex task.  In addition to queue
   forwarding behavior, the network itself is also influenced by various
   configuration policies and related network features (such as ECN,
   Policy Routing, etc.), and coupled with the flexibility of network
   size, this method is difficult to adapt to the modeling requirements
   of digital twin networks.

   In recent years, the academic community has proposed data-driven
   graph neural network (GNN) methods, which extend existing neural
   networks for systems represented in graph form.  Networks themselves
   are a kind of graph structure, and GNNs can be used to learn the
   complex network behavior from the data.  The advantage of GNNs is
   their ability to model non-linear relationships and adapt to
   different types of data.  Combining GNNs with graph sampling
   techniques further improves the expressiveness and granularity of
   network models.  This method samples subgraphs from the original
   network based on specific criteria, such as degree of connectivity
   and centrality.  These subgraphs are then used to train a GNN model
   that captures the most relevant network features.  Experimental
   results show that this method can improve modeling accuracy compared
   with existing approaches.
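
   As a rough illustration of criterion-based subgraph sampling, the
   sketch below keeps the k highest-degree nodes and the edges among
   them.  This document does not specify the sampling procedure, so the
   deterministic top-k rule, the function name, and the example edge
   list are all assumptions made for illustration only.

```python
def sample_by_degree(edges, k):
    """Keep the k highest-degree nodes and the edges among them.

    edges: list of (u, v) pairs of node names (undirected).
    Returns (kept_nodes, kept_edges).
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Sort by descending degree, then by name for a deterministic result.
    ranked = sorted(degree, key=lambda n: (-degree[n], n))
    keep = set(ranked[:k])
    sub_edges = [(u, v) for u, v in edges if u in keep and v in keep]
    return keep, sub_edges


# Assumed toy topology: node "a" has degree 3, "b" and "c" 2, "d" 1.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
nodes, sub_edges = sample_by_degree(edges, 3)
```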

   This document will introduce a method of network modeling using graph
   neural networks (GNNs) as a technical option for providing network
   modeling for DTN.

4.2.  Network Modeling Framework

   +--------------------+
   | +----------------+ | +----------------------+   +-----------------+
   | |    Intent      |-->|Network Graph Abstract|-->|NGN Configuration|
   | +----------------+ | +----------^-----------+   +-------+---------+
   |                    |            |                       |
   | +----------------+ |            |              +--------V---------+
   | |Domain Knowledge|--------------+              | State Transition |
   | +----------------+ |                           |Model Construction|
   |                    |                           +--------+---------+
   |                    |                                    |
   | +----------------+ |    +---------------+     +---------V---------+
   | |     Data       |----->|Model Training |<----| Network Model Desc|
   | +----------------+ |    +-------+-------+     +-------------------+
   |                    |            |
   |  Target Network    |    +-------V-------+
   +--------------------+    | Network Model |
                             +---------------+

           Figure 1: Network modeling design process

   Network modeling design process:

   1.  Before modeling, determine the network configurations and
   modeling granularity based on the modeling intent.

   2.  Use domain knowledge from network experts to abstract the network
   system into a network relationship graph to represent the complex
   relationships between different network entities.

   3.  Build the network model using configurable graph neural network
   modules and determine the form of the aggregation function based on
   the properties of the relationships.

   4.  Use a recurrent graph neural network to model the changes in
   network state between adjacent time steps.

   5.  Train the model parameters using the collected data.

4.3.  Building a Network Model

   This section describes the process and results of network modeling,
   i.e., Steps 2 to 5 of the network modeling design process in
   Section 4.2.

4.3.1.  Networking System as a Relation Graph

   A network system is represented as a heterogeneous relationship graph
   (referred to as "graph" hereafter) to provide a unified interface for
   modeling various network configurations and their complex
   relationships.  Network entities related to performance are mapped as
   graph nodes with relevant characteristics.  Heterogeneous nodes
   represent different network entities based on their attributes or
   configurations.  Edges in the graph connect nodes that are considered
   directly related.  There are two types of nodes in the graph,
   physical nodes representing specific network entities with local
   configurations (e.g., switches with buffers of a certain size), and
   virtual nodes representing performance-related entities (e.g., flows
   or paths), thus allowing final performance metrics to be attached to
   the graph.  Edges reflect the relationships between entities and can
   be used to embed domain knowledge-induced biases.  Specifically,
   edges can be used to model local or global configurations.
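
   The relation graph described above can be sketched as a small data
   structure.  The class names, the node types ("switch", "link",
   "flow"), and the feature keys below are hypothetical and chosen only
   to illustrate physical versus virtual nodes; they are not an
   interface defined by this document.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str        # node type, e.g. "switch", "link", "flow"
    virtual: bool    # True for performance-related entities
    features: dict = field(default_factory=dict)


@dataclass
class RelationGraph:
    nodes: dict = field(default_factory=dict)  # name -> Node
    edges: set = field(default_factory=set)    # frozenset({a, b})

    def add_node(self, node):
        self.nodes[node.name] = node

    def relate(self, a, b):
        """Connect two distinct entities that are directly related."""
        if a == b or a not in self.nodes or b not in self.nodes:
            raise ValueError("need two distinct, existing nodes")
        self.edges.add(frozenset((a, b)))

    def neighbors(self, name, kind=None):
        """Sorted neighbor names, optionally filtered by node type."""
        found = []
        for edge in self.edges:
            if name in edge:
                (other,) = edge - {name}
                if kind is None or self.nodes[other].kind == kind:
                    found.append(other)
        return sorted(found)


def build_example():
    g = RelationGraph()
    g.add_node(Node("s1", "switch", False, {"buffer_kb": 512}))
    g.add_node(Node("s2", "switch", False, {"buffer_kb": 256}))
    g.add_node(Node("l12", "link", False, {"bandwidth_gbps": 10.0}))
    g.add_node(Node("f1", "flow", True, {"size_kb": 1000}))
    g.relate("s1", "l12")
    g.relate("l12", "s2")
    # A flow is related to every physical entity it traverses.
    for hop in ("s1", "l12", "s2"):
        g.relate("f1", hop)
    return g


graph = build_example()
```

   Attaching the flow as a virtual node gives the final performance
   metrics (for example, a predicted completion time) a place to live in
   the graph alongside the physical configuration.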

4.3.2.  Message-passing on the Heterogeneous Graph

   Networking Graph Networks (NGN) [battaglia2018] are used as the
   fundamental building block for network modeling.  An NGN module is
   defined as a "graph-to-graph" module with heterogeneous nodes that
   takes an attribute graph as input and, after a series of
   message-passing steps, outputs another graph with different
   attributes.  Attributes represent the features of nodes and are
   represented as tensors of fixed dimensions.  Each NGN block contains
   multiple configurable functions, such as aggregation, transformation,
   and update functions, which can be implemented using standard neural
   networks and shared among same-type nodes.  The aggregation function
   can take the form of a simple sum or an RNN, while the transformation
   function can map the information of heterogeneous nodes to the same
   hidden space of the target type nodes, allowing for unified
   operations in the update function without limiting the modeling
   capability of GNNs.

   One feed-forward NGN pass can be viewed as one step of message
   passing on the graph.  In each round of message passing, nodes
   aggregate same-type messages using the corresponding aggregation
   function and transform the aggregated messages using the type
   transformation function to handle heterogeneous nodes.  The
   transformed messages are then fed into the update function to update
   the node's state.  After a specified number of rounds of message
   passing, a readout function produces the final performance metrics.

   Typically, NGNs first perform a global update and then independent
   local updates for nodes in each local domain.  Circular dependencies
   between different update operations can be resolved through multiple
   rounds of message passing.
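
   One round of the message passing described in this section can be
   sketched as follows.  This is a simplified illustration under stated
   assumptions: aggregation is a plain sum, the transformation and
   update functions are hand-written lambdas standing in for learned
   neural networks, and nodes of the same type share a state dimension.
   All names are hypothetical.

```python
def message_passing_round(states, kinds, neighbors, transform, update):
    """One synchronous round on a heterogeneous graph.

    states:    name -> list of floats (hidden state per node)
    kinds:     name -> node type
    neighbors: name -> list of neighbor names
    transform: (src_type, dst_type) -> function(list) -> list
    update:    dst_type -> function(state, message) -> new state
    """
    new_states = {}
    for node, state in states.items():
        # 1. Aggregate same-type messages (sum, per neighbor type).
        per_kind = {}
        for nb in neighbors.get(node, []):
            acc = per_kind.setdefault(kinds[nb], [0.0] * len(states[nb]))
            for i, value in enumerate(states[nb]):
                acc[i] += value
        # 2. Map each aggregate into the target type's hidden space.
        merged = [0.0] * len(state)
        for src_kind in sorted(per_kind):
            msg = transform[(src_kind, kinds[node])](per_kind[src_kind])
            for i in range(len(state)):
                merged[i] += msg[i]
        # 3. Update the node from its old state and the merged message.
        new_states[node] = update[kinds[node]](state, merged)
    return new_states


# Toy graph (assumed): one flow traversing two switches.
kinds = {"f1": "flow", "s1": "switch", "s2": "switch"}
neighbors = {"f1": ["s1", "s2"], "s1": ["f1"], "s2": ["f1"]}
states = {"f1": [1.0], "s1": [2.0, 0.0], "s2": [3.0, 1.0]}
transform = {
    ("switch", "flow"): lambda m: [m[0] + m[1]],  # 2-d to 1-d
    ("flow", "switch"): lambda m: [m[0], 0.0],    # 1-d to 2-d
}
update = {
    "flow": lambda s, m: [0.5 * s[0] + 0.5 * m[0]],
    "switch": lambda s, m: [s[i] + m[i] for i in range(len(s))],
}
out = message_passing_round(states, kinds, neighbors, transform, update)
```

   Because every node reads only the previous round's states, repeated
   calls correspond to multiple rounds of message passing.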

4.3.3.  State Transition Learning

   The network model needs to support fine-grained prediction
   granularity and transient prediction (such as the state of a flow) at
   short time scales.  To achieve this, this document uses the recurrent
   form of the NGN module to learn to predict future states from the
   current state.  The model advances one time step at a time and has
   an "encoder-processor-decoder" structure.

                          +-------------------+
                          | +--------------+  |
                          | | +----------+ |  |
   G_hidden(t-1)---^----->| +>| NGN_core |-+  |------+----->G_hidden(t)
                   |      |   +----------+    |      |
            +------+----+ |Message passing x M| +----V------+
   G_in(t)->|NGN_encoder| +-------------------+ |NGN_decoder|->G_out(t)
            +-----------+      Processor        +-----------+

            Figure 2: State transition learning

   These three components are NGN modules with the same abstract graph
   but different neural network parameters.

   Encoder: converts the input state into a fixed-dimensional vector,
   independently encoding different nodes, ignoring relationships
   between nodes, and not performing message passing.

   Processor: performs M rounds of message passing, with the input being
   the output of the encoder and the previous output of the processor.

   Decoder: independently decodes different nodes as the readout
   function, extracting dynamic information from the hidden graph,
   including the current performance metrics and the state used for the
   next step state update.  Note that the next graph G_(t+1) is updated
   according to G_out(t), which is not shown in Figure 2.

   To support state transition modeling, the model distinguishes between
   the static and dynamic features of the network system and represents
   them as different graphs.  The static graph contains the static
   configuration of the system, including physical node configurations
   (such as queue priorities and switch buffer sizes) and virtual node
   configurations (such as flow sizes).  The dynamic graph contains the
   temporary state of the system, mainly related to virtual nodes (such
   as the remaining size of a flow or end-to-end delay of a path).  In
   addition, when considering dynamic configurations (such as time-
   varying ECN thresholds), the actions taken (i.e., new configurations)
   should be placed in the dynamic graph and input at each time step.
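
   The recurrent encoder-processor-decoder loop can be sketched as
   below.  This is one possible reading of Figure 2, not a definitive
   implementation: the callables are toy stand-ins for the NGN_encoder,
   NGN_core, and NGN_decoder modules, a "graph" is reduced to a dict of
   node name to float, and next_input is a hypothetical rule for
   deriving the next input graph from the current output.

```python
def step(g_in, hidden_prev, encoder, core, decoder, rounds):
    """One time step: encode, run `rounds` core passes, then decode."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    latent = encoder(g_in)       # per-node encoding, no message passing
    hidden = hidden_prev
    for _ in range(rounds):      # processor: M rounds of the core module
        hidden = core(latent, hidden)
    g_out = decoder(hidden)      # per-node readout
    return g_out, hidden


def rollout(g_in, hidden, next_input, encoder, core, decoder,
            rounds, steps):
    """Run several time steps, feeding each output into the next input."""
    trajectory = []
    for _ in range(steps):
        g_out, hidden = step(g_in, hidden, encoder, core, decoder, rounds)
        trajectory.append(g_out)
        g_in = next_input(g_in, g_out)
    return trajectory


# Toy stand-ins (assumptions, chosen only so the loop is runnable).
encoder = lambda g: dict(g)
core = lambda latent, hidden: {
    k: 0.5 * (latent[k] + hidden[k]) for k in latent
}
decoder = lambda h: dict(h)
next_input = lambda g_in, g_out: {
    k: g_in[k] - 0.25 * g_out[k] for k in g_in
}

trajectory = rollout({"f1": 8.0}, {"f1": 0.0}, next_input,
                     encoder, core, decoder, rounds=2, steps=2)
```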

4.3.4.  Model Training

   The L2 loss between the predicted values and the corresponding true
   values is used to supervise the output features of each node
   generated by the decoder for model training.  To generate long-term
   prediction trajectories, the model iteratively feeds back the updated
   absolute state prediction values to the model as input.  As a data
   preprocessing and postprocessing step, the input and output of the
   NGN model are standardized.
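
   A minimal sketch of the standardization and loss computation is given
   below.  It assumes zero-mean, unit-variance scaling and takes the L2
   loss as the mean of squared errors over node outputs, which is one
   common form; the class and function names are hypothetical.

```python
from statistics import fmean, pstdev


class Standardizer:
    """Zero-mean, unit-variance scaling fitted on training samples."""

    def __init__(self, samples):
        self.mean = fmean(samples)
        # Fall back to 1.0 so constant features do not divide by zero.
        self.std = pstdev(samples) or 1.0

    def encode(self, x):
        return (x - self.mean) / self.std

    def decode(self, z):
        return z * self.std + self.mean


def l2_loss(predicted, target):
    """Mean squared error between per-node predictions and targets."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return total / len(predicted)


# Assumed example values for a single feature.
scaler = Standardizer([1.0, 2.0, 3.0])
loss = l2_loss([1.0, 2.0], [1.0, 4.0])
```

   Predictions made in the standardized space are mapped back with
   decode() before being reported or fed into the next time step.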

4.4.  Model Performance in Data Center Networks and Wide Area Networks

4.4.1.  QoS Inference in Data Center Networks

   This use case aims to verify whether the model can accurately perform
   time-series inference and generalize to unseen configurations,
   demonstrating the application of online performance monitoring.  The
   network model needs to infer the evolution of path-level latency in
   the time series given real-time measurements of traffic on the given
   path.  The dataset used in this scenario is generated by ns-3
   [NS-3].  Under specific experimental settings, the MAPE of path-level
   latency is kept below 7% [wang2022].
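
   For reference, the mean absolute percentage error (MAPE) used above
   can be computed as in the following sketch.  The function name and
   the sample values are illustrative and are not taken from the
   evaluation in [wang2022].

```python
def mape(predicted, actual):
    """Mean absolute percentage error, in percent.

    Every actual value must be non-zero, since each error is divided
    by the corresponding true value.
    """
    if not actual or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    ratios = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(ratios) / len(ratios)


# Assumed latencies in milliseconds: each prediction is off by 10%.
error = mape([110.0, 90.0], [100.0, 100.0])
```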

4.4.2.  Time-Series Prediction in Data Center Networks

   This use case verifies whether the model can provide flow-level time-
   series modeling capability under different configurations.  Unlike
   the previous case, the behavior of the network model in this case is
   like a network simulator, which needs to predict the Flow Completion
   Time (FCT) without traffic collection information, only using flow
   descriptions and static topology information as input.  The dataset
   used in this scenario is generated by ns-3 [NS-3].  Under specific
   experimental settings, the predicted FCT distribution matches the
   true distribution well, with a Pearson correlation coefficient of 0.9
   [wang2022].  In addition, the model can also predict throughput,
   latency, and other path/flow-level metrics in time-series prediction.
   This use case verifies the model's ability in time-series prediction,
   and theoretical analysis combined with experimental verification
   shows that the model does not have cumulative errors in long-term
   time-series prediction.
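
   The Pearson correlation coefficient between predicted and true FCT
   values can be computed as sketched below.  The function name and the
   sample data are assumptions for illustration and do not reproduce the
   reported experiment.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Assumed FCT samples: predictions that scale linearly with the truth.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

   A value close to 1 indicates that predicted and true values rise and
   fall together; it does not by itself bound the absolute error.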

4.4.3.  Steady-State QoS Inference in Wide Area Networks

   This use case aims to verify that the model can work in the Wide Area
   Network (WAN) scenario and demonstrate that the model can effectively
   model and generalize to global and local configurations, which
   reflects the application of offline network planning.  It is worth
   noting that the WAN scenario has more topology changes compared to
   the data center network scenario, which imposes higher demand on the
   model's performance.  A public network modeling dataset [NM-D] is
   used in this scenario for evaluation.  Under specific experimental
   settings, the model is experimentally verified in three different WAN
   topologies, including NSFnet, GEANT2, and RedIRIS, and achieves a
   50th percentile APE of 10% for path-level latency, which is
   comparable to the performance of the domain-specific model RouteNet
   [rusek2019].  This use case verifies the model's generalization in
   topology and configuration and its versatility in the scenario.

5.  Conclusion

   This draft presents a network performance modeling method based on
   graph neural networks, addressing the problems and challenges in
   network modeling in terms of expressiveness and modeling granularity.
   The model's versatility and generalization are verified in typical
   network scenarios, and good prediction performance is achieved.

6.  Security Considerations


7.  IANA Considerations


8.  Informative References

   [battaglia2018]
              Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-
              Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A.,
              Raposo, D., Santoro, A., Faulkner, R., and others,
              "Relational inductive biases, deep learning, and graph
              networks", 2018.

   [I-D.irtf-nmrg-network-digital-twin-arch]
              Zhou, C., Yang, H., Duan, X., Lopez, D., Pastor, A., Wu,
              Q., Boucadair, M., and C. Jacquenet, "Digital Twin
              Network: Concepts and Reference Architecture", Work in
              Progress, Internet-Draft, draft-irtf-nmrg-network-digital-
              twin-arch-02, 24 October 2022.

   [NM-D]     "Network Modeling Datasets".

   [NS-3]     "Network Simulator, NS-3".

   [rusek2019]
              Rusek, K., Suarez-Varela, J., Mestres, A., Barlet-Ros, P.,
              and A. Cabellos-Aparicio, "Unveiling the potential of
              Graph Neural Networks for network modeling and
              optimization in SDN", 2019.

   [wang2022] Wang, M., Hui, L., Cui, Y., Liang, R., and Z. Liu, "xNet:
              Improving Expressiveness and Granularity for Network
              Modeling with Graph Neural Networks", IEEE INFOCOM, 2022.


Authors' Addresses

   Yong Cui
   Tsinghua University
   30 Shuangqing Rd, Haidian District

   Yunze Wei
   Tsinghua University
   30 Shuangqing Rd, Haidian District

   Zhiyong Xu
   Tsinghua University
   30 Shuangqing Rd, Haidian District

   Peng Liu
   China Mobile
   No.32 XuanWuMen West Street

   Zongpeng Du
   China Mobile
   No.32 XuanWuMen West Street

