<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc sortrefs="yes"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc topblock="yes"?>
<?rfc comments="no"?>
<rfc category="info" docName="draft-zcz-nmrg-digitaltwin-data-collection-04"
     ipr="trust200902">
  <front>
    <title abbrev="Network Working Group">Data Collection Requirements and
    Technologies for Network Digital Twin</title>

    <author fullname="Cheng Zhou" initials="C." surname="Zhou">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>zhouchengyjy@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Danyang Chen" initials="D." surname="Chen">
      <organization>China Mobile</organization>

      <address>
        <postal>
          <street/>

          <city>Beijing</city>

          <code>100053</code>

          <country>China</country>
        </postal>

        <email>chendanyang@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Pedro Martinez-Julia" initials="P."
            surname="Martinez-Julia">
      <organization>NICT</organization>

      <address>
        <postal>
          <street>4-2-1, Nukui-Kitamachi, Koganei</street>

          <region>Tokyo</region>

          <code>184-8795</code>

          <country>Japan</country>
        </postal>

        <email>pedro@nict.go.jp</email>
      </address>
    </author>

    <author fullname="Qiufang Ma" initials="Q." surname="Ma">
      <organization>Huawei</organization>

      <address>
        <postal>
          <street/>

          <city>Nanjing</city>

          <code>210012</code>

          <country>China</country>
        </postal>

        <email>maqiufang1@huawei.com</email>
      </address>
    </author>

    <date year="2026"/>

    <area>Networking</area>

    <workgroup>Internet Research Task Force</workgroup>

    <keyword>Digital Twin; Network Digital Twin; Data Collection</keyword>

    <abstract>
      <t>A Network Digital Twin is a virtual representation of a real network,
      which is meant to be used by a management system to analyze, diagnose,
      emulate, and then control the real network based on data, models, and
      interfaces. Constructing a Network Digital Twin and keeping its state
      up to date require real-time information (i.e., telemetry data) about
      the physical network it represents. This document describes the data
      collection requirements and presents data collection methods and tools
      for building the data repository used to construct and update a
      Network Digital Twin.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <t>With the deployment of the Internet of Things (IoT), cloud
      computing, data centers, and similar technologies, networks continue to
      grow in scale. This growth also increases the complexity of the
      network and introduces many operational problems. To improve the
      autonomy of the network and reduce potential negative effects on
      physical and virtual networks, we consider an endogenously intelligent
      and autonomous network architecture that achieves self-optimization
      and self-decision (in general, self-management and self-operation) to
      be indispensable. Digital twin technology addresses the challenge of
      building self-management systems because it can optimize and validate
      policies through real-time, interactive mapping with physical entities
      <xref target="I-D.irtf-nmrg-network-digital-twin-arch"/>.</t>

      <t>Data is the cornerstone for constructing a digital twin of a
      network, namely a Network Digital Twin (NDT). At large network scale,
      data collection, storage, and management face great challenges. Data
      collection methods and tools should therefore be target-driven,
      diverse, lightweight, and efficient, while being open and
      standardized. Among these requirements, lightweight and efficient
      data collection is the most important. If full-data collection is
      adopted, huge amounts of storage space and bandwidth are needed,
      especially in complex scenarios that require real-time data and
      traffic from multi-source, heterogeneous devices. It is therefore
      extremely important to agree on lightweight and efficient methods for
      data collection, aggregation, and correlation, which underpin the
      transmission, processing, and storage of the monitoring information
      (telemetry data) required to build an effective NDT system.</t>

      <t>This document describes the data collection requirements and
      proposes efficient data collection methods and tools for building the
      data repository of an NDT.</t>
    </section>

    <section title="Definitions and Acronyms">
      <t>PN: Physical Network</t>

      <t>IMC: Instruction Management Center</t>

      <t>DSC: Data Storage Center</t>

      <t>NDT: Network Digital Twin</t>

      <t>TSE: Telemetry Streaming Element</t>

      <t>RDF: Resource Description Framework</t>

      <t>CEP: Complex Event Processing</t>
    </section>

    <section title="Data Collection Requirements for Network Digital Twin">
      <section title="Target-driven and On-demand Collection">
        <t>The monitoring data of a network is the basis to build an NDT
        system. Such data is collected from physical and virtual networks. It
        includes, but is not limited to, the following types:<list
            style="symbols">
            <t>Provisional and operational status of physical or virtual
            devices, as well as the network topology with all network
            elements.</t>

            <t>Configuration data that is required to transform a network
            system from its initial default state into its current state.</t>

            <t>Running status of physical, logical, or virtual ports and
            links.</t>

            <t>Log and event records of all the network elements.</t>

            <t>Statistics (packet loss, traffic throughput, latency, etc.) of
            flows and ports.</t>

            <t>Various data regarding users and services.</t>

            <t>Life-cycle operation data of all network elements.</t>

            <t>All of the above data as time series.</t>
          </list></t>

        <t>The monitoring information (telemetry data) required to maintain
        an NDT should be collected in a target-driven and on-demand mode. It
        is not always necessary to collect all of the monitoring information
        listed above, because doing so has a high resource cost (CPU, memory,
        bandwidth, etc.). The type, frequency, and method of data collection
        depend on the specific network topology and on the requirements of
        the applications that use the NDT.</t>
      </section>

      <section title="Diverse Tools for Various Data Collection">
        <t>The different types of monitoring information required to maintain
        an NDT (telemetry data) have different characteristics. Some data
        (e.g., hardware status, environmental data) can be collected at a low
        frequency, while other data (e.g., flow status, link faults) must be
        collected close to real time. Some data (e.g., device status, port
        statistics) can be collected directly and simply with common tools,
        while other data (e.g., per-flow latency, traffic matrix) can only be
        acquired through complex network measurement technologies. It is
        unrealistic to find or define a uniform data collection method that
        is suitable for all types of data. Therefore, multiple tools or
        methods are needed to collect the different types of data required to
        build the NDT entity.</t>
      </section>

      <section title="Lightweight and Efficient Collection">
        <t>Data collection tools and methods should be as lightweight as
        possible, so as to reduce the occupation of network equipment
        resources and ensure that data collection does not affect the normal
        operation of the network. The major requirements are listed
        below.<list style="symbols">
            <t>Data collection tools and methods need to execute efficiently
            and reduce the cost of computing, storage, and communication
            bandwidth.</t>

            <t>The collection of redundant data should be avoided or
            minimized.</t>

            <t>For the data that does need to be collected, data compression
            should be used to reduce the resource cost of the collection
            phase. Lossy or lossless compression methods must be available to
            data sources and are applied, together with other functions,
            before data is transmitted.</t>
          </list></t>
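
        <t>As an illustration of the compression requirement above, the
        following Python sketch applies an optional lossy step (rounding)
        followed by lossless compression to a batch of samples before
        transmission. The sample layout, the function names encode_batch and
        decode_batch, and the choice of JSON and zlib are assumptions made
        for this example only and are not mandated by this document.</t>

        <figure>
          <artwork><![CDATA[
```python
import json
import zlib


def encode_batch(samples, decimals=None):
    """Serialize and compress a batch of telemetry samples.

    If decimals is given, values are rounded first (lossy step);
    the zlib stage itself is lossless.
    """
    if decimals is not None:
        samples = [{**s, "value": round(s["value"], decimals)}
                   for s in samples]
    raw = json.dumps(samples, separators=(",", ":"))
    return zlib.compress(raw.encode("utf-8"), 9)


def decode_batch(blob):
    """Invert the lossless stage of encode_batch."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical per-port samples: a timestamp and a measured value.
samples = [{"port": "eth0", "ts": 1700000000 + i,
            "value": 0.5 + (i % 3) * 0.001234}
           for i in range(200)]

raw_size = len(json.dumps(samples, separators=(",", ":")))
lossless = encode_batch(samples)
lossy = encode_batch(samples, decimals=2)
print(raw_size, len(lossless), len(lossy))
```
]]></artwork>
        </figure>

        <t>Whether the lossy step is acceptable depends on the precision that
        the consuming NDT application needs, so in this sketch it is an
        explicit parameter rather than a default.</t>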
      </section>

      <section title="Open and Standardized Interfaces">
        <t>Data collection interfaces used to build the NDT should be open and
        standardized to help avoid hardware or software vendor lock-in and to
        facilitate interoperability among different vendors. The major
        requirements of data collection interfaces are:<list style="symbols">
            <t>Support configuration management, including the data collection
            channel, frequency or period, etc.</t>

            <t>Support several rate options (e.g., minute level, 10-second
            level, second level (near real time), and millisecond level) to
            accommodate the different data requirements of applications.</t>

            <t>Be extensible so that more features can be added in future with
            limited parameter changes and with backward compatibility.</t>

            <t>Provide a secure and reliable information exchange
            mechanism.</t>

            <t>Be able to enforce federation policies so that information can
            be exchanged among domains while authorization and scoping remain
            controlled.</t>
          </list></t>
      </section>

      <section title="Naming for Caching">
        <t>Both raw monitoring information (telemetry data) and knowledge
        items obtained from monitoring must be uniquely addressable. This
        means assigning each data or knowledge item a unique identifier, or
        "name", that references it. Caching mechanisms use this name to store
        the item and to provide it to clients, which request it by the same
        name.</t>

        <t>Global names and federated names must be supported. A name schema,
        name hierarchy, and name part ontology must be defined and maintained
        together with other naming systems, such as DNS for global names.</t>
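
        <t>A minimal Python sketch of name-based caching follows. The
        three-part name layout (domain, element, metric), the class names
        ItemName and NameCache, and the example names are hypothetical; this
        document does not define a name schema.</t>

        <figure>
          <artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemName:
    """Hypothetical hierarchical name: /domain/element/metric."""
    domain: str    # federation domain that owns the item
    element: str   # network element that produced the item
    metric: str    # metric or knowledge item identifier

    def __str__(self):
        return f"/{self.domain}/{self.element}/{self.metric}"

    @classmethod
    def parse(cls, text):
        parts = text.strip("/").split("/")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"malformed name: {text!r}")
        return cls(*parts)


class NameCache:
    """Stores items by name and serves them to requesters by name."""

    def __init__(self):
        self._items = {}

    def put(self, name, item):
        self._items[str(name)] = item

    def get(self, name):
        # Returns None when no item is cached under this name.
        return self._items.get(str(name))


cache = NameCache()
name = ItemName.parse("/domain-a/router1/latency")
cache.put(name, 12.5)
print(cache.get("/domain-a/router1/latency"))
```
]]></artwork>
        </figure>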
      </section>

      <section title="Efficient Multi-Destination Delivery">
        <t>Maintaining NDT systems will not be the sole purpose of
        communicating monitoring information and knowledge. Other applications
        may also request raw monitoring information (telemetry data) or
        knowledge items, using the name to identify each item. The monitoring
        system (telemetry system), following the recommendations of <xref
        target="RFC9232">RFC 9232</xref>, will deliver the requested data or
        knowledge items to the requesters as efficiently as possible. On the
        one hand, items will be served from the cache closest to the
        destination of the data. On the other hand, items will be replicated
        to the most suitable nodes, following an efficient multicast spanning
        tree. Different underlying protocols can be used to implement this
        mechanism.</t>

        <t>Delivering knowledge items instead of raw telemetry data makes
        digital twins aware of the context of the data and relieves them of
        much complex processing, which is instead performed by the entities
        best suited to each type of processing.</t>
      </section>
    </section>

    <section title="Data Collection Technologies for Network Digital Twin">
      <section title="Existing Data Collection Methods/Tools">
        <t>Several widely used tools, such as SNMP, RESTCONF <xref
        target="RFC8040"/>, NETCONF <xref target="RFC6241"/>, Telemetry, INT
        (In-band Network Telemetry), DPI (Deep Packet Inspection), and IPFIX
        <xref target="RFC7011"/>, are candidates for collecting data for an
        NDT. The YANG data models and associated mechanisms defined in <xref
        target="RFC8639"/> and <xref target="RFC8641"/> enable
        subscriber-specific subscriptions to a publisher's event streams and
        allow subscriber applications to request a continuous, customized
        stream of updates from a YANG datastore. Appendix A of <xref
        target="RFC9232"/> surveys existing network telemetry techniques and
        standards for the management plane, control plane, and data
        plane.</t>

        <t>Moreover, some newer methods can help increase data collection
        efficiency. For example, <xref
        target="I-D.ietf-opsawg-collected-data-manifest"/> proposes a YANG
        model to store contextual information along with the collected data in
        order to keep the collected data exploitable. <xref target="RFC9506"/>
        addresses network performance measurement under encrypted transport
        protocols by proposing hybrid measurement methods based on marking
        bits in packet headers, without relying on external network
        management systems. <xref target="RFC7594"/> introduces the
        Large-Scale Measurement of Broadband Performance (LMAP) framework,
        whose components work in a coordinated fashion to perform network
        performance measurement tasks.</t>
      </section>

      <section title="Innovation Directions on Data Collection">
        <t>The data collection methods and tools listed above (YANG, xCONF,
        SNMP, Telemetry, etc.) can help acquire the network data needed to
        build an NDT system, but the resulting system may have low maturity
        and only low-level capabilities for data services and data modelling.
        To build a more mature NDT system with high-level capabilities, it is
        necessary to explore more innovative data collection technologies.
        The following are several potential innovation directions. <list
            style="symbols">
            <t>High-performance data collection technology based on
            programmable circuits, which offer the potential for hardware
            acceleration and customization.</t>

            <t>Measurement methods for complex monitoring information such as
            network performance and network traffic.</t>

            <t>Distributed and collaborative data collection techniques for
            integrating and fusing data from multiple data sources, and the
            time synchronization problem of data acquisition.</t>

            <t>Provision of processed information, jointly and separately, by
            applying the function indicated by the data requester.</t>

            <t>Assessment of federation policies in data provisioning to
            enable cross-domain data provision and implement multi-domain
            digital twin scenarios.</t>

            <t>Investigating self-adaptive and self-learning data collection
            techniques that can dynamically adjust data collection parameters,
            methods, and priorities based on network conditions and user
            requirements.</t>

            <t>Exploring machine learning and AI techniques to enhance the
            efficiency and accuracy of data collection processes by
            identifying patterns, correlations and anomalies in network
            data.</t>
          </list></t>
      </section>
    </section>

    <section title="Knowledge and Instruction Driven Data Collection Method for Network Digital Twin">
      <section title="Overview">
        <t>An NDT's data repository sub-system manages all network data, in
        real time, from the PN to the NDT. Sufficient and timely data are
        always required to construct the twin entity and various data models.
        However, existing methods collect the full data set from the PN for
        modeling and do not address problems such as time lag, insufficient
        storage resources, low computational efficiency, and the bandwidth
        wasted by transmitting all of that data.</t>

        <t>This section proposes an efficient data collection method, named
        "knowledge and instruction driven data collection". This data
        collection method is based on sending instructions to the elements of
        the PN for them to pre-process the data (data cleaning or knowledge
        representation) before sending it back to be applied to the NDT.</t>
      </section>

      <section title="Efficient Data Collection Mechanism">
        <t>The management system structure consists of the PN and the NDT. The
        PN includes multiple Data Storage Centers (DSCs) and a Telemetry
        Streaming Element (TSE), and the NDT includes the Instruction
        Management Center (IMC) and a Data Storage Center (DSC). The TSE has
        multiple functions, including data collection, data aggregation, data
        correlation, and knowledge representation and query. In addition, a
        Complex Event Processing (CEP) engine is integrated into the TSE to
        run queries on the streamed data.</t>

        <t>The IMC has two functions. The first is to manage the registration
        of the DSCs on the PN side. The registration information can include
        key information such as the IP address of the DSC on the PN side, the
        data type, the index names used in the data, the data source name,
        and the data size. The second is to adaptively configure data
        collection instructions according to the collection requirements of
        the DSC on the NDT side and to look up the IP addresses to which the
        instructions are sent.</t>

        <t>The information carried by an instruction can include rule-based
        mathematical expressions, executable models in ".exe" format, a
        dynamic collection frequency, parameter lists, program text files in
        ".m" format, text files with parameter configuration, and other types
        of files. Instructions are flexible and programmable, and can be
        created, modified, combined, and deleted at any time according to
        requirements.</t>

        <t>When the DSC on the NDT side requests data from the IMC, the IMC
        looks up the IP address of the matching DSC in its registration
        database, which is organized by key information such as data type and
        data name. Instructions for data processing or knowledge
        representation are then configured according to the request. The DSC
        on the NDT side stores the effective information returned by the TSE
        after data processing and knowledge representation.</t>
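
        <t>The two IMC functions described above (registration management and
        instruction configuration) could be sketched in Python as follows.
        The class and field names, the list of operations carried by an
        instruction, and the addresses (taken from the documentation range
        192.0.2.0/24) are assumptions made for illustration and do not
        define an interface.</t>

        <figure>
          <artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Registration:
    ip: str          # address the instruction will be sent to
    data_type: str   # kind of data the registered source offers
    source: str      # data source name
    size_bytes: int  # approximate size of the stored data


@dataclass
class Instruction:
    target_ip: str
    data_type: str
    period_s: float   # dynamic collection period, in seconds
    operations: list  # processing steps to apply before sending


class IMC:
    def __init__(self):
        self._registry = []

    def register(self, registration):
        """Function 1: record a registration from the PN side."""
        self._registry.append(registration)

    def configure(self, data_type, period_s, operations):
        """Function 2: build one instruction per matching source."""
        return [Instruction(r.ip, data_type, period_s,
                            list(operations))
                for r in self._registry
                if r.data_type == data_type]


imc = IMC()
imc.register(Registration("192.0.2.10", "latency", "dsc-1", 4096))
imc.register(Registration("192.0.2.11", "log", "dsc-2", 65536))
instructions = imc.configure("latency", 1.0, ["clean", "average"])
print(instructions)
```
]]></artwork>
        </figure>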

        <t>The DSC on the PN side has two functions. First, it stores data of
        various types, such as performance indicators, operational status,
        logs, traffic scheduling, and business requirements. Second, it
        automatically parses the instructions sent by the TSE, configures the
        operating environment that each instruction needs, and performs data
        processing or knowledge representation as the instruction specifies.
        Data processing mainly includes data cleaning, filling in missing
        data, normalization, and conflict verification. Knowledge
        representation refers to representing the original data as a data
        structure that can be used for efficient computation. Such
        representations are similar to machine language, which helps build
        the model quickly and accurately.</t>
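
        <t>Two of the data processing steps named above, filling in missing
        data and normalization, can be sketched in Python as follows. Forward
        filling and min-max scaling are only one possible choice for each
        step, and the function names are hypothetical.</t>

        <figure>
          <artwork><![CDATA[
```python
def fill_missing(values):
    """Replace None by the most recent valid value (forward fill).

    Leading gaps take the first valid value of the series.
    """
    first = next((v for v in values if v is not None), None)
    if first is None:
        return []  # nothing valid to fill from
    filled, last = [], first
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled


def normalize(values):
    """Min-max normalization to the range [0, 1]."""
    if not values:
        return []
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant series
    return [(v - low) / (high - low) for v in values]


raw = [None, 10, None, 30]
clean = fill_missing(raw)
print(clean, normalize(clean))
```
]]></artwork>
        </figure>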

        <figure align="center" anchor="Fig_Data_Collection"
                title="Data Collection Process">
          <artwork align="center">+------------------------------+   +-----------------------+
|   Physical  Network          |   |  Network Digital Twin |
| +-----+    +-----+  +------+ |   |  +------+  +-------+  |
| |     |    |     |  |      | |   |  |      |  |       |  |
| | DSC |... | DSC |  | TSE  | |   |  |  IMC |  |  DSC  |  |
| |     |    |     |  |      | |   |  |      |  |       |  |
| +-+---+    +--+--+  +---+--+ |   |  +---+--+  +----+--+  |
|   |           |         |    |   |      |          |     |
+------------------------------+   +-----------------------+
    |           |         |               |          |
    | 1.1. Register       |               |          |
    +-----------+---------&gt;               |          |
    |           |         |               |          |
    |           | 1.2. Register           |          |
    |           +---------&gt;               |          |
    |           |         | 1.3. Register |          |
    |           |         +---------------&gt;          |
    |           |         |             2. Data req. |
    |           |         |               &lt;----------+
    |           |         | 3. Query and instruction |
    |           |         |    configuration         |
    |           |         |               +          |
    |           |         4. Send instructions       |
    |           |         &lt;---------------+          |
    |           |         |               |          |
    |           |   5. Parse and execute  |          |
    |           |      instruction        |          |
    | 6. Data subscript.  |               |          |
    &lt;---------------------+               |          |
    | 7. Knowledge        |               |          |
    |    representation   |               |          |
    |     8. Data pushing |               |          |
    +---------------------&gt;               |          |
    |           | 9. Data aggregation and |          |
    |           |    correlation          |          |
    |           |         | 10. Send processed data  |
    |           |         +--------------------------&gt;   
    |           |         |               |          |</artwork>
        </figure>
      </section>

      <section title="Data Collection Process">
        <t>The process, whose steps correspond to those shown in <xref
        target="Fig_Data_Collection"/>, is as follows:<list style="numbers">
            <t>Each DSC on the PN side registers with the TSE, and the TSE
            registers with the IMC. Both provide their IP addresses, the data
            type, the data source, the data size, etc.</t>

            <t>The DSC on the NDT side sends the data collection request to
            the IMC.</t>

            <t>According to the data collection request, the IMC queries the
            registration and addressing information and configures the data
            processing instruction.</t>

            <t>The IMC on the NDT side sends the instruction that corresponds
            to the query result to the TSE.</t>

            <t>After receiving the instructions, the TSE parses and executes
            them. The query function can be performed by the CEP engine,
            which receives all monitoring information (telemetry data) and
            processes it with all of the queries provided.</t>

            <t>The TSE sends a data subscription to the DSC on the PN
            side.</t>

            <t>The DSC on the PN side either represents the data semantically
            in RDF form or sends the data in raw form to the TSE, which then
            produces the semantic representation.</t>

            <t>The DSC on the PN side pushes the data or knowledge item to the
            TSE.</t>

            <t>The TSE aggregates and correlates the collected data or
            knowledge items and, according to the actual needs, generates
            aggregated data or knowledge items.</t>

            <t>The TSE sends the resulting data or knowledge items to the DSC
            on the NDT side.</t>
          </list></t>
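
        <t>The semantic representation step of the process above could, for
        instance, emit one telemetry sample as RDF triples in N-Triples
        syntax, as in the Python sketch below. The base URI under
        example.org, the property names, and the function name to_ntriples
        are hypothetical; this document does not define an ontology.</t>

        <figure>
          <artwork><![CDATA[
```python
XSD = "http://www.w3.org/2001/XMLSchema#"
BASE = "https://example.org/ndt/"  # hypothetical vocabulary


def to_ntriples(element, metric, value, timestamp):
    """Represent one sample as a list of N-Triples lines.

    element and metric are assumed to be URI-safe identifiers.
    """
    sample = f"<{BASE}sample/{element}/{metric}/{timestamp}>"
    return [
        f"{sample} <{BASE}element> <{BASE}element/{element}> .",
        f'{sample} <{BASE}metric> "{metric}" .',
        f'{sample} <{BASE}value> "{value}"^^<{XSD}double> .',
        f'{sample} <{BASE}timestamp> '
        f'"{timestamp}"^^<{XSD}integer> .',
    ]


triples = to_ntriples("router1", "latency", 12.5, 1700000000)
print("\n".join(triples))
```
]]></artwork>
        </figure>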
      </section>

      <section title="Query and Aggregation Functions">
        <t>The TSE supports an arbitrary number of query and aggregation
        functions. At a minimum, it will support:<list style="symbols">
            <t>A function to apply a particular calculation to the values
            retrieved from a specified metric for a specified period of time.
            The following calculations must be supported:<list style="symbols">
                <t>Average: Returns the single number resulting from averaging
                all values in the period.</t>

                <t>Maximum: Returns the single number that represents the
                highest value in the period.</t>

                <t>Minimum: Returns the single number that represents the
                lowest value in the period.</t>

                <t>Percentile X: Returns the value at percentile X (from 0,
                which is the minimum, to 100, which is the maximum).</t>

                <t>Moving Average X: Transforms all values of the specified
                period by replacing every value with the average of the
                previous X values (or fewer if there are not enough).</t>

                <t>Filter Previous X: Removes the values that change less than
                X percent from the previous value.</t>

                <t>Filter Average X: Removes the values that change less than
                X percent from the average value.</t>

                <t>Filter Moving Average X Y: Removes the values that change
                less than Y percent from the value of the moving average for X
                previous values.</t>
              </list></t>

            <t>A function to represent the collected values in a semantic
            structure following some ontology, information model, and data
            format (e.g., YANG). This enforces semantic constraints on the
            values, such as rejecting negative values for parameters that
            cannot be negative (e.g., bandwidth usage).</t>

            <t>A function to analyze the collected values to detect a
            provided pattern and, if it is found, trigger a notification that
            another module can use to execute some action.</t>
          </list></t>

        <t>The particular behavior of the three functions will be described in
        a high-level language, such as <xref target="P4"/>, that is
        transformed into the specific code used by the device.</t>
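
        <t>To make the calculations listed above concrete, the following
        Python sketch gives one possible reading of each. It is illustrative
        only. In particular, the nearest-rank percentile, the moving-average
        window that includes the current value, and the comparison of
        "Filter Previous X" against the last kept value are assumptions where
        the descriptions leave room for interpretation. Minimum and Maximum
        map directly to the built-in min and max functions.</t>

        <figure>
          <artwork><![CDATA[
```python
def average(values):
    return sum(values) / len(values)


def percentile(values, x):
    """Value at percentile x: 0 is the minimum, 100 the maximum.

    Uses the nearest rank; other interpolation rules are possible.
    """
    ordered = sorted(values)
    return ordered[round((len(ordered) - 1) * x / 100)]


def moving_average(values, x):
    """Average over a window of up to x values ending at each value."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - x + 1):i + 1]
        out.append(sum(window) / len(window))
    return out


def changed(value, reference, percent):
    """True if value differs from reference by at least percent."""
    if reference == 0:
        return value != 0
    return abs(value - reference) * 100 >= percent * abs(reference)


def filter_previous(values, x):
    """Drop values within x percent of the last value that was kept."""
    kept = []
    for v in values:
        if not kept or changed(v, kept[-1], x):
            kept.append(v)
    return kept


def filter_average(values, x):
    """Drop values within x percent of the period average."""
    avg = average(values)
    return [v for v in values if changed(v, avg, x)]


def filter_moving_average(values, x, y):
    """Drop values within y percent of the mean of x prior values."""
    kept = []
    for i, v in enumerate(values):
        window = values[max(0, i - x):i]
        if not window or changed(v, sum(window) / len(window), y):
            kept.append(v)
    return kept
```
]]></artwork>
        </figure>

        <t>For example, under these assumptions filter_previous([100, 101,
        110, 111, 130], 5) keeps 100, 110, and 130, so only changes of at
        least 5 percent are forwarded.</t>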
      </section>
    </section>

    <section title="Summary">
      <t>This draft describes the requirements for data collection and
      presents the data collection methods and tools required to build the
      data repository for maintaining NDT systems. These methods and tools
      should be target-driven, diverse, lightweight, and efficient, while
      being open and standardized. Among these requirements, lightweight and
      efficient collection is the most important. This draft therefore
      provides a lightweight and efficient data collection method that is
      designed for maintaining NDT systems. Going forward, more methods
      (transformation and aggregation functions) and tools (solutions) will
      be studied to extend the contents of this draft.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>TBD.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document has no requests to IANA.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.8639"?>

      <?rfc include="reference.RFC.8641"?>

      <?rfc include="reference.RFC.9232"?>
    </references>

    <references title="Informative References">
      <?rfc include="reference.RFC.6241"?>

      <?rfc include="reference.RFC.7011"?>

      <?rfc include="reference.RFC.7594"?>

      <?rfc include="reference.RFC.8040"?>

      <?rfc include="reference.RFC.9506"?>

      <?rfc include="reference.I-D.irtf-nmrg-network-digital-twin-arch"?>

      <?rfc include="reference.I-D.ietf-opsawg-collected-data-manifest"?>

      <reference anchor="P4"
                 target="https://p4.org/p4-spec/docs/P4-16-v-1.2.3.html">
        <front>
          <title>P4 Language Specification</title>

          <author>
            <organization>The P4 Language Consortium</organization>
          </author>

          <date day="11" month="July" year="2022"/>
        </front>
      </reference>
    </references>
  </back>
</rfc>
