

Internet Engineering Task Force                              Y. Zhang
Internet Draft                                                   S. He
Intended status: Informational                                  Y. Chen
Expires: August 10, 2022                                      Z. Wang
                                                               H. Xia
                          Xi'an University of Posts & Telecommunications
                                                       February 11, 2022



     Unified representation method of heterogeneous data in industrial
                                 Internet
         draft-zhang-ietf-heterogeneous-data-representation-01.txt



Abstract

   With the advent of the 5G era, sensing devices and mobile Internet
   devices are ubiquitous in smart factories, and a wide variety of
   industrial data from devices in different locations has become
   widely available and interwoven. These data are usually generated
   as streams; they differ greatly in source and structure, are massive
   in scale, and are strongly correlated through complicated
   relationships. This richness makes it harder than ever to extract
   the hidden value of the data quickly, accurately and in depth. Data
   generated in different fields are distributed across a variety of
   business systems and have different structures and forms, so it is
   difficult to analyze them in a unified and efficient way. Based on
   the characteristics of heterogeneous data, this document studies a
   tensor-based method for fusing multi-source heterogeneous data.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 10, 2022.



Zhang, et al.          Expires August 10, 2022                [Page 1]

    Internet-Draft   representation of heterogeneous data     February 2022

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.



Table of Contents


   1. Introduction...................................................2
   2. Unified data representation method.............................3
      2.1. Data Tensor Representation in industrial Internet.........3
         2.1.1. Unstructured Data Tensor Representation..............4
         2.1.2. Semi-structured Data Tensor Representation...........4
         2.1.3. Structured Data Tensor Representation................5
         2.1.4. Subtensor Fusion Method..............................5
         2.1.5. Unified Tensor Fusion Model..........................5
   3. Security Considerations........................................5
   4. IANA Considerations............................................6
   5. Conclusions....................................................6
   6. Informative References.........................................6
   7. Acknowledgments................................................6
   Authors' Addresses................................................6

1. Introduction

   Driven by emerging technologies such as big data, cloud computing and
   the Internet of Things, intelligent manufacturing has become central
   to today's manufacturing industry, with intelligent factories and
   intelligent production at its core. Smart factories that rely on big
   data and Internet of Things technologies must serve large volumes of
   industrial big data.

   Industrial big data shares the "4V" characteristics of big data in
   general, namely volume, variety, value and velocity. It also has
   some further characteristics [1]:




   1. The data sources are extensive, and the proportion of
      semi-structured and unstructured data is increasing.

   2. The data are highly correlated.

   3. Data analysis must take temporal and spatial characteristics
      into account.

   4. The data are specific to industrial scenarios.

   Big data mining aims to gain knowledge from large amounts of complex
   data and to create new value. In industrial production, data come
   from sensors, intelligent devices and workstations, and include
   process data from production control systems, operation monitoring
   data, and log records from production workshops distributed across
   different geographical locations. Part of this data is structured.
   At the same time, there is a large amount of unstructured and
   semi-structured data, such as text, log files, sound, images and
   video. Because the data differ in structure, attributes and
   standards, transforming them into knowledge requires data
   warehousing, online analytical processing and data mining
   techniques. Traditional data storage and management methods are
   oriented to relational structured data and can hardly meet such a
   large and diverse demand for unstructured data analysis. In
   addition, data systems are independent of each other: differences
   in data sources, production equipment, software and component
   providers lead to different data formats, which makes information
   integration difficult to achieve [2]. To address these challenges, a
   uniform data format specification is required.

2. Unified data representation method

2.1. Data Tensor Representation in industrial Internet

   An industrial streaming big data processing framework is composed of
   three parts: data collection and representation, streaming data
   processing and storage, and data analysis and services. Data
   collection and uniform representation are therefore prerequisites
   for the subsequent steps.

   During big data acquisition, a variety of sensing devices collect
   unstructured, semi-structured and structured data in different
   fields, form a data stream and submit it to the edge computing layer
   for tensor representation. The source data format is not changed
   during submission.






   The edge computing layer consists of edge nodes with computing
   capability, for example camera networks, sensor networks, smart air
   conditioners and smart TVs in a smart factory. These edge nodes
   collect data from Internet of Things terminals and provide a certain
   amount of computing power and data storage. Based on the work in
   [3], different types of data from different spaces can be
   constructed into corresponding sub-tensors at the edge nodes using a
   tensor model. These sub-tensors usually differ in dimensions and
   characteristics and are independent of each other. To enable
   association analysis and deep mining of the overall data, the
   low-order sub-tensors are combined using a tensor extension
   operator, and the different data features are arranged into tensor
   spaces of different orders. Finally, a unified high-order
   representation model of the big data is established.

   The following uses the inspection video (unstructured data), the
   related inspection logs (semi-structured data) and the relevant
   index table (structured data) recorded by a vision-based solder
   joint defect detection platform for surface mount technology (SMT)
   as an example to introduce the data tensor representation method in
   edge computing.

2.1.1. Unstructured Data Tensor Representation

   For the sub-tensor representation of unstructured video data, take
   the solder joint defect detection video from the vision-based SMT
   intelligent terminal platform as an example. The main features of
   video data are the time frame, the frame image width, the frame
   image height and the frame image color. Video data in MP4 format can
   therefore be represented as a fourth-order sub-tensor in low-order
   space. The element values of the tensor are the encoded video data,
   and the time frame, frame width, frame height and color gamut of the
   image each correspond to a different order.
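
   As an illustration only, the following Python sketch builds a small
   synthetic fourth-order tensor with one order each for frame, height,
   width and color channel. The helper names, the sizes, and the use of
   decoded pixel intensities as element values are assumptions made for
   this example; this document does not prescribe a concrete layout.

```python
# Hypothetical sketch: a tiny synthetic "video" held as a fourth-order
# tensor indexed as video[frame][y][x][channel]. Sizes are assumptions.

def make_video_tensor(frames, height, width, channels=3):
    """Build a zero-filled nested list with four orders."""
    return [[[[0] * channels for _ in range(width)]
             for _ in range(height)]
            for _ in range(frames)]

def shape(tensor):
    """Return the dimension of every order of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

video = make_video_tensor(frames=2, height=4, width=6)
video[1][2][3][0] = 255  # first channel of pixel (y=2, x=3) in frame 1

print(shape(video))  # (2, 4, 6, 3)
```

   Each feature named in the text maps to exactly one index position,
   so a single element is addressed by four coordinates.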

2.1.2. Semi-structured Data Tensor Representation

   For the sub-tensor representation of semi-structured log data, take
   the solder joint detection log produced by vision-based SMT solder
   joint defect detection on the intelligent terminal platform as an
   example. A database of semi-structured data is a set of nodes, each
   of which is either a leaf or an internal node. Every semi-structured
   data set has a hierarchy that can be decomposed into a tree
   structure. The solder joint detection log can be expressed as a
   third-order sub-tensor in low-order space. The row of the
   identification matrix, the column of the identification matrix and
   the encoding of the element correspond to the three orders of the
   tensor, respectively.
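
   One possible reading of this layout is sketched below in Python: a
   log record is flattened into a tree of numbered nodes, the row and
   column orders index parent and child nodes, and the third order
   holds a fixed-width byte encoding of the child element. The record
   contents, CODE_LEN and all function names are hypothetical, and the
   parent/child interpretation of the identification matrix is an
   assumption rather than something this document specifies.

```python
CODE_LEN = 8  # assumed fixed width of an element encoding, in bytes

def flatten(tree, nodes, edges, parent=None):
    """Number every node depth-first and record (parent, child) edges."""
    for key, value in tree.items():
        node_id = len(nodes)
        nodes.append(key)
        if parent is not None:
            edges.append((parent, node_id))
        if isinstance(value, dict):
            flatten(value, nodes, edges, node_id)   # internal node
        else:
            leaf_id = len(nodes)                    # leaf holding a value
            nodes.append(str(value))
            edges.append((node_id, leaf_id))

def encode(label):
    """Encode a label as CODE_LEN byte values, truncated or zero-padded."""
    raw = label.encode("utf-8")[:CODE_LEN]
    return list(raw) + [0] * (CODE_LEN - len(raw))

def log_to_tensor(tree):
    """Return the node list and a tensor indexed [row][column][code]."""
    nodes, edges = [], []
    flatten(tree, nodes, edges)
    n = len(nodes)
    tensor = [[[0] * CODE_LEN for _ in range(n)] for _ in range(n)]
    for parent, child in edges:
        tensor[parent][child] = encode(nodes[child])
    return nodes, tensor

# Hypothetical log record, already parsed into nested dictionaries.
record = {"log": {"board": "B01",
                  "joint": {"id": "J7", "state": "defect"}}}
nodes, log_tensor = log_to_tensor(record)
print(len(log_tensor), len(log_tensor[0]), len(log_tensor[0][0]))
```

   Positions without a parent/child relation stay all-zero, so the tree
   structure can be read back from the non-zero cells.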





2.1.3. Structured Data Tensor Representation

   For the sub-tensor representation of a structured database table,
   take the attribute detection record form produced by vision-based
   SMT solder joint defect detection on the intelligent terminal
   platform as an example. Structured data are data that are logically
   expressed and implemented as a two-dimensional table, and are mainly
   managed and stored in a relational database. In a simple database
   table, a field is often represented by a number or a character, so
   the table can be represented as a matrix. More complex field types
   can be represented as a tensor by adding new orders. The attribute
   detection record form can be expressed as a fifth-order sub-tensor
   in low-order space.

   ID, date, record, num, state, and errornum represent different orders
   of the fifth-order tensor, respectively.
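
   The simple case, in which every field is a number or a short
   string, can be sketched as follows. The rows are invented sample
   values, and table_to_matrix() together with its first-seen integer
   coding of string fields is a hypothetical helper, not part of the
   method defined here.

```python
FIELDS = ("ID", "date", "record", "num", "state", "errornum")

# Invented sample rows; real records would come from the database.
rows = [
    {"ID": 1, "date": "2022-02-10", "record": "A", "num": 120,
     "state": "ok", "errornum": 0},
    {"ID": 2, "date": "2022-02-10", "record": "B", "num": 118,
     "state": "defect", "errornum": 2},
    {"ID": 3, "date": "2022-02-11", "record": "A", "num": 121,
     "state": "ok", "errornum": 0},
]

def table_to_matrix(rows, fields):
    """Encode a table as a matrix of integers, one row per record."""
    codebooks = {}  # field name -> {string value: integer code}
    matrix = []
    for row in rows:
        encoded = []
        for field in fields:
            value = row[field]
            if isinstance(value, int):
                encoded.append(value)          # numbers are kept as-is
            else:
                book = codebooks.setdefault(field, {})
                # a new string gets the next free code for its field
                encoded.append(book.setdefault(value, len(book)))
        matrix.append(encoded)
    return matrix, codebooks

matrix, codebooks = table_to_matrix(rows, FIELDS)
for line in matrix:
    print(line)
```

   The codebooks allow the original string values to be recovered from
   the matrix when the data are analyzed later.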

2.1.4. Subtensor Fusion Method

   The tensor fusion extension operator is defined first. It extends
   the order of the existing tensor space in different directions. When
   new heterogeneous data are added, each new feature is added to the
   original tensor space as a new order; if the feature order already
   exists, it is extended along its dimension instead. In practice,
   the heterogeneous data are first expressed as low-order
   sub-tensors and then integrated into a higher-order tensor space by
   the extension operator, which yields a uniform representation of
   the heterogeneous data.
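
   A minimal sketch of this behaviour at the level of tensor shapes is
   given below. Each sub-tensor is described only by a mapping from
   order name to dimension; the order names, the sizes, and the choice
   to grow a shared order to the larger of the two dimensions are
   assumptions for illustration, since the operator is not defined
   formally in this document.

```python
from functools import reduce

def extend(space, sub_tensor):
    """Fuse the orders of sub_tensor into the tensor space.

    Both arguments map an order name to its dimension.
    """
    fused = dict(space)
    for order, dim in sub_tensor.items():
        if order in fused:
            # existing feature order: extend its dimension (assumed: max)
            fused[order] = max(fused[order], dim)
        else:
            # new feature: add it as a new order
            fused[order] = dim
    return fused

# Hypothetical shapes for the three sub-tensors of the running example.
video = {"time": 300, "width": 640, "height": 480, "color": 3}
log = {"row": 8, "column": 8, "code": 8}
table = {"time": 500, "record": 4, "num": 256, "state": 2,
         "errornum": 16}

unified = reduce(extend, [video, log, table], {})
print(len(unified), unified["time"])  # 11 500
```

   Only the "time" order is shared here, so the fused space has eleven
   orders rather than twelve.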

2.1.5. Unified Tensor Fusion Model

   To reduce data redundancy and duplication, the sub-tensors are
   converted into a unified tensor using a unified data tensor
   function. When two tensors share an order for the same property,
   the order with the finer granularity is retained, while the orders
   for different properties are all kept. The structured,
   semi-structured and unstructured data are first represented in
   low-order sub-tensor space and then fused into unified higher-order
   tensors by the tensor extension operator. These tensors correspond
   to unified variable data structures in computer systems.
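
   How fused data might be held in a single variable structure can be
   sketched with a sparse layout, in which each entry is keyed by one
   coordinate per order and None marks an order that the source
   sub-tensor does not have. The fuse() function, the dictionary layout
   and the sample entries are hypothetical; granularity handling is
   omitted, and this is not the unified data tensor function itself.

```python
def fuse(*sub_tensors):
    """Merge sparse sub-tensors over the union of their orders."""
    orders = []
    for tensor in sub_tensors:
        for order in tensor["orders"]:
            if order not in orders:
                orders.append(order)  # a shared property keeps one order
    entries = {}
    for tensor in sub_tensors:
        for coords, value in tensor["entries"].items():
            named = dict(zip(tensor["orders"], coords))
            # None fills every order this sub-tensor does not define
            key = tuple(named.get(order) for order in orders)
            entries[key] = value
    return {"orders": tuple(orders), "entries": entries}

# Hypothetical sparse sub-tensors: one pixel and one table cell.
video = {"orders": ("time", "x", "y", "color"),
         "entries": {(0, 3, 2, 0): 255}}
table = {"orders": ("time", "state"),
         "entries": {(0, 1): 1}}

unified = fuse(video, table)
print(unified["orders"])  # ('time', 'x', 'y', 'color', 'state')
```

   Because "time" appears in both inputs, it occupies a single position
   in every key, which is what allows entries from different sources to
   be correlated along that order.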

   After the heterogeneous data is represented by unified tensor fusion,
   the subsequent streaming data analysis and processing applications
   can be carried out.

3. Security Considerations

   In unsupervised or harsh environments, edge computing nodes may
   produce counterfeit data that change the overall fusion result and
   affect the accuracy and reliability of the final results. The
   sensors attached to edge computing nodes therefore play an important
   role in the fusion results and need to be protected from attacks
   throughout the whole process.

4. IANA Considerations

   This document has no actions for IANA.

5. Conclusions

   In industrial production, the terminal data collected by edge
   computing nodes vary in structure and form and include structured,
   semi-structured and unstructured data. To mine the value hidden in
   these data, a corresponding sub-tensor is constructed for each type
   of data using the tensor model, and the tensor extension operator
   is then used to combine these sub-tensors into a unified high-order
   tensor model of the big data.

6. Informative References

   [1]  W. Jianmin, "Survey on industrial big data", Big Data Research,
         vol. 3, no. 6, pp. 3-14, 2017.

   [2]  Wentao, H. E., and C. Shao, "The development and challenges of
         industrial big data analysis technology", Information and
         Control, vol. 47, no. 4, pp. 398-410, 2018.

   [3]  Kuang, L., Hao, ., Yang, L. T., Lin, M., Luo, C., and G. Min,
         "A tensor-based approach for big data representation and
         dimensionality reduction", IEEE Transactions on Emerging
         Topics in Computing, vol. 2, no. 3, pp. 280-291, 2017.

7. Acknowledgments

   TBD.

Authors' Addresses

   Yaqian Zhang
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: zhangyaqian0701@126.com







   Shengsheng He
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: 513286954@qq.com


   Yanping Chen
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: chenyp@xupt.edu.cn


   Zhongmin Wang
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: zmwang@xupt.edu.cn


   Hong Xia
   Xi'an University of Posts & Telecommunications
   Shaanxi
   China

   Email: xiahong@xupt.edu.cn




















