Internet Engineering Task Force Y. Zhang
Internet Draft S. He
Intended status: Informational Y. Chen
Expires: August 11, 2022 Z. Wang
H. Xia
Xi'an University of Posts & Telecommunications
February 11, 2022
Unified representation method of heterogeneous data in industrial
Internet
draft-zhang-ietf-heterogeneous-data-representation-01.txt
Abstract
With the advent of the 5G era, sensing devices and mobile Internet
devices are everywhere in smart factories, and a variety of
industrial data from devices in different locations becomes widely
available and interwoven. These data are usually generated as
streams, with large differences in source and structure, massive
scale, strong correlation and complicated relationships. This
richness makes it harder than ever to mine the value hidden in the
data quickly, accurately and in depth. The data generated in
different fields are distributed across a variety of business
systems and differ in structure and form, so it is difficult to
analyze them efficiently in a unified way. Based on the
characteristics of heterogeneous data, this document studies a
tensor-based fusion method for multi-source heterogeneous data.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 10, 2022.
Zhang, et al. Expires August 10, 2022 [Page 1]
Internet-Draft representation of heterogeneous data February 2022
Copyright Notice
Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
2. Unified data representation method
2.1. Data Tensor Representation in industrial Internet
2.1.1. Unstructured Data Tensor Representation
2.1.2. Semi-structured Data Tensor Representation
2.1.3. Structured Data Tensor Representation
2.1.4. Subtensor Fusion Method
2.1.5. Unified Tensor Fusion Model
3. Security Considerations
4. IANA Considerations
5. Conclusions
6. Informative References
7. Acknowledgments
Authors' Addresses
1. Introduction
Driven by emerging technologies such as big data, cloud computing and
the Internet of Things, intelligent manufacturing is the trend of the
manufacturing industry in today's world, with the intelligent factory
and intelligent production at its core. Smart factories that rely on
big data and Internet of Things technologies provide services for a
large volume of industrial big data.
Industrial big data shares the 4V characteristics of big data in
general, namely volume, variety, value and velocity. In addition, it
has some other characteristics [1]:
1. The data sources are extensive, and the proportion of semi-
structured and unstructured data increases.
2. There is a high correlation between the data.
3. Data analysis should consider the characteristics of time and
space.
4. These data are specific to industrial scenarios.
Big data mining aims to gain knowledge from large amounts of complex
data and create new value. In the industrial production process, data
comes from sensors, intelligent devices and workstations, and
includes production process data from production control systems,
operation monitoring data, and log records of production workshops
distributed in different geographical locations. This data belongs to
the category of structured data. At the same time, there is a large
amount of unstructured and semi-structured data, such as text, log
files, sound, images and video. Because the data differ in structure,
attributes and standards, data warehouse, online analytical
processing and data mining techniques are needed to transform the
data into knowledge.
Traditional data storage and management methods are oriented to
relational structured data, so they can hardly meet such a large and
diversified demand for unstructured data analysis. In addition, the
data systems are independent of each other because data sources,
production equipment and software differ. Factors such as the
diversity of component providers lead to different data formats,
which makes information integration difficult to achieve [2]. To
address these challenges, a uniform data format specification is
required.
2. Unified data representation method
2.1. Data Tensor Representation in industrial Internet
An industrial streaming big data processing framework is composed of
three parts: data collection and representation, streaming data
processing and storage, and data analysis and services. Data
collection and uniform representation are therefore prerequisites
for the later steps.
In the process of big data acquisition, a variety of sensing devices
collect unstructured, semi-structured and structured data in
different fields, form a data stream and submit it to the edge
computing layer for tensor representation. The source data format is
not changed during submission.
The edge computing layer consists of edge computing nodes with
computing capability, for example camera networks, sensor networks,
smart air conditioners and smart TVs in a smart factory. These edge
nodes collect data from Internet of Things terminals and provide a
certain amount of computing power and data storage. Following [3],
the different types of data that arrive at the edge nodes from
different spaces can be constructed into corresponding sub-tensors
using a tensor model. These sub-tensors usually differ in dimensions
and characteristics and are independent of each other. To support
association analysis and deep mining over the data as a whole, the
low-order sub-tensors are combined with a tensor extension operator,
which arranges the different data features into different orders of
the tensor space. The result is a unified high-order representation
model of big data.
The following example introduces the data tensor representation
method in edge computing. It uses the detection video (unstructured
data), the related detection logs (semi-structured data) and the
relevant index table (structured data) recorded on a vision-based
solder joint defect detection platform for surface mount technology
(SMT).
2.1.1. Unstructured Data Tensor Representation
This section describes the sub-tensor representation of unstructured
video data, using the solder joint defect detection video from the
vision-based SMT inspection on the intelligent terminal platform as
an example. The main features of video data are the time frame, the
frame image width, the frame image height and the frame image color.
Video data in MP4 format can therefore be represented as a
fourth-order sub-tensor in low-order space. The element values of the
tensor are the encoded video data. The time frame, the width of each
frame, the height of each frame and the color of the image each
correspond to a different order.
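The fourth-order layout described above can be sketched with a NumPy
array. This is only an illustration under stated assumptions: the
axis order (time, height, width, color), the sizes and the random
pixel values are hypothetical, and decoding a real MP4 file is left
out.

```python
import numpy as np

# Hypothetical sizes for a short decoded clip (not taken from the draft).
FRAMES, HEIGHT, WIDTH, CHANNELS = 8, 48, 64, 3

# Stand-in for decoded pixel data: random values, not a real video.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(FRAMES, HEIGHT, WIDTH, CHANNELS),
                     dtype=np.uint8)

# One name per order; the ordering of the axes is an assumption.
ORDER_NAMES = ("time", "height", "width", "color")


def describe(tensor, names):
    """Map each named order to its dimension (size along that axis)."""
    return dict(zip(names, tensor.shape))


orders = describe(video, ORDER_NAMES)

# Fixing one index along an order yields a lower-order tensor.
first_frame = video[0]        # third-order: height x width x color
first_channel = video[..., 0]  # third-order: time x height x width
```

Naming the orders explicitly is a design choice of this sketch; it
makes it possible to tell later whether two sub-tensors share a
feature order.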
2.1.2. Semi-structured Data Tensor Representation
This section describes the sub-tensor representation of
semi-structured log data, using the solder joint detection log
produced by the vision-based SMT solder joint defect detection on the
intelligent terminal platform as an example. A database of
semi-structured data is a set of nodes, each of which is either a
leaf node or an internal node. Each semi-structured data set has a
hierarchy that can be decomposed into a tree structure. The solder
joint detection log can be expressed as a third-order sub-tensor in
low-order space. The row of the identification matrix, the column of
the identification matrix and the encoding of the element are the
three orders of this tensor.
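One possible reading of this third-order layout is sketched below.
The draft does not define the identification matrix or the element
encoding in detail, so several points are assumptions of this sketch:
nodes are numbered in preorder, entry (row, column) is used when the
column node is a child of the row node, and the element is encoded as
zero-padded UTF-8 bytes of a fixed length ENC_LEN. The log content is
hypothetical.

```python
import numpy as np

ENC_LEN = 16  # assumed fixed size of the element-encoding order

# Hypothetical detection log entry, already parsed into a tree.
log_tree = {"board": "B17", "joint": {"id": "J3", "state": "bridged"}}


def flatten(label, value, nodes, edges, parent=None):
    """Number nodes in preorder and record (parent, child) edges."""
    index = len(nodes)
    is_leaf = not isinstance(value, dict)
    nodes.append(f"{label}={value}" if is_leaf else label)
    if parent is not None:
        edges.append((parent, index))
    if not is_leaf:
        for child_label, child_value in value.items():
            flatten(child_label, child_value, nodes, edges, index)
    return nodes, edges


def encode(text, length=ENC_LEN):
    """Encode text as a fixed-length vector of zero-padded UTF-8 bytes."""
    raw = text.encode("utf-8")[:length].ljust(length, b"\0")
    return np.frombuffer(raw, dtype=np.uint8)


nodes, edges = flatten("log", log_tree, [], [])

# Orders: row (parent node), column (child node), element encoding.
log_tensor = np.zeros((len(nodes), len(nodes), ENC_LEN), dtype=np.uint8)
for parent, child in edges:
    log_tensor[parent, child] = encode(nodes[child])

# Reading an element back from the tensor.
decoded = log_tensor[2, 4].tobytes().rstrip(b"\0").decode("utf-8")
```

Here the non-zero fibers along the third order mark the tree edges, so
the hierarchy and the element values are both recoverable from the
tensor.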
2.1.3. Structured Data Tensor Representation
This section describes the sub-tensor representation of a structured
database table, using the attribute detection record form of the
vision-based SMT solder joint defect detection on the intelligent
terminal platform as an example. Structured data is data that is
logically expressed and implemented by a two-dimensional table
structure, and is mainly managed and stored in a relational database.
In a simple database table, a field is often represented by a number
or a character string, so the table can be represented as a matrix.
More complex field types can be represented as a tensor by adding new
orders. The attribute detection record form can be expressed as a
fifth-order sub-tensor in low-order space.
ID, date, record, num, state, and errornum represent different orders
of the fifth-order tensor, respectively.
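The simple case mentioned above, in which every field is a number or
a character string and the table becomes a matrix, can be sketched as
follows. The field names come from this section; the sample rows and
the rule that maps each distinct string to an integer code are
assumptions made for illustration. The sketch does not attempt the
fifth-order form.

```python
import numpy as np

FIELDS = ("ID", "date", "record", "num", "state", "errornum")

# Hypothetical rows of the attribute detection record form.
rows = [
    (1, "2022-01-10", "R-001", 120, "ok", 0),
    (2, "2022-01-10", "R-002", 118, "defect", 3),
    (3, "2022-01-11", "R-003", 121, "ok", 0),
]


def encode_column(values):
    """Turn one column into numbers.

    Numeric fields pass through unchanged; character fields get one
    integer code per distinct value (an assumed encoding).
    """
    if all(isinstance(v, (int, float)) for v in values):
        return [float(v) for v in values], None
    codebook = {v: i for i, v in enumerate(sorted(set(values)))}
    return [float(codebook[v]) for v in values], codebook


columns, codebooks = [], {}
for name, values in zip(FIELDS, zip(*rows)):
    encoded, codebook = encode_column(values)
    columns.append(encoded)
    if codebook is not None:
        codebooks[name] = codebook

# Second-order tensor: one row per record, one column per field.
table_matrix = np.array(columns).T
```

Keeping the codebooks alongside the matrix allows the character
fields to be decoded again after analysis.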
2.1.4. Subtensor Fusion Method
The tensor extension operator is defined first. It extends the orders
of the existing tensor space in new directions. When new
heterogeneous data is added, it enters the original tensor space as a
new feature order. If that feature order already exists, the order is
extended in dimension instead. In practice, the heterogeneous data is
first expressed as low-order sub-tensors, which the extension
operator then integrates into a higher-order tensor space to achieve
a uniform representation of heterogeneous data.
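A minimal sketch of such an extension step is given below. It is not
the operator of [3]; it only tracks named orders and their
dimensions. The function names extend_space and embed, the order
names and the sizes are hypothetical, and taking the larger size when
an order already exists is an assumption about what "extended in
dimension" means.

```python
import numpy as np


def extend_space(space, sub_orders):
    """Return the tensor space after adding a sub-tensor's named orders."""
    fused = dict(space)
    for name, size in sub_orders.items():
        # A new feature order is appended; an existing order is
        # extended in dimension (larger size kept, by assumption).
        fused[name] = max(fused.get(name, 0), size)
    return fused


def embed(sub, names, space):
    """Align a sub-tensor with the fused space.

    Orders the sub-tensor lacks become axes of length 1, so the result
    has one axis per order of the fused space.
    """
    present = [n for n in space if n in names]
    arr = np.transpose(sub, [names.index(n) for n in present])
    shape = [arr.shape[present.index(n)] if n in present else 1
             for n in space]
    return arr.reshape(shape)


# Placeholder sub-tensors with hypothetical sizes.
video = np.zeros((4, 6, 8, 3))
video_names = ["time", "height", "width", "color"]
log = np.zeros((5, 5, 16))
log_names = ["row", "column", "encoding"]

space = {}
for tensor, names in ((video, video_names), (log, log_names)):
    space = extend_space(space, dict(zip(names, tensor.shape)))

video_in_space = embed(video, video_names, space)
log_in_space = embed(log, log_names, space)

# An order that already exists is extended rather than duplicated.
longer = extend_space(space, {"time": 10})
```

The length-1 axes let both sub-tensors be addressed as members of the
same seventh-order space without allocating the full dense tensor.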
2.1.5. Unified Tensor Fusion Model
To reduce data redundancy and duplication, the sub-tensors are
converted into a unified tensor using a unified data tensor function.
When two tensors share an order for the same property, the order with
the finer granularity is retained, and the orders for different
properties are all kept. The structured, semi-structured and
unstructured data are first represented in low-order sub-tensor
space, and are then fused into a unified higher-order tensor by the
tensor extension operator. This tensor corresponds to a unified,
variable data structure in a computer system.
Once the heterogeneous data has been represented by unified tensor
fusion, the subsequent streaming data analysis and processing
applications can be carried out.
3. Security Considerations
In unsupervised or harsh environments, edge computing nodes may
produce counterfeit data to change the overall fusion results and
affect the accuracy and reliability of the final results. Therefore,
sensors in edge computing nodes play an important role in the fusion
results, and they need to be protected from attacks in the whole
process.
4. IANA Considerations
This document has no actions for IANA.
5. Conclusions
In the process of industrial production, the terminal data collected
by edge computing nodes varies in structure and form, and includes
structured, semi-structured and unstructured data. To mine the value
hidden in this data, corresponding sub-tensors are constructed for
the different types of data using a tensor model, and the tensor
extension operator then combines these sub-tensors into a unified
high-order tensor model of big data.
6. Informative References
[1] W. Jianmin, "Survey on industrial big data", Big Data Research,
vol. 3, no. 6, pp. 3-14, 2017.
[2] Wentao, H. E., and C. Shao, "The development and challenges of
industrial big data analysis technology", Information and
Control, vol. 47, no. 4, pp. 398-410, 2018.
[3] Kuang, L., Hao, ., Yang, L. T., Lin, M., Luo, C., and G.
Min, "A tensor-based approach for big data representation and
dimensionality reduction", IEEE Transactions on Emerging
Topics in Computing, vol. 2, no. 3, pp. 280-291, 2017.
7. Acknowledgments
TBD.
Authors' Addresses
Yaqian Zhang
Xi'an University of Posts & Telecommunications
Shaanxi
China
Email: zhangyaqian0701@126.com
Shengsheng He
Xi'an University of Posts & Telecommunications
Shaanxi
China
Email: 513286954@qq.com
Yanping Chen
Xi'an University of Posts & Telecommunications
Shaanxi
China
Email: chenyp@xupt.edu.cn
Zhongmin Wang
Xi'an University of Posts & Telecommunications
Shaanxi
China
Email: zmwang@xupt.edu.cn
Hong Xia
Xi'an University of Posts & Telecommunications
Shaanxi
China
Email: xiahong@xupt.edu.cn