COINRG C. Li, H. Yang, Z. Sun
Internet-Draft Beijing University of Posts and Telecommunications
Intended status: Standards Track S. Liu
Expires: 20 July 2024 China Mobile Research Institute
H. Zheng
Huawei Technologies
21 January 2024
Distributed Learning Architecture based on Edge-cloud Collaboration
draft-li-coinrg-distributed-learning-architecture-02
Abstract
This document describes a distributed learning architecture based on
edge-cloud collaboration, in which model training tasks are split
between edge devices and cloud nodes and kept synchronized by
deterministic network technology.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 20 July 2024.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1.  Introduction
  1.1.  Requirements Language
2.  Scenarios
  2.1.  Intelligent Transportation
  2.2.  Smart Factory
3.  Problem Statement
4.  Distributed Learning Architecture based on Edge-cloud
    Collaboration
  4.1.  Model Splitting
  4.2.  Distributed Learning Architecture based on Edge-cloud
        Collaboration
5.  Manageability Considerations
6.  Security Considerations
7.  IANA Considerations
8.  References
Acknowledgments
Authors' Addresses
1. Introduction
The rapid growth of Internet of Things (IoT) and social networking
applications has led to exponential growth in the data generated at
the edge of the network. A single edge node cannot process this data
fast enough to meet the needs of IoT services. Edge-cloud
collaboration technology has emerged to address this need, offloading
some computing tasks from the edge to the cloud. Service latency
consists of edge-side computing latency and service transmission
latency. In distributed training based on edge-cloud collaboration,
these latencies are crucial to model quality because they affect the
synchronization of training. Bounding these two delays has therefore
become a key factor in improving the quality of the model.
A distributed learning architecture based on edge-cloud collaboration
addresses the above problems. Training tasks are flexibly deployed to
edge devices and cloud devices through model parallelism,
deterministic network technology is used to keep the edge training
delay and the model transmission delay uniform, and distributed
training technology is then used to generate a unified model.
1.1. Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
2. Scenarios
In recent years, with the combination of edge computing and AI, edge
AI has gradually become a new means of intelligent transformation due
to its small traffic footprint, low latency, and privacy-preserving
features. Distributed edge model training is a primary means of
achieving edge intelligence.
2.1. Intelligent Transportation
Urban traffic intelligence has led to growth in the variety of
terminals and a large rise in the demand for real-time processing of
enormous volumes of data.

For instance, the HD traffic surveillance cameras at a single
intersection produce tens of gigabytes of video files every day.
Across a street, a region, or even a city, the amount of data
generated is enormous, yet the video content that is actually useful
and needs to be recorded as evidence of illegal activity makes up
only a very small portion of that data. By analyzing violations
locally, performing intelligent processing in the field, and
filtering valuable content for upload, edge AI systems can
significantly reduce the bandwidth and storage consumed by irrelevant
content. However, it is challenging to train a high-precision AI
model, because the amount of useful data that a single node can
collect is extremely restricted and the computing capacity of the
nodes themselves is extremely constrained. These challenges can be
effectively addressed by the distributed collaborative training
method described in this document.
2.2. Smart Factory
In the field of industrial manufacturing, edge AI will play an
increasingly important role in the development of smart factories.
Driven by the Industry 4.0 model, smart factories will apply advanced
robotics and machine learning technologies to software services and
the industrial IoT to improve production capacity and maximize
productivity. Edge AI uses a variety of sensors to issue control and
management commands, significantly improving control efficiency and
reducing errors. Edge AI computers can independently and autonomously
respond to inputs within milliseconds, either making adjustments to
fix a problem or immediately stopping the production line to prevent
a serious safety incident. However, the limited computing power of
in-plant edge computers makes it difficult to train models with high
accuracy. This problem can be effectively addressed by federated
learning and collaborative training.
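The aggregation at the heart of federated learning can be illustrated
with a minimal, non-normative sketch of federated averaging (FedAvg),
in which the global model is the sample-count-weighted mean of the
local model weights. The function name and the use of plain Python
lists in place of real tensors are illustrative assumptions, not part
of this document.

   <CODE BEGINS>
   # Non-normative sketch: the global weights are the sample-
   # count-weighted average of the per-device local weights.
   def federated_average(local_weights, sample_counts):
       total = sum(sample_counts)
       dim = len(local_weights[0])
       global_weights = [0.0] * dim
       for weights, count in zip(local_weights, sample_counts):
           for i in range(dim):
               global_weights[i] += weights[i] * (count / total)
       return global_weights

   # Example: two edge devices with unequal data volumes.
   print(federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300]))
   # -> [2.5, 3.5]
   <CODE ENDS>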
3. Problem Statement
The computing power of an edge node is small and cannot support model
training on large amounts of data. Therefore, distributed training
based on edge-cloud computing power coordination has become an
important means of realizing edge intelligence.

In order to obtain good training results, distributed training based
on edge-cloud collaboration requires deterministic performance from
the underlying optical network. The synchronization of distributed
training is achieved through this deterministic performance: the edge
training delay and the model transmission delay must be kept
synchronized. This requires guarantees on several quality factors,
such as computing resources, end-to-end delay, delay jitter, and
bandwidth, all of which can be provided by deterministic optical
networks.
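As a back-of-the-envelope illustration of how these quality factors
interact (the numbers below are assumptions, not recommendations),
the per-round deadline bounds the sum of training delay and
transmission delay, so the bandwidth reserved on the deterministic
network must be at least the payload size divided by the remaining
transmission budget:

   <CODE BEGINS>
   # Non-normative sketch: derive the bandwidth reservation from
   # an assumed round deadline, training delay, and payload size.
   def required_bandwidth_bps(payload_bytes, tx_budget_s):
       return payload_bytes * 8 / tx_budget_s

   round_deadline_s = 1.0     # assumed per-round deadline
   training_delay_s = 0.8     # assumed edge computing latency
   tx_budget_s = round_deadline_s - training_delay_s
   payload_bytes = 5 * 2**20  # assumed 5 MiB of model data

   print(required_bandwidth_bps(payload_bytes, tx_budget_s) / 1e6,
         "Mbit/s")            # -> ~209.7 Mbit/s
   <CODE ENDS>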
4. Distributed Learning Architecture based on Edge-cloud Collaboration
At present, the common method is to train distributed models by
combining model splitting with distributed training.
4.1. Model Splitting
Since each layer of an artificial intelligence model has independent
inputs and outputs, a model can be split into multiple sub-models for
independent training; the training layer that links the sub-models is
called the segmentation layer. This method provides the basis for
edge-cloud collaborative training.
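A minimal sketch of such a split, assuming a PyTorch-style sequential
model (PyTorch is an illustrative choice, not mandated by this
document):

   <CODE BEGINS>
   # Non-normative sketch: split a sequential model into an edge
   # sub-model and a cloud sub-model at a given layer index.
   import torch.nn as nn

   model = nn.Sequential(            # toy four-layer model
       nn.Linear(16, 32), nn.ReLU(),
       nn.Linear(32, 8), nn.ReLU())

   def split_model(model, cut):
       # The output of layers[:cut] is the segmentation layer
       # that links the two sub-models.
       layers = list(model.children())
       edge_part = nn.Sequential(*layers[:cut])   # trained at edge
       cloud_part = nn.Sequential(*layers[cut:])  # trained in cloud
       return edge_part, cloud_part

   edge_part, cloud_part = split_model(model, 2)
   <CODE ENDS>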
In order to maintain synchronization in the data-parallel process,
the training time of all edge nodes needs to be consistent. Before
the model is divided, the computing resources required by each layer
are first calculated, and the model is split according to the
computing resources currently available. The model splitting in this
document is dynamic; that is, the splitting scheme may differ for
each round of training.
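One assumed (non-normative) policy for such dynamic splitting is to
walk the per-layer compute costs and keep layers on the edge device
until its currently available compute budget for the round is
exhausted:

   <CODE BEGINS>
   # Non-normative sketch: pick the split point from per-layer
   # costs and the edge device's remaining compute budget.
   def choose_split(layer_costs_flops, edge_budget_flops):
       used = 0
       for i, cost in enumerate(layer_costs_flops):
           if used + cost > edge_budget_flops:
               return i   # first layer that moves to the cloud
           used += cost
       return len(layer_costs_flops)

   # Re-evaluated each round as available resources change.
   layer_costs = [2e9, 4e9, 8e9, 8e9]  # assumed per-layer FLOPs
   print(choose_split(layer_costs, 10e9))  # -> 2
   <CODE ENDS>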
4.2. Distributed Learning Architecture based on Edge-cloud Collaboration
Edge devices provide services to nearby users and collect the data
generated in the process of providing those services in real time,
forming edge data sets. After an edge device has collected enough
data, it sends a model training request to the cloud node. After the
cloud node has received the training requests from all edge devices,
it prepares for model training, which consists of model determination
and data standardization. For model determination, the cloud node
determines the model architecture according to the training task and
sends it to all edge devices. For data standardization, the data set
needs to be standardized before training in order to reduce the
amount of computation in the training process; common methods include
normalization, log transformation, and regularization. The
standardization method is determined by the cloud node, the
corresponding algorithm is sent to the edge devices, and each edge
device processes its edge data set according to that algorithm.
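This division of roles can be sketched as follows; the choice of
min-max normalization and all names are illustrative assumptions.
The cloud node selects the method and its parameters once, and every
edge device applies the same algorithm to its local data set:

   <CODE BEGINS>
   # Non-normative sketch of cloud-determined standardization.
   def make_minmax_scaler(lo, hi):
       span = hi - lo
       return lambda x: (x - lo) / span if span else 0.0

   # Cloud side: choose method and parameters, send to edges.
   standardize = make_minmax_scaler(lo=0.0, hi=255.0)

   # Edge side: apply the received algorithm to the edge data.
   edge_data = [0.0, 51.0, 255.0]
   print([standardize(x) for x in edge_data])  # [0.0, 0.2, 1.0]
   <CODE ENDS>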
After these preparations are completed, the model training phase
begins. In order to ensure the quality of model training, it is
necessary to ensure the consistency of the training delay across all
edge devices and the consistency of the model transmission delay. The
training delay and transmission delay are set by the cloud node based
on historical experience. At present, computing power network
calculation can estimate the training time of a training task.
Therefore, for the training latency, the model architecture can be
used to calculate the floating-point operations of each layer of the
model, from which the training time of each layer can be
extrapolated; the number of layers to be trained by the edge device
can then be determined based on the training latency budget. At the
same time, the data volume of the segmentation layer can be
determined, and bandwidth can be reserved for the segmentation layer
in advance using deterministic network technology. The edge device
trains its part of the model and, after the training is completed,
sends the segmentation layer of this pre-trained model to the cloud
node. After receiving the segmentation layer, the cloud node
completes the subsequent training of the model and then updates the
model weights according to the back-propagation algorithm. At this
point, the edge device and the cloud node have completed one round of
model training. After every five rounds of training, all edge devices
generate a global model through distributed learning, and each edge
device then continues local training starting from the global model,
until the model accuracy meets the requirements.
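The planning step described above can be illustrated with assumed
numbers (none of which are normative): estimate per-layer training
time from layer FLOPs and the edge device's throughput, count how
many layers fit within the training-delay budget, and derive the
bandwidth to reserve for the segmentation-layer transfer.

   <CODE BEGINS>
   # Non-normative sketch of the cloud's per-round planning.
   layer_flops = [1e9, 2e9, 4e9, 4e9]      # assumed layer costs
   seg_layer_bytes = [1e6, 2e6, 4e6, 4e6]  # assumed output sizes
   edge_flops_per_s = 5e9                  # assumed throughput
   train_budget_s = 1.0                # cloud-set training delay
   tx_budget_s = 0.2                   # cloud-set transfer delay

   elapsed, n_edge_layers = 0.0, 0
   for cost in layer_flops:
       t = cost / edge_flops_per_s     # estimated layer time
       if elapsed + t > train_budget_s:
           break
       elapsed += t
       n_edge_layers += 1

   seg_bytes = seg_layer_bytes[n_edge_layers - 1]
   reserve_bps = seg_bytes * 8 / tx_budget_s
   print(n_edge_layers, reserve_bps / 1e6, "Mbit/s")  # 2 80.0
   <CODE ENDS>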
5. Manageability Considerations
TBD
6. Security Considerations
TBD
7. IANA Considerations
This document requires no IANA actions.
8. References
[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017, <https://www.rfc-editor.org/info/rfc8174>.
Acknowledgments
TBD
Authors' Addresses
Chao Li
Beijing University of Posts and Telecommunications
Email: lc96@bupt.edu.cn
Hui Yang
Beijing University of Posts and Telecommunications
Email: yanghui@bupt.edu.cn
Zhengjie Sun
Beijing University of Posts and Telecommunications
Email: sunzhengjie@bupt.edu.cn
Sheng Liu
China Mobile
Email: liushengwl@chinamobile.com
Haomian Zheng
Huawei Technologies
Email: zhenghaomian@huawei.com