Internet DRAFT - draft-du-computing-resource-representation
draft-du-computing-resource-representation
Network Working Group Z. Du
Internet-Draft Y. Fu
Intended status: Informational China Mobile
Expires: 12 January 2023 11 July 2022
Computing Resource Representation in Computing Aware Networking
draft-du-computing-resource-representation-01
Abstract
This document introduces the way of encoding service-specific
information and the way of signaling it in the network.
Requirements Language
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 12 January 2023.
Copyright Notice
Copyright (c) 2022 IETF Trust and the persons identified as the
document authors. All rights reserved.
Du & Fu Expires 12 January 2023 [Page 1]
Internet-Draft Computing Resource Representation in CAN July 2022
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2
2. Definition of Terms . . . . . . . . . . . . . . . . . . . . . 3
3. Requirements of Computing Resource Representation and
Signaling . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1. Requirements of Computing Resource Representation . . . . 4
3.2. Requirements of Computing Resource Signaling . . . . . . 4
4. Representation of Computing Information . . . . . . . . . . . 5
4.1. Representation of Computing Metric . . . . . . . . . . . 6
4.1.1. Representing in a Single value . . . . . . . . . . . 6
4.1.2. Representing in Multiple values . . . . . . . . . . . 6
4.2. Example Process of Computing Load Information . . . . . . 7
5. Signaling of Computing Information . . . . . . . . . . . . . 8
5.1. General Process of Informing . . . . . . . . . . . . . . 8
5.2. BGP Method in Informing . . . . . . . . . . . . . . . . . 9
5.3. Other Methods in Informing . . . . . . . . . . . . . . . 10
6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 10
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 10
8. Security Considerations . . . . . . . . . . . . . . . . . . . 10
9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 11
10. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 11
11. References . . . . . . . . . . . . . . . . . . . . . . . . . 11
11.1. Normative References . . . . . . . . . . . . . . . . . . 11
11.2. Informative References . . . . . . . . . . . . . . . . . 11
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 12
1. Introduction
Traditionally, the network can only do traffic engineering according
to the network statuses. As the trend of computing and network
convergence, some works are proposed for network to be aware of
service information, and can make a better choice in the traffic
steering accordingly. Computing Aware Networking (CAN) could steer
the traffic based on both the network and computing statuses, which
is considered as a mechanism for computing and network convergence.
Du & Fu Expires 12 January 2023 [Page 2]
Internet-Draft Computing Resource Representation in CAN July 2022
In the traditional network architecture, the network is only
responsible for delivering packets between servers and clients, and
is not aware of the computing information.
[I-D.liu-dyncast-ps-usecases] and [I-D.liu-dyncast-reqs] show that,
when service instances are deployed at multiple geographical edge
sites, CAN would achieve service equivalence and load balancing by
considering both the service metrics and network metrics.
However, the method of notifying the service metrics in the network,
representation of computing resources, and signaling of computing
resource to the network are still uncertain, which is important for
the network domain to know about the computing domain.
This document dose further explorations on the way of service metrics
encoding and signaling. Some requirements about the service metric
representation and signaling can be found in the document
[I-D.liu-dyncast-gap-reqs].
2. Definition of Terms
This document makes use of the following terms:
Computing-Aware Networking (CAN): Aiming at computing and network
resource optimization by steering traffic to appropriate computing
resources considering not only routing metric but also computing
resource metric and service affiliation.
Service: A monolithic functionality that is provided by an endpoint
according to the specification for said service. A composite
service can be built by orchestrating monolithic services.
Service instance: Running environment (e.g., a node) that makes the
functionality of a service available. One service can have several
instances running at different network locations.
Service identifier: Used to uniquely identify a service, at the same
time identifying the whole set of service instances that each
represent the same service behavior, no matter where those service
instances are running.
Computing capacity: The ability of nodes with computing resource
achieve specific result output through data processing,
specifically including computing, communication, memory and storage
capacity.
3. Requirements of Computing Resource Representation and Signaling
Du & Fu Expires 12 January 2023 [Page 3]
Internet-Draft Computing Resource Representation in CAN July 2022
3.1. Requirements of Computing Resource Representation
The CAN needs to obtain the computing information of the computing
resource for a service, to realize the traffic steering considering
both network and computing status. As described in
[I-D.liu-dyncast-reqs], the representation and encoding of computing
metric is crucial, which is conveyed to CAN system to support the CAN
components to act upon. The representation needs to express the
capabilities of computing resources accurately, and the CAN system
must agree on the service-specific metrics and their representation
between service elements in the participating edges for the CAN
components to act upon them.
Moreover, the computing resource representation need to consider the
computing modeling as the requirements described in
[I-D.liu-can-computing-resource-modeling]:
Support the representation of computing resources in multiple
dimensions, including computing capacity, communication capacity,
cache capacity and storage capacity.
Support the representation of the computing capacity in chip
category, such as CPU, GPU, FPGA, ASIC, and in computing type, such
as int calculation, float calculation and hash calculation.
3.2. Requirements of Computing Resource Signaling
The representation results of computing resources need to be exposed
in the network to support the efficient utilizing of computing
resources, or joint utilizing of both computing resources and network
resources as describe in [I-D.liu-dyncast-reqs]. CAN aims at dynamic
scenarios of which the status of computing resources may vary
frequently, e.g., changing with the number of sessions, CPU/GPU
utilization and memory space. More frequent distribution of more
accurate synchronization of the real-time representation of computing
resources may result in more overhead in terms of signaling. Thus,
the signaling of computing resources needs to distribute and
synchronize the real-time representation of computing resources
efficiently to reduce the unnecessary signaling and meet the service
requirements. The requirements contain several aspects as described
below.
Support to signal various message based on the representation of
computing resources.
Support to control the signaling rate, such as define at what
interval or events to signal the information of computing resources.
Du & Fu Expires 12 January 2023 [Page 4]
Internet-Draft Computing Resource Representation in CAN July 2022
Support to signal the updated information of computing resources.
Support to implement mechanisms for loop avoidance in signaling
metrics, when necessary.
4. Representation of Computing Information
The main job of the network is to forward the packets of the users
from the source to the destination, while the main job of the
computing is to complete the various tasks of the users.
The network metrics include the bandwidth, latency, jitter, etc.
They can describe the capabilities of the network, and are
independent of the detailed realization of the underlayer
technologies, such as the mode of the optical fiber, or the structure
of a switch.
The computing metrics are more complex, which is hard to match the
QoS/QoE. For example, if the task is the AI computing, such as the
image processing, the computing resource can be measured by using
FLOPS (Floating-point Operations Per Second) or TFLOPS (Tera FLOPS).
However, it is more difficult to get the process time, which will be
influenced by the current utilization rate of CPU, cache, and so on.
Even some real-time OS or protocol are used, sometimes it will fail
because of the deadlock or other mechanisms of OS.That is not to say
there is any problem with the OS, but the complex environment in it.
So, the service metric will consider more factors to judge the
performance, and how to be used in another domain to guarantee the
E2E service quality.
[I-D.liu-can-computing-resource-modeling] proposes a basic
architecture of computing resource modeling, which considers the
computing hardware types, computing task types, communication, cache,
storage status, and uses the vector to represent the basic result of
modeling. The vector could be:
a group of multiple vectors, to represent the evaluated level of
computing, communication, cache, and storage capacity.
a single vector, to represent the single comprehensive level of
overall capacity.
How to use the vector depends on the specific application domain.
For the network, to preserve the metadata privacy of computing
domain, usually, weighted or fuzzy processing methods are used.
Du & Fu Expires 12 January 2023 [Page 5]
Internet-Draft Computing Resource Representation in CAN July 2022
4.1. Representation of Computing Metric
How to use the vector depends on the specific application demands.
To preserve the metadata privacy of computing domain, usually, the
weighted or fuzzy processing methods are used by CAN.
Based on [I-D.liu-can-computing-resource-modeling], to use the
information of computing resource for network, we can use two general
ways to represent them. One is to use single vector to represent the
level, the other is to use a group of vectors to represent more
detailed information.
4.1.1. Representing in a Single value
At one aspect, we can offer a general computing load information to
the ingress nodes. As an example, we perhaps only need to three
values:
one red value stands for the busy status,
one yellow value stands for relatively busy status,
one green value stands for free status.
Therefore, the ingress node only needs to consider the yellow edge
sites and green edge sites when steering traffic, in which the green
ones are more preferred.
4.1.2. Representing in Multiple values
At the other aspect, we can also offer detailed computing related
information but also are expected to be the weighted value as
described in [I-D.liu-can-computing-resource-modeling], such as
computing capacity information includes chips category and computing
task category, communication information, cache information and
storage information.
Moreover, some additional information could also be represented if
needed:
the service information deployed on edge sites, for example, Service
ID,
the maximum session number that the edge sites can provide,
the current session number that the edge sites can provide,
the available computing infrastructure of the server, etc.
Du & Fu Expires 12 January 2023 [Page 6]
Internet-Draft Computing Resource Representation in CAN July 2022
Those information may be optional and encoded as TLVs. A specific
service may have a specific preferred set of TLVs. For example, if
multiple instances have the same free status, the additional TLVs
could be used to represent the computing resources. The detailed
decision algorithm is out of scope of this document.
The informing of the TLVs should be service-specific and on-demand.
Different services may care about or have subscribed different sets
of TLVs. Besides, if an Ingress node receives any TLV that it does
not support, the Ingress node can just ignore it.
4.2. Example Process of Computing Load Information
For a specific service, we can offer both a general computing load
information and some more specific information about the computing.
A general process about it is described as below.
Step1: The service instances are deployed in multiple edge sites.
The ingress nodes of network working as the load balancing point
needs to obtain the computing information. The service should have a
specific SID, for example SID1, in the network, so that the ingress
node can recognize and treat the service request differently
according to SID.
Step2: After obtaining the computing information of a service related
to ServiceID1 from multiple edge sites, the ingress nodes should
record the computing information. Meanwhile, an ingress node should
also be able to obtain network status, for example the latency to the
egress of an edge site and record it.
Step3: An ingress node receives a packet targeted to the ServiceID1.
According to the service metrics and network metrics it has recorded,
the ingress node makes a decision about which edge site to use and
forward the packet to the related egress. The selection method may
be depended on the service. For example, it may be the one with the
lowest latency among the ones that can offer the service, or the one
with the best computing resource among the ones that have a latency
fulfilling the service requirements, or a hybrid method.
The purpose of the procedure is to find an edge site that is
relatively near to the client, and also have enough computing
resource for the service. However, the edge sites that provide the
service may be various, and perhaps have different computing
abilities. Therefore, a load balancing method considering the
computing resource is useful in this scenario.
Du & Fu Expires 12 January 2023 [Page 7]
Internet-Draft Computing Resource Representation in CAN July 2022
5. Signaling of Computing Information
The target of CAN is to steer traffic considering both network and
computing resource status. To meet the use case demands in
[I-D.liu-dyncast-ps-usecases], an "on-path" decision is expected.
For instance, the Ingress of the network works as the decision point
to steer the traffic of the users. In this situation, the Ingress
needs to know the computing information of the service instance,
which could be behind the Egress. Among the computing information,
some are relatively static, and some are dynamic. They may be
delivered by using different means, and at different frequencies.
Besides of the computing resource modeling and computing resource
representation, CAN should also focus on how to deliver the computing
information from the Egress to the Ingress.
5.1. General Process of Informing
For the signaling of the computing information, a general process
about it is described as below.
Step1: The gateway of the edge site collects the computing status
information of the specific service instance or a categorized
service. In some cases, there will be the controller in the edge
site, which can help to collect the information and notify the
gateway.
Step2: The Egress of CAN receives the service status information from
the gateway of the edge site and notify the CAN ingress nodes.
In the first step, the controller or the gateway perhaps can
communicate by PCE or other protocol for the controller. In the
second step, the controller-based method can also be used; however,
communications between the controller of the edge site and the
controller of the network may be complicated and inefficient.
In the following section, we propose some potential ways to notify
computing information, including the BGP extension, and others
potential methods. When we are notifying that the edge sites have
the service, i.e., a binding address for the service and the
corresponding route to it, we can add additional computing
information in its Extended Community.
Du & Fu Expires 12 January 2023 [Page 8]
Internet-Draft Computing Resource Representation in CAN July 2022
5.2. BGP Method in Informing
As the informing of the computing information is for the edge network
nodes, we can consider using BGP, specifically the MP-BGPRFC 4760
[RFC4760] . BGP is a gateway protocol that enables the network to
exchange routing information between Autonomous Systems (AS). MP-BGP
allows VPN edge nodes to exchange client information via different
underlay networks (e.g., MPLS). As said before, we can add the
computing information in the Extended Community.
When we notify the route for the specific service (naming as
ServiceID1) whose address is an anycast address, in a BGP UPDATE
message, the route can include many Path Attributes. The Extended
Community is one of the Attributes defined in RFC 4360 [RFC4360].
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Flag | Status | Sub-tlvs |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 1. Format of the Computing Information in BGP
Type: TBD, for example, 0x0314.
Length: This refers to the total length in octets of the element
excluding the Type and Length fields.
Flag: all zero.
Status: the first two bits are used.
Sub-tlvs: the sub-tlvs related to computing information.
One example of the Sub-tlvs is that the value of FLOPS that is widely
used in the AI analysis scenarios. For some services that need a
large amount of computing resources, we can also provide a general
computing grade information of a server, such as large, middle, or
small.
Besides the computing information, BGP can also be extended to
exchange some other information for the CAN. While notifying the
load of the computing, the network can also monitor the whole load
balancing system. If any service becomes heavy load, i.e., all the
service instances for the service are busy, the network should be
able to inform potential inactive service points to join in the LB.
Similarly, if any service becomes light-load, i.e. all the service
Du & Fu Expires 12 January 2023 [Page 9]
Internet-Draft Computing Resource Representation in CAN July 2022
instances for the service are relaxed, the network should also be
able to inform one active service points to become inactive to
release the resource to other services.
What needs to be considered more is the update frequency. The UPDATE
message is sent when the network topology, path, or other status
change, not cyclical. There should be a match mechanism of the
computing status change of edge sites, considering the effectiveness
for a given period of time, and preventing the overload of network
caused by the notification of network status update, for instance, a
set threshold.
5.3. Other Methods in Informing
The computing information can be treated similarly to the OAM
(Operations, Administration and Maintenance) information in the
network. Therefore, it should also be able to be carried in the OAM
message with some proper extensions to current OAM mechanisms.
Therefore, the load balancing point can collect network information
via OAM mechanisms, and it can collect computing information via OAM
mechanisms.
Some network programming mechanisms such as SRv6 can also be
considered here. The computing information can be carried in some
places of the IPv6 extension headers. For example, some data packets
from the Egress to the Ingress can carry the computing information.
The insertion of the computing information can take place on the
Egress. It can be on-demand or periodically.
Besides BGP, OAM and network programming mechanisms, if needed, the
CAN specific methodology of computing information notification could
also be further formulated.
6. Conclusion
This document analyzes the requirements of computing representation
and signaling, proposing some potential method to achieve them, which
are the key functions of CAN.
7. IANA Considerations
TBD.
8. Security Considerations
TBD.
Du & Fu Expires 12 January 2023 [Page 10]
Internet-Draft Computing Resource Representation in CAN July 2022
9. Acknowledgements
TBD.
10. Contributors
The following people have substantially contributed to this document:
Linda Dunbar
11. References
11.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC4360] Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
Communities Attribute", RFC 4360, DOI 10.17487/RFC4360,
February 2006, <https://www.rfc-editor.org/info/rfc4360>.
[RFC4760] Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
"Multiprotocol Extensions for BGP-4", RFC 4760,
DOI 10.17487/RFC4760, January 2007,
<https://www.rfc-editor.org/info/rfc4760>.
11.2. Informative References
[I-D.liu-can-computing-resource-modeling]
Liu, P., Du, Z., Rui, L., Li, W., Li, C., and G. Huang,
"Computing Resource Modeling for CAN", Work in Progress,
Internet-Draft, draft-liu-can-computing-resource-modeling-
00, 11 July 2022, <https://www.ietf.org/archive/id/draft-
liu-can-computing-resource-modeling-00.txt>.
[I-D.liu-dyncast-gap-reqs]
Liu, P., Jiang, T., Eardley, P., Trossen, D., and C. Li,
"Dynamic-Anycast (Dyncast) Gap analysis and Requirements",
Work in Progress, Internet-Draft, draft-liu-dyncast-gap-
reqs-00, 8 July 2022, <https://www.ietf.org/archive/id/
draft-liu-dyncast-gap-reqs-00.txt>.
[I-D.liu-dyncast-ps-usecases]
Liu, P., Eardley, P., Trossen, D., Boucadair, M.,
Contreras, L. M., and C. Li, "Dynamic-Anycast (Dyncast)
Use Cases and Problem Statement", Work in Progress,
Du & Fu Expires 12 January 2023 [Page 11]
Internet-Draft Computing Resource Representation in CAN July 2022
Internet-Draft, draft-liu-dyncast-ps-usecases-03, 7 March
2022, <https://www.ietf.org/archive/id/draft-liu-dyncast-
ps-usecases-03.txt>.
[I-D.liu-dyncast-reqs]
Liu, P., Jiang, T., Eardley, P., Trossen, D., and C. Li,
"Dynamic-Anycast (Dyncast) Requirements", Work in
Progress, Internet-Draft, draft-liu-dyncast-reqs-02, 7
March 2022, <https://www.ietf.org/archive/id/draft-liu-
dyncast-reqs-02.txt>.
Authors' Addresses
Zongpeng Du
China Mobile
No.32 XuanWuMen West Street
Beijing
100053
China
Email: duzongpeng@foxmail.com
Yuexia Fu
China Mobile
No.32 XuanWuMen West Street
Beijing
100053
China
Email: fuyuexia@chinamobile.com
Du & Fu Expires 12 January 2023 [Page 12]