Network Working Group S. Randriamasy
Internet-Draft Nokia Bell Labs
Intended status: Informational L. M. Contreras
Expires: 25 April 2024 Telefonica
J. Ros-Giralt
Qualcomm Europe, Inc.
23 October 2023
Joint Exposure of Network and Compute Information for Infrastructure-
Aware Service Deployment
draft-xxx-operational-compute-metrics-00
Abstract
Service providers are starting to deploy computing capabilities
across the network for hosting applications such as AR/VR, vehicle
networks, IoT, and AI training, among others. In these distributed
computing environments, information about computing and communication
resources is necessary to determine both the proper deployment
location of each application and the best server location on which to
run it. This information is used by numerous different
implementations with different interpretations. This document
proposes an initial approach towards a common understanding and
exposure scheme for metrics reflecting compute capabilities.
About This Document
This note is to be removed before publishing as an RFC.
The latest revision of this draft can be found at
https://giralt.github.io/draft-xxx-operational-compute-metrics/draft-
xxx-operational-compute-metrics.html. Status information for this
document may be found at https://datatracker.ietf.org/doc/draft-xxx-
operational-compute-metrics/.
Source for this draft and an issue tracker can be found at
https://github.com/giralt/draft-xxx-operational-compute-metrics.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Randriamasy, et al. Expires 25 April 2024 [Page 1]
Internet-Draft TODO - Abbreviation October 2023
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 25 April 2024.
Copyright Notice
Copyright (c) 2023 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction
2. Conventions and Definitions
3. Problem Space and Needs
4. Guiding Principles
5. Related Work
6. Gap Analysis
7. Security Considerations
8. IANA Considerations
9. References
9.1. Normative References
9.2. Informative References
Acknowledgments
Authors' Addresses
1. Introduction
Operators are starting to deploy distributed computing environments
in different parts of the network to address different service needs
such as latency, bandwidth, processing capabilities, and storage.
This translates into the emergence of a number of data centers (both
in the cloud and at the edge) of different sizes (e.g., large,
medium, small) characterized by distinct dimensions of CPU, memory,
and storage capabilities, as well as bandwidth capacity for
forwarding the traffic generated in and out of the corresponding data
center.
The proliferation of the edge computing paradigm further increases
the potential footprint and heterogeneity of the environments where a
function or application can be deployed, resulting in different unit
costs for CPU, memory, and storage. This increases the complexity of
deciding where a given function or application is best deployed or
executed. This decision should be jointly influenced by the
resources available in a given computing environment and by the
capabilities of the network path connecting the traffic source with
the destination.
Network- and compute-aware function placement and selection have
become of utmost importance in the last decade. The availability of
such information is taken for granted by the numerous service
providers and bodies that specify these mechanisms. However,
deployments may reach out to data centers running different
implementations with different understandings and representations of
compute capabilities, which makes smooth operation a challenge.
While standardization efforts on the representation and exposure of
network capabilities are well advanced, similar efforts on compute
capabilities are in their infancy.
This document proposes an initial approach towards a common
understanding and exposure scheme for metrics reflecting compute
capabilities. It aims to leverage existing work in the IETF on
compute metrics definitions to build synergies. It also aims to
reach out to working or research groups in the IETF that would
consume such information and have particular requirements.
2. Conventions and Definitions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
3. Problem Space and Needs
Visibility and exposure of both (1) network and (2) compute resources
to the application are critical to enable the proper functioning of
the new class of services arising at the edge (e.g., distributed AI,
driverless vehicles, and AR/VR). To understand the problem space and
the capabilities that today's protocol interfaces lack to enable
these new services, we focus on the life cycle of a service.
At the edge, compute nodes are deployed near communication nodes
(e.g., co-located in a 5G base station) to provide computing services
that are close to users, with the goal of (1) reducing latency, (2)
increasing communication bandwidth, (3) enabling privacy and
personalization (e.g., federated AI learning), and (4) reducing cloud
costs and energy consumption. Services are deployed on the
communication and compute infrastructure through a two-phase life
cycle that involves first a _service deployment_ stage and then a
_service selection_ stage (Figure 1).
+-------------+      +--------------+      +-------------+
|             |      |              |      |             |
|    New      +------>   Service    +------>   Service   |
|  Service    |      |  Deployment  |      |  Selection  |
|             |      |              |      |             |
+-------------+      +--------------+      +-------------+
Figure 1: Service life cycle.
*Service deployment.* This phase is carried out by the service
provider and consists of deploying a new service (e.g., distributed
AI training/inference or an XR/AR service) on the communication and
compute infrastructure. The service provider needs to properly size
the amount of communication and compute resources assigned to this
new service to meet the expected user demand. The decision on where
the service is deployed and how many resources are requested from the
infrastructure depends on the level of QoE that the provider wants to
guarantee to the user base. To make a proper deployment decision,
the provider must have visibility into the resources available from
the infrastructure, including communication resources (e.g., latency
and bandwidth) and compute resources (e.g., CPU, GPU, memory, and
storage). For instance, to run a Large Language Model (LLM) with 175
billion parameters, a total aggregated memory of 400 GB and 8 GPUs
are needed. The service provider needs an interface to query the
infrastructure, extract the available compute and communication
resources, and decide which subset of resources is needed to run the
service.
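As a minimal sketch of the deployment decision described above, assume
the infrastructure exposes, per site, the free memory, the free GPU
count, and the latency to the user base. The field names
(free_memory_gb, free_gpus, latency_ms), the site names, the latency
bound, and the inventory values are hypothetical and are not defined by
this document; only the 400 GB and 8 GPU requirement comes from the LLM
example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical view of one compute site as exposed to the provider."""
    name: str
    free_memory_gb: int   # aggregated memory currently available
    free_gpus: int        # GPUs currently available
    latency_ms: float     # latency between the site and the user base


@dataclass(frozen=True)
class Requirement:
    """Resources a service needs in order to be deployed."""
    memory_gb: int
    gpus: int
    max_latency_ms: float


def feasible_sites(sites, req):
    """Return the sites that satisfy every compute and communication bound."""
    return [
        s for s in sites
        if s.free_memory_gb >= req.memory_gb
        and s.free_gpus >= req.gpus
        and s.latency_ms <= req.max_latency_ms
    ]


# LLM example from the text: 400 GB of aggregated memory and 8 GPUs.
# The latency bound and the site inventory are illustrative assumptions.
llm = Requirement(memory_gb=400, gpus=8, max_latency_ms=50.0)
inventory = [
    Site("edge-a", free_memory_gb=128, free_gpus=2, latency_ms=5.0),
    Site("metro-b", free_memory_gb=512, free_gpus=8, latency_ms=20.0),
    Site("cloud-c", free_memory_gb=2048, free_gpus=64, latency_ms=80.0),
]

candidates = [s.name for s in feasible_sites(inventory, llm)]
print(candidates)
```

With this illustrative inventory, only "metro-b" remains: "edge-a" lacks
memory and GPUs, and "cloud-c" exceeds the assumed latency bound. The
point is that neither the compute view nor the communication view alone
is sufficient to reach the decision.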
*Service selection.* This phase is initiated by the user, through a
client application that connects to the deployed service. Two main
decisions must be made in the service selection stage: compute node
selection and path selection. In the compute node selection step,
as the service is generally replicated in N locations (e.g., by
leveraging a microservices architecture), the application must decide
which of the service replicas it connects to. As in the service
deployment stage, this decision requires knowledge about the
communication and compute resources available at each replica. In
the path selection step, the application must decide which path it
uses to connect to the
service. This decision depends on the communication properties
(e.g., bandwidth and latency) of the available paths. As in the
service deployment case, the service provider needs an interface to
query the infrastructure and extract the available compute and
communication resources, with the goal of making informed node and
path selection decisions. Ideally, the node and path selection
decisions should be jointly optimized, since in general the best
end-to-end performance is achieved by taking both decisions into
account together. In some cases, however, such decisions may be
owned by different players. For instance, in some network
environments, the path selection may be decided by the network
operator, whereas the node selection may be decided by the
application. Even in these cases, it is crucial to have a proper
interface (for both the network operator and the service provider) to
query the available compute and communication resources from the
system.
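The benefit of joint optimization can be illustrated with a small
sketch. It assumes, hypothetically, that the compute side exposes an
expected processing delay per replica and the network side exposes a
delay per path, so that both can be added into one end-to-end figure.
All names and values below are illustrative and are not defined by this
document.

```python
# Hypothetical exposed information; names and values are illustrative only.
# Compute side: expected processing delay (ms) at each service replica.
replica_delay_ms = {"replica-1": 30.0, "replica-2": 8.0}

# Network side: delay (ms) of each available path, keyed by (replica, path).
path_delay_ms = {
    ("replica-1", "path-a"): 4.0,
    ("replica-1", "path-b"): 9.0,
    ("replica-2", "path-c"): 20.0,
    ("replica-2", "path-d"): 35.0,
}


def end_to_end_ms(pair):
    """Sum the path delay and the processing delay of the replica it reaches."""
    replica, _path = pair
    return path_delay_ms[pair] + replica_delay_ms[replica]


# Joint decision: minimize the end-to-end delay over all (replica, path) pairs.
joint = min(path_delay_ms, key=end_to_end_ms)

# Baseline: choose the fastest path using network information only.
separate = min(path_delay_ms, key=path_delay_ms.get)

print("joint:", joint, end_to_end_ms(joint))
print("network only:", separate, end_to_end_ms(separate))
```

With these values the joint decision selects "replica-2" over "path-c"
(28 ms end to end), whereas choosing on network delay alone selects
"path-a" towards "replica-1" (34 ms end to end), even though "path-a"
is the fastest path.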
Table 1 summarizes the problem space, the information that needs to
be exposed, and the stakeholders that need this information.
+====================+===============+==========================+
| Action to take     | Information   | Who needs it             |
|                    | needed        |                          |
+====================+===============+==========================+
| Service placement  | Compute and   | Service provider         |
|                    | communication |                          |
+--------------------+---------------+--------------------------+
| Service selection/ | Compute       | Network/service provider |
| node selection     |               | and/or application       |
+--------------------+---------------+--------------------------+
| Service selection/ | Communication | Network/service provider |
| path selection     |               | and/or application       |
+--------------------+---------------+--------------------------+
Table 1: Problem space, needs, and stakeholders.
4. Guiding Principles
The driving principles for designing an interface to jointly extract
network and compute information are as follows:
P1. Leverage metrics across working groups to avoid reinventing the
wheel. For instance:
* RFC 9439 [I-D.ietf-alto-performance-metrics] leverages IPPM
metrics from RFC 7679.
* Section 5.2 of [I-D.du-cats-computing-modeling-description]
considers delay as a good metric, since it is easy to use in both
compute and communication domains. RFC 9439 also defines delay as
part of the performance metrics.
* Section 6 of [I-D.du-cats-computing-modeling-description] proposes
to represent the network structure as graphs, which is similar to
the ALTO map services in [RFC7285].
P2. Aim for simplicity, while ensuring the combined efforts do not
leave technical gaps in supporting the full life cycle of service
deployment and selection. For instance, the CATS working group is
covering path selection from a network standpoint, while ALTO (e.g.,
[RFC7285]) covers the exposure of network information to the service
provider and the client application. However, there is currently no
effort being pursued to expose compute information to the service
provider and the client application for service placement or
selection.
5. Related Work
Some existing work has explored compute-related metrics. It can be
categorized as follows:
* References providing raw compute infrastructure metrics:
[I-D.contreras-alto-service-edge] includes references to cloud
management solutions (e.g., OpenStack and Kubernetes) that
administer the virtualization infrastructure and provide
information about raw compute infrastructure metrics.
Furthermore, [NFV-TST] describes processor, memory, and network
interface usage metrics.
* References providing compute virtualization metrics: [RFC7666]
provides several metrics as part of the Management Information
Base (MIB) definition for managing virtual machines controlled by
a hypervisor. The objects defined there refer to the resources
consumed by a particular virtual machine serving as host for
services or applications. Moreover, [NFV-INF] provides metrics
associated with virtualized network functions.
* References providing service metrics including compute-related
information: [I-D.dunbar-cats-edge-service-metrics] proposes
metrics associated with services running in compute
infrastructures. Some of these metrics do not depend on the
behavior of the infrastructure itself but on where the compute
infrastructure is topologically located.
6. Gap Analysis
From this related work it is evident that compute-related metrics can
serve several purposes, ranging from service instance instantiation
to service instance behavior, and then to service instance selection.
Some of the metrics could refer to the same object (e.g., CPU) but
with a particular usage and scope.
In contrast, network metrics are more uniform and straightforward.
It is therefore necessary to consistently define a set of metrics
that can assist operation across the different concerns identified so
far, so that networks and systems have a common understanding of the
perceived compute performance. Combined with network metrics, the
resulting view of network plus compute performance will support
informed decisions specific to each of the operational concerns
related to the different parts of a service life cycle.
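One way to picture the gap is that the same object, such as CPU,
appears under several scopes. The sketch below tags each metric with an
object, a scope, and a unit so that a consumer can tell them apart.
The scope names mirror the three categories of related work in Section
5; the record layout, metric names, units, and values are hypothetical
and are not proposed by this document.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Scopes mirroring the categories of related work in Section 5."""
    INFRASTRUCTURE = "infrastructure"   # raw compute infrastructure
    VIRTUALIZATION = "virtualization"   # per virtual machine
    SERVICE = "service"                 # per service instance


@dataclass(frozen=True)
class ComputeMetric:
    """Hypothetical metric record; not a format defined by this document."""
    obj: str        # object being measured, e.g. "cpu"
    scope: Scope    # usage and scope of the measurement
    name: str       # illustrative metric name
    unit: str
    value: float


metrics = [
    ComputeMetric("cpu", Scope.INFRASTRUCTURE, "cores-available", "cores", 48),
    ComputeMetric("cpu", Scope.VIRTUALIZATION, "vcpu-utilization", "percent", 63.5),
    ComputeMetric("cpu", Scope.SERVICE, "processing-delay", "ms", 12.0),
    ComputeMetric("memory", Scope.INFRASTRUCTURE, "memory-available", "GB", 256),
]


def by_object(items, obj):
    """Return the metrics that describe one object, keyed by scope."""
    return {m.scope: m for m in items if m.obj == obj}


cpu_views = by_object(metrics, "cpu")
for scope, metric in cpu_views.items():
    print(f"{scope.value}: {metric.name} = {metric.value} {metric.unit}")
```

Here three different CPU metrics coexist, each meaningful for a
different operational concern. Without an explicit scope and unit, a
consumer receiving only "CPU" could not know which of them it was
given.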
7. Security Considerations
TODO Security
8. IANA Considerations
This document has no IANA actions.
9. References
9.1. Normative References
[I-D.du-cats-computing-modeling-description]
Du, Z., Fu, Y., Li, C., Huang, D., and Z. Fu, "Computing
Information Description in Computing-Aware Traffic
Steering", Work in Progress, Internet-Draft, draft-du-
cats-computing-modeling-description-02, 23 October 2023,
<https://datatracker.ietf.org/doc/html/draft-du-cats-
computing-modeling-description-02>.
[I-D.ietf-alto-performance-metrics]
Wu, Q., Yang, Y. R., Lee, Y., Dhody, D., Randriamasy, S.,
and L. M. Contreras, "Application-Layer Traffic
Optimization (ALTO) Performance Cost Metrics", Work in
Progress, Internet-Draft, draft-ietf-alto-performance-
metrics-28, 21 March 2022,
<https://datatracker.ietf.org/doc/html/draft-ietf-alto-
performance-metrics-28>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/rfc/rfc2119>.
[RFC7285] Alimi, R., Ed., Penno, R., Ed., Yang, Y., Ed., Kiesel, S.,
Previdi, S., Roome, W., Shalunov, S., and R. Woundy,
"Application-Layer Traffic Optimization (ALTO) Protocol",
RFC 7285, DOI 10.17487/RFC7285, September 2014,
<https://www.rfc-editor.org/rfc/rfc7285>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
9.2. Informative References
[I-D.contreras-alto-service-edge]
Contreras, L. M., Randriamasy, S., Ros-Giralt, J., Perez,
D. A. L., and C. E. Rothenberg, "Use of ALTO for
Determining Service Edge", Work in Progress, Internet-
Draft, draft-contreras-alto-service-edge-10, 13 October
2023, <https://datatracker.ietf.org/doc/html/draft-
contreras-alto-service-edge-10>.
[I-D.dunbar-cats-edge-service-metrics]
Dunbar, L., Majumdar, K., Mishra, G. S., Wang, H., and H.
Song, "5G Edge Services Use Cases", Work in Progress,
Internet-Draft, draft-dunbar-cats-edge-service-metrics-01,
6 July 2023, <https://datatracker.ietf.org/doc/html/draft-
dunbar-cats-edge-service-metrics-01>.
[NFV-INF] "ETSI GS NFV-INF 010, v1.1.1, Service Quality Metrics", 1
December 2014, <https://www.etsi.org/deliver/etsi_gs/NFV-
INF/001_099/010/01.01.01_60/gs_NFV-INF010v010101p.pdf>.
[NFV-TST] "ETSI GS NFV-TST 008 V3.3.1, NFVI Compute and Network
Metrics Specification", 1 June 2020,
<https://www.etsi.org/deliver/etsi_gs/NFV-
TST/001_099/008/03.03.01_60/gs_NFV-TST008v030301p.pdf>.
[RFC7666] Asai, H., MacFaden, M., Schoenwaelder, J., Shima, K., and
T. Tsou, "Management Information Base for Virtual Machines
Controlled by a Hypervisor", RFC 7666,
DOI 10.17487/RFC7666, October 2015,
<https://www.rfc-editor.org/rfc/rfc7666>.
Acknowledgments
TODO acknowledge.
Authors' Addresses
S. Randriamasy
Nokia Bell Labs
Email: sabine.randriamasy@nokia-bell-labs.com
L. M. Contreras
Telefonica
Email: luismiguel.contrerasmurillo@telefonica.com
Jordi Ros-Giralt
Qualcomm Europe, Inc.
Email: jros@qti.qualcomm.com