A Control Framework for Unified Optical Networks and AI Computing Orchestration (UONACO)

Introduction

Distributed AI computing has become a dominant paradigm for delivering large-scale AI services, enabling providers to meet stringent performance and scalability requirements by leveraging geographically dispersed AI data centers (AIDCs). In such environments, the efficiency of distributed training, inference, and remote service access depends critically on tight coordination between optical transport networks and compute orchestration systems.

However, today's infrastructure operates with fundamentally isolated control planes. The optical transport layer, despite providing the high-bandwidth, low-latency, and deterministic backbone for wide-area AI collaboration, remains blind to the dynamic, heterogeneous demands of AI workloads. It cannot discern whether a traffic flow stems from a bandwidth-intensive distributed training job requiring synchronized all-reduce operations across thousands of GPUs, or from a latency-critical inference request demanding a sub-10 ms end-to-end response. Consequently, optical networks provision static or best-effort lightpaths without adapting to real-time compute intent, leading to underutilized spectral resources or, worse, congestion-induced stalls during critical gradient synchronization phases.

Conversely, AI compute schedulers (e.g., Kubernetes-based orchestrators in AIDCs) make placement decisions based solely on local GPU/CPU availability and memory capacity, with no awareness of the state of the underlying optical fabric, such as wavelength continuity, end-to-end propagation delay, per-link bandwidth headroom, or the presence of OXC-based reconfigurable paths. As a result, a training job may be split across geographically distant AIDCs with abundant but poorly interconnected GPU pools, causing prolonged communication phases and severe “compute efficiency loss.” Similarly, a low-latency inference service might be deployed in a remote AIDC simply because it has idle GPUs, even though the optical path violates the application's SLA due to high round-trip delay or lack of dedicated wavelength isolation.

To address these challenges, this document introduces the Unified Optical Networks and AI Computing Orchestration (UONACO) framework. UONACO establishes a unified control architecture that enables bidirectional signaling, joint resource modeling, and synchronized orchestration between the compute and optical domains. The framework supports three representative service models: AI training, AI inference, and accessing remote AI inference services. By aligning network provisioning with compute intent, and vice versa, UONACO aims to improve the efficiency of wide-area collaborative AI computing.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here.

Service Model for AI Computing over Optical Network

The deployment of wide-area AI services over optical infrastructure involves multiple stakeholders, each playing a distinct role in the end-to-end service delivery chain. To clarify responsibilities and interactions, this document defines a service model comprising the Customer, Service Provider, Network Provider, and Computing Power Provider.

Customer

The Customer is the end user or enterprise that consumes AI capabilities. Three primary service patterns are observed:

• AI training: the customer delegates the training of large-scale AI models to service providers, typically specifying performance, scale, and data privacy requirements.
• AI inference: the customer leases computing resources to deploy and operate inference models, often serving downstream internet users with real-time or batch inference services.
• Accessing a remote AI inference service: the customer invokes pre-deployed inference APIs offered by third parties, expecting deterministic latency, reliability, and quality of service without managing the underlying infrastructure.

Service Provider

The Service Provider acts as the business orchestrator, interfacing directly with the Customer to translate high-level service intents (such as SLAs, geographic constraints, or performance targets) into concrete resource demands. It coordinates with both the Network Provider and the Computing Power Provider to fulfill these demands, and is responsible for service lifecycle management, billing, and customer support.

Network Provider

The Network Provider operates and manages the underlying optical transport infrastructure. It delivers high-bandwidth, low-latency, and deterministic connectivity services, including inter-AIDC backbone links and user-to-AIDC dedicated access circuits. The Network Provider exposes network capabilities (such as available bandwidth, path latency, and reliability) through standardized control interfaces to enable coordinated service provisioning.

Computing Power Provider

The Computing Power Provider owns and operates one or more Artificial Intelligence Data Centers (AIDCs). It offers compute, memory, and accelerator resources (e.g., GPUs, TPUs) for AI training and inference workloads. The Computing Power Provider reports real-time resource availability and performance metrics to the Service Provider and supports dynamic task placement and scaling based on orchestration instructions.

UONACO Control and Management Architecture

Overview

As shown in Figure 1, the UONACO framework establishes a layered control architecture that enables end-to-end coordination between service intent, compute resources, and optical transport infrastructure. This architecture comprises five core functional components (Customer, Service Orchestrator (SO), Unified Compute-Optical Orchestrator (UCOO), Transport Network Controller (TNC), and Computing Power Scheduler (CPS)) interconnected through three standardized interfaces.

Service Orchestrator

The SO serves as the business-facing interface of the UONACO framework. It is responsible for accepting AI service requests from customers, such as “deploy a distributed training job across multiple AIDCs with end-to-end latency under X ms” or “provision an inference service with Y GPU instances and guaranteed bandwidth”, and translating these intent-based specifications into structured resource requirements. The SO also handles service lifecycle management, including billing, SLA enforcement, and user authentication. It does not manage physical resources directly but instead communicates abstracted demands to the UCOO via the SUI.
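
The translation step described above can be sketched as follows. This is a minimal illustration only: the names ServiceIntent, ResourceDemand, and translate_intent, as well as every field and string value, are assumptions made for this example and are not defined by this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ServiceType(Enum):
    # The three service models named in this document (string values are hypothetical).
    TRAINING = "ai-training"
    INFERENCE = "ai-inference"
    REMOTE_ACCESS = "remote-inference-access"


@dataclass(frozen=True)
class ServiceIntent:
    """Customer-facing request as received by the SO (hypothetical fields)."""
    service_type: ServiceType
    gpu_count: int = 0
    max_latency_ms: Optional[float] = None
    min_bandwidth_gbps: Optional[float] = None


@dataclass(frozen=True)
class ResourceDemand:
    """Structured demand the SO would pass to the UCOO over the SUI."""
    service_type: str
    compute: dict
    network: dict


def translate_intent(intent: ServiceIntent) -> ResourceDemand:
    """Map a customer intent to abstract compute and network requirements."""
    if intent.gpu_count < 0:
        raise ValueError("gpu_count must be non-negative")
    # Remote access consumes a pre-deployed service, so no compute is requested.
    if intent.service_type is ServiceType.REMOTE_ACCESS:
        compute = {}
    else:
        compute = {"accelerator": "gpu", "count": intent.gpu_count}
    # Only constraints the customer actually stated are passed on.
    network = {}
    if intent.max_latency_ms is not None:
        network["max_latency_ms"] = intent.max_latency_ms
    if intent.min_bandwidth_gbps is not None:
        network["min_bandwidth_gbps"] = intent.min_bandwidth_gbps
    return ResourceDemand(intent.service_type.value, compute, network)
```

Leaving unstated constraints out of the demand, rather than filling in defaults, keeps the SO free of infrastructure policy; in this sketch any defaulting would be left to the UCOO.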

Unified Compute-Optical Orchestrator

The UCOO is the central coordination engine of the UONACO architecture. It receives service intents from the SO and continuously collects real-time telemetry from both the optical network (via TNC) and compute infrastructure (via CPS). Based on this global view, the UCOO executes joint optimization algorithms that consider both compute capabilities (e.g., GPU availability, memory) and network conditions (e.g., path latency, available bandwidth, congestion). The output of this decision process is a pair of synchronized instructions: one for optical path provisioning and another for compute task placement. The UCOO thus bridges the semantic and operational gap between the service layer and the infrastructure layer.
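
To make the joint decision concrete, the sketch below filters candidate AIDCs on both compute and network feasibility and then picks the lowest-latency survivor. It is a deliberately simple stand-in for the joint optimization algorithms mentioned above, which this document does not specify; ComputeState, PathState, joint_select, and the instruction dictionaries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeState:
    """Per-AIDC compute telemetry, as the CPS might report it (hypothetical)."""
    aidc: str
    free_gpus: int


@dataclass(frozen=True)
class PathState:
    """Per-AIDC path telemetry, as the TNC might report it (hypothetical)."""
    aidc: str
    latency_ms: float
    free_bandwidth_gbps: float


def joint_select(compute, paths, gpus, max_latency_ms, min_bandwidth_gbps):
    """Return (optical_instruction, compute_instruction), or None if infeasible."""
    path_by_aidc = {p.aidc: p for p in paths}
    feasible = []
    for c in compute:
        p = path_by_aidc.get(c.aidc)
        if p is None or c.free_gpus < gpus:
            continue  # unreachable AIDC, or not enough accelerators
        if p.latency_ms > max_latency_ms or p.free_bandwidth_gbps < min_bandwidth_gbps:
            continue  # enough GPUs, but the optical path would violate the SLA
        feasible.append((p.latency_ms, c.aidc))
    if not feasible:
        return None
    _, aidc = min(feasible)  # lowest latency wins; ties break on AIDC name
    optical = {
        "target_aidc": aidc,
        "bandwidth_gbps": min_bandwidth_gbps,
        "max_latency_ms": max_latency_ms,
    }
    placement = {"aidc": aidc, "gpus": gpus}
    return optical, placement
```

With an AIDC "A" offering 64 free GPUs over a 25 ms path and an AIDC "B" offering 16 free GPUs over a 4 ms path, a request for 8 GPUs with a 10 ms bound selects "B", whereas a scheduler that looks only at GPU counts would likely prefer "A".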

Transport Network Controller

The TNC represents the control plane of the underlying optical transport infrastructure. It may encompass a hierarchy of controllers, including intra-domain optical controllers and inter-domain coordinators (e.g., multi-domain WSON or OXC orchestrators). The TNC is responsible for managing physical and virtual optical resources (such as wavelengths, time slots, fgOTN/OSU slices, and OXC cross-connects) and for executing path computation, signaling, and protection mechanisms. In the UONACO framework, the TNC exposes network topology, available capacity, and performance metrics to the UCOO through the UOI, and applies provisioning commands issued by the UCOO to establish, adjust, or release optical connections in response to compute workload dynamics.
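
As an illustration of the path computation role, the following sketch runs Dijkstra's algorithm over a latency-weighted topology after pruning links that lack the requested bandwidth. It is an assumption-laden simplification: the function name and the link tuple layout are hypothetical, links are treated as bidirectional, and optical constraints such as wavelength continuity are ignored.

```python
import heapq


def lowest_latency_path(links, src, dst, min_bandwidth_gbps):
    """Return (total_latency_ms, [nodes]) or None if no feasible path exists.

    links: iterable of (node_a, node_b, latency_ms, free_bandwidth_gbps),
    each treated as bidirectional (a simplifying assumption).
    """
    graph = {}
    for a, b, latency, bandwidth in links:
        if bandwidth < min_bandwidth_gbps:
            continue  # prune links without enough headroom
        graph.setdefault(a, []).append((b, latency))
        graph.setdefault(b, []).append((a, latency))

    best = {src: 0.0}   # best known latency to each node
    prev = {}           # predecessor on the best known path
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            # Walk predecessors back to the source to rebuild the path.
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, latency in graph.get(node, []):
            cand = dist + latency
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    return None
```

Pruning before the search means the bandwidth requirement changes which path is considered best: a short route through a nearly full link is skipped in favor of a longer route with headroom.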

Computing Power Scheduler

The CPS acts as the controller for the AI compute pool, typically spanning one or more Artificial Intelligence Data Centers (AIDCs). It manages heterogeneous compute resources (including CPUs, GPUs, TPUs, memory, and storage) and reports their real-time availability, utilization, and performance characteristics (e.g., FLOPS, VRAM usage) to the UCOO. Upon receiving placement instructions from the UCOO via the UCI, the CPS schedules AI workloads (e.g., training jobs or inference containers) onto appropriate nodes, configures runtime environments, and ensures that compute tasks are aligned with the concurrently provisioned optical connectivity.
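
A minimal sketch of the node selection the CPS might perform when it receives a placement instruction is given below. The best-fit policy, the Node fields, and the schedule function are assumptions chosen for illustration; this document does not prescribe a scheduling policy.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A compute node inside an AIDC (hypothetical fields)."""
    name: str
    free_gpus: int
    free_mem_gb: int


def schedule(nodes, gpus, mem_gb):
    """Place a single-node workload using best fit; return the node name or None."""
    fits = [n for n in nodes if n.free_gpus >= gpus and n.free_mem_gb >= mem_gb]
    if not fits:
        return None
    # Best fit: leave the fewest GPUs idle on the chosen node, so that
    # larger nodes stay available for larger jobs. Ties break on node name.
    chosen = min(fits, key=lambda n: (n.free_gpus - gpus, n.name))
    chosen.free_gpus -= gpus
    chosen.free_mem_gb -= mem_gb
    return chosen.name
```

The updated free_gpus and free_mem_gb values are what the CPS would subsequently report back to the UCOO as availability telemetry.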

UONACO Interfaces

The UONACO framework defines three key interfaces, shown in Figure 1, to enable interoperability and decoupled evolution of its components.

SUI (SO-UCOO Interface): SUI connects SO and UCOO. Through this northbound interface, the SO conveys high-level service intent, including abstracted SLA requirements (e.g., maximum end-to-end latency, minimum bandwidth, geographic constraints), service type (e.g., AI training, inference, or remote access), and lifecycle events (e.g., service activation, modification, or termination). The UCOO interprets these intents as concrete resource demands and initiates joint optimization. The SUI thus serves as the bridge between business-oriented service definitions and infrastructure-aware orchestration.

UOI (UCOO-TNC Interface): UOI links UCOO with TNC. This interface enables bidirectional communication: the UCOO sends optical resource requests specifying required connectivity attributes such as bandwidth, end-to-end latency bounds, path isolation level, and resilience requirements; in return, the TNC provides real-time network state updates, including topology, available wavelengths or time slots, link utilization, propagation delay, and fault status. By exposing network capabilities and constraints to the orchestration layer, the UOI allows the UCOO to make network-feasible decisions and enables the TNC to provision optical paths that are aligned with compute workload dynamics.

UCI (UCOO-CPS Interface): UCI connects UCOO and CPS. Through this interface, the UCOO issues compute resource demands and task placement directives (such as the number and type of accelerators required, memory footprint, and preferred deployment topology) based on the outcome of joint compute-optical optimization. Conversely, the CPS reports real-time compute resource availability, node load, energy efficiency metrics, and task execution status (e.g., job progress, failure alerts). This feedback loop ensures that compute allocation respects both application requirements and the quality of the concurrently provisioned optical connectivity, thereby avoiding placements that would violate network SLAs.

These interfaces are designed to be protocol-agnostic but are expected to leverage standardized, model-driven approaches (e.g., YANG/NETCONF or RESTCONF) to ensure vendor neutrality and scalability.
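
As an illustration of the bidirectional exchange on the UOI, the sketch below encodes a connectivity request and decodes a network state update as JSON. The message names, keys, and default values are hypothetical placeholders, not a YANG model or an encoding defined by this document.

```python
import json


def build_uoi_request(src, dst, bandwidth_gbps, max_latency_ms,
                      isolation="shared", protection="none"):
    """Encode a UCOO-to-TNC connectivity request (all keys are hypothetical)."""
    if bandwidth_gbps <= 0 or max_latency_ms <= 0:
        raise ValueError("bandwidth and latency bound must be positive")
    body = {
        "uoi-connectivity-request": {
            "source": src,
            "destination": dst,
            "bandwidth-gbps": bandwidth_gbps,
            "max-latency-ms": max_latency_ms,
            "isolation": isolation,
            "protection": protection,
        }
    }
    return json.dumps(body, sort_keys=True)


def parse_uoi_state(payload):
    """Decode a TNC-to-UCOO state update into {(a, b): free_bandwidth_gbps}.

    Links flagged as faulty are dropped so the UCOO never plans over them.
    """
    doc = json.loads(payload)
    usable = {}
    for link in doc["uoi-network-state"]["links"]:
        if link.get("fault", False):
            continue
        usable[(link["a"], link["b"])] = link["free-bandwidth-gbps"]
    return usable
```

A model-driven deployment would presumably derive such payloads from a schema rather than hand-built dictionaries; the sketch only shows which attributes travel in each direction.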

IANA Considerations

TBD

Security Considerations

TBD