<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.31 (Ruby 3.2.3) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-intellinode-cats-in-network-scheduling-00" category="info" consensus="true" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.31.0 -->
  <front>
    <title abbrev="IntelliNode">IntelliNode: In-Network Intelligent Scheduling Extensions for CATS</title>
    <seriesInfo name="Internet-Draft" value="draft-intellinode-cats-in-network-scheduling-00"/>
    <author fullname="Teng Gao">
      <organization>Peng Cheng Laboratory</organization>
      <address>
        <email>gaot@pcl.ac.cn</email>
      </address>
    </author>
    <date year="2026" month="March" day="01"/>
    <area>Routing</area>
    <workgroup>Computing-Aware Traffic Steering</workgroup>
    <keyword>AI Networks</keyword>
    <keyword>In-Network Scheduling</keyword>
    <keyword>Tensor</keyword>
    <keyword>RoCEv2</keyword>
    <abstract>
      <?line 36?>

<t>This document introduces IntelliNode, an in-network intelligent scheduling mechanism built upon the Computing-Aware Traffic Steering (CATS) framework. Modern large-scale AI training and inference heavily rely on distributed heterogeneous clusters (GPU/CPU/FPGA). However, existing networks lack awareness of tensor semantics, training phases, and heterogeneous computing capabilities, leading to high communication latency, low resource utilization, and pipeline stalls.</t>
      <t>IntelliNode moves away from traditional passive scheduling paradigms that rely on probes and controllers. By bypassing the controller-mediated path and integrating FPGAs alongside programmable switch ASICs, it constructs a fast data-plane closed loop of "Perception-Inference-Decision-Execution". This architecture performs feature extraction at line rate, uses lightweight prediction models to infer short-term network behavior, and drives real-time heuristic scheduling decisions (e.g., path selection, tensor slicing, and compute matching). This document defines the four core functional layers and the extension signaling that support this architecture, laying the foundation for an AI-native, scalable distributed computing network.</t>
    </abstract>
  </front>
  <middle>
    <?line 42?>

<section anchor="introduction">
      <name>Introduction</name>
      <t>The CATS framework primarily addresses service instance selection and computing-aware traffic steering in general distributed systems. Large-scale AI training, distributed inference, and heterogeneous computing clusters, however, generate workloads whose traffic dynamics play out on the order of microseconds to milliseconds and whose tensor types are highly diverse (e.g., gradients, activations, parameters).</t>
      <t>Traditional CATS models (control-plane decisions combined with service-level steering) are inadequate for these next-generation computing workloads. IntelliNode proposes an extended architecture deeply embedded in the data plane. It not only natively processes RoCEv2 protocol semantics but also transforms the network from a "passive data pipe" into an "active, computing-aware collaborative engine".</t>
    </section>
    <section anchor="problem-statement">
      <name>Problem Statement</name>
      <t>Applying existing network scheduling mechanisms to AI training and heterogeneous computing networks reveals the following fundamental limitations:</t>
      <ul spacing="normal">
        <li>
          <t>Tensor Semantic Blind Spot: Existing mechanisms cannot distinguish specific semantics within data streams, such as gradients, activations, or parameter updates.</t>
        </li>
        <li>
          <t>Lag in End-to-End Feedback: Mechanisms such as ECN treat passive feedback as if it were active scheduling and assume that rate reduction is always the correct response. This ignores the computing semantics of AI inference, where certain flows cannot simply slow down and must instead wait or degrade precision.</t>
        </li>
        <li>
          <t>Excessive Control-Plane Decision Latency: Control-plane routing updates, which take hundreds of milliseconds to seconds, cannot handle transient congestion or iterative bursts within a 1-5 millisecond window.</t>
        </li>
        <li>
          <t>Conflict between Homogeneity Assumptions and Heterogeneous Reality: In cross-domain computing networks, node capabilities are highly uneven (e.g., GPU/FPGA hybrids). The network must possess global state awareness to accurately match computing power with communication workloads rather than assuming that all nodes are homogeneous.</t>
        </li>
      </ul>
    </section>
    <section anchor="architecture">
      <name>Architecture</name>
      <t>The IntelliNode architecture consists of four tightly coordinated functional layers that map onto the CATS abstractions of information collection, decision engine, and steering plane. This architecture combines the capabilities of programmable switches, FPGAs, and CPUs at the local node, enabling a microsecond-level closed loop without interrupting the packet forwarding path.</t>
      <section anchor="feature-extraction-layer-switch-asic">
        <name>Feature Extraction Layer (Switch ASIC)</name>
        <t>Deployed on Tofino-class programmable switch ASICs, this layer actively participates in RoCEv2 traffic management. It maintains a high-performance Queue Pair (QP) flow state machine. The switch collects and parses real-time features at line rate, including:</t>
        <ul spacing="normal">
          <li>
            <t>Basic Network Features: Ingress port, transmission rate, flow size, queue depth, and link utilization.</t>
          </li>
          <li>
            <t>AI Semantic Features: Tensor type (gradient / activation / normal traffic), tensor position within a batch/iteration, the stage of the model-parallel pipeline, and whether it is cross-node gradient-sync traffic.</t>
          </li>
          <li>
            <t>Flow State Classification: The hardware identifies the flow's current state as UNALLOCATED, SMALL_FLOW (delay-sensitive/control), LARGE_FLOW (high-bandwidth parameter synchronization), or DRAINING (tail-end flushing).</t>
          </li>
        </ul>
        <t>These features are extracted, normalized, and encoded at line rate, then written into a high-speed feature FIFO and sent directly to the onboard FPGA. In parallel, the pipeline updates and validates checksums in real time for mutable RoCEv2 fields (such as ECN and TTL markings) so that modified packets remain protocol-compliant.</t>
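        <t>The flow state classification above can be sketched in software as follows. This is a minimal illustration only, not the switch pipeline itself: the byte thresholds, the QpCounters record, and the classify function are hypothetical names and values that this document does not define, and a hardware implementation would express the same decisions as match-action rules.</t>
        <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass
from enum import Enum


class FlowState(Enum):
    UNALLOCATED = "UNALLOCATED"
    SMALL_FLOW = "SMALL_FLOW"
    LARGE_FLOW = "LARGE_FLOW"
    DRAINING = "DRAINING"


# Hypothetical thresholds; the draft does not define numeric values.
LARGE_FLOW_BYTES = 1_000_000
DRAIN_REMAINING_BYTES = 64_000


@dataclass
class QpCounters:
    bytes_seen: int      # bytes observed so far on this Queue Pair
    expected_bytes: int  # announced transfer size, 0 if unknown


def classify(qp: QpCounters) -> FlowState:
    """Map per-QP counters to one of the four flow states."""
    if qp.bytes_seen == 0:
        return FlowState.UNALLOCATED
    if qp.expected_bytes >= LARGE_FLOW_BYTES:
        # Known large transfer: DRAINING once only the tail remains.
        remaining = qp.expected_bytes - qp.bytes_seen
        if remaining <= DRAIN_REMAINING_BYTES:
            return FlowState.DRAINING
        return FlowState.LARGE_FLOW
    if qp.bytes_seen >= LARGE_FLOW_BYTES:
        # Size unknown, but enough bytes seen to treat it as large.
        return FlowState.LARGE_FLOW
    return FlowState.SMALL_FLOW
```
]]></sourcecode>
        <t>Under these assumed thresholds, a QP that has carried 4096 bytes of an unannounced transfer is a SMALL_FLOW, while an 8 MB parameter synchronization with 10 KB left to send is DRAINING.</t>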
      </section>
      <section anchor="state-prediction-layer-fpga">
        <name>State Prediction Layer (FPGA)</name>
        <t>The FPGA reads features from the feature FIFO and executes an ultra-low-latency, lightweight prediction model (e.g., State-GNN based on Graph Neural Networks, linear regression, or heuristic models). This layer focuses on predicting short-term network and load states within the next 1-5 milliseconds (ms):</t>
        <ul spacing="normal">
          <li>
            <t>Network State Prediction: Imminent congestion risks on switch ports and the available bandwidth of candidate routing paths in the next window.</t>
          </li>
          <li>
            <t>Computing Load Prediction: The arrival time of the next batch of periodic tensor traffic, and the probability of queuing backlogs or pipeline stalls at the downstream GPU.</t>
          </li>
        </ul>
        <t>These forward-looking prediction fields serve as the core input for the subsequent scheduling engine.</t>
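        <t>As an illustration of the linear regression option mentioned above, the following sketch extrapolates queue depth a few milliseconds ahead from evenly spaced samples. It is a software approximation under stated assumptions: the function name, the sampling interval, and the use of queue depth as the only input are hypothetical choices, and an FPGA implementation would use fixed-point arithmetic instead.</t>
        <sourcecode type="python"><![CDATA[
```python
from typing import Sequence


def predict_queue_depth(samples: Sequence[float],
                        interval_ms: float,
                        horizon_ms: float) -> float:
    """Extrapolate queue depth horizon_ms past the last sample.

    Fits depth = intercept + slope * t by ordinary least squares
    over samples taken every interval_ms, then evaluates the line
    at (time of last sample + horizon_ms). Depth is clamped at 0.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if n == 1:
        # A single point carries no trend; hold the last value.
        return max(0.0, float(samples[0]))

    times = [i * interval_ms for i in range(n)]
    mean_t = sum(times) / n
    mean_d = sum(samples) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(times, samples))
    slope = cov / var_t
    intercept = mean_d - slope * mean_t
    predicted = intercept + slope * (times[-1] + horizon_ms)
    return max(0.0, predicted)
```
]]></sourcecode>
        <t>A steadily growing queue yields a prediction above the latest sample, which the scheduling layer can treat as an imminent congestion risk; a shrinking queue is clamped at zero rather than extrapolated to a negative depth.</t>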
      </section>
      <section anchor="heuristic-scheduling-layer">
        <name>Heuristic Scheduling Layer</name>
        <t>The scheduling engine combines the currently extracted AI semantics with the predicted states output by the FPGA, and approximates Pareto optimality across conflicting objectives (e.g., computing latency vs. communication overhead). The decision logic is based on:</t>
        <ul spacing="normal">
          <li>
            <t>Tensor type and structural priority.</t>
          </li>
          <li>
            <t>Operator dependency.</t>
          </li>
          <li>
            <t>Heterogeneous computing capabilities of target nodes.</t>
          </li>
          <li>
            <t>Network and congestion predictions for the next 1-5 ms.</t>
          </li>
        </ul>
        <t>The decision outputs (execution actions) include:</t>
        <ul spacing="normal">
          <li>
            <t>Path and Priority: Selects the preferred path set. SMALL_FLOWs are prioritized based on arrival rate, while LARGE_FLOWs are allocated bandwidth dynamically, in proportion to target computing power, using a Weighted Deficit Round Robin (WDRR) policy.</t>
          </li>
          <li>
            <t>Tensor Slicing: Determines if tensor slicing is necessary, defining the number of slices and the independent routing path for each.</t>
          </li>
          <li>
            <t>Multipath Aggregation: Decides whether to enable data-plane multipath aggregation.</t>
          </li>
          <li>
            <t>In-Network Offloading: Decides whether to offload specific operators (e.g., Sum/Reduce) to in-network FPGAs or edge nodes.</t>
          </li>
        </ul>
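        <t>The WDRR policy named above for LARGE_FLOWs can be sketched as follows. This is a generic weighted deficit round robin, shown under assumptions: the base quantum, the function name, and the use of an integer weight to stand for target computing power are hypothetical, since this document does not specify how weights are derived.</t>
        <sourcecode type="python"><![CDATA[
```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Bytes of credit per unit weight per round; hypothetical value.
BASE_QUANTUM = 1500


def wdrr_schedule(queues: Dict[str, Deque[int]],
                  weights: Dict[str, int],
                  rounds: int) -> List[Tuple[str, int]]:
    """Run weighted deficit round robin for a number of rounds.

    queues maps a flow name to its pending packet sizes in bytes.
    weights maps a flow name to its weight (assumed here to reflect
    the target node's computing power).
    Returns the transmission order as (flow, packet_size) pairs.
    """
    deficits = {flow: 0 for flow in queues}
    sent: List[Tuple[str, int]] = []
    for _ in range(rounds):
        for flow, packets in queues.items():
            if not packets:
                deficits[flow] = 0  # idle flows accumulate no credit
                continue
            deficits[flow] += BASE_QUANTUM * weights[flow]
            # Send head-of-line packets while credit covers them.
            while packets and packets[0] <= deficits[flow]:
                size = packets.popleft()
                deficits[flow] -= size
                sent.append((flow, size))
            if not packets:
                deficits[flow] = 0  # drop leftover credit when empty
    return sent
```
]]></sourcecode>
        <t>With weights of 3 and 1 and equal 1500-byte packets, a single round releases three packets from the first queue and one from the second, so bandwidth is shared in proportion to the weights.</t>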
      </section>
      <section anchor="steering-plane">
        <name>Steering Plane</name>
        <t>The output of the heuristic scheduling must be applied to the entire network data plane via a lightweight signaling mechanism (potentially as an extension to CATS-SR or CATS-Overlay):</t>
        <ul spacing="normal">
          <li>
            <t>Control Plane Interface: Triggers the local CPU to update the routing/forwarding tables, applying the latest policies to the next batch of traffic automatically.</t>
          </li>
          <li>
            <t>Data Plane Labels / TLVs: Pushes extended metadata TLVs carrying tensor types, training phases, and compute resource requests into the packet header.</t>
          </li>
          <li>
            <t>Fragment Routing Encapsulation: Provides the encapsulation information needed by traffic that requires tensor slicing.</t>
          </li>
        </ul>
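        <t>To make the metadata TLVs concrete, the sketch below encodes and parses them using a one-octet type and a one-octet length. The layout, the code point values, and the function names are assumptions made for illustration; this document does not define a wire format, and the code points are subject to IANA allocation.</t>
        <sourcecode type="python"><![CDATA[
```python
import struct
from typing import Iterator, Tuple

# Hypothetical code points; no values have been assigned.
TLV_TENSOR_TYPE = 1
TLV_TRAINING_PHASE = 2
TLV_COMPUTE_CAPABILITY = 3


def encode_tlv(tlv_type: int, value: bytes) -> bytes:
    """Encode one TLV as type (1 octet), length (1 octet), value."""
    if len(value) > 255:
        raise ValueError("value too long for a one-octet length")
    return struct.pack("!BB", tlv_type, len(value)) + value


def decode_tlvs(buf: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, value) pairs from a concatenation of TLVs."""
    offset = 0
    while offset < len(buf):
        if offset + 2 > len(buf):
            raise ValueError("truncated TLV header")
        tlv_type, length = struct.unpack_from("!BB", buf, offset)
        offset += 2
        if offset + length > len(buf):
            raise ValueError("truncated TLV value")
        yield tlv_type, buf[offset:offset + length]
        offset += length
```
]]></sourcecode>
        <t>A receiver that does not recognize a type can skip it by its length, which lets new TLVs be introduced without breaking existing parsers.</t>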
      </section>
    </section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>Given that IntelliNode introduces granular TLV fields for tensor semantics and active data-plane scheduling, the system MUST:</t>
      <ul spacing="normal">
        <li>
          <t>Provide integrity protection for TLV fields to prevent malicious nodes from tampering with "Tensor Types" to preempt high-priority queues.</t>
        </li>
        <li>
          <t>Use encrypted control-plane channels for telemetry and configuration.</t>
        </li>
        <li>
          <t>Implement authentication to prevent unauthorized nodes from falsely broadcasting their Compute-Capability within the computing network.</t>
        </li>
      </ul>
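      <t>One possible way to meet the TLV integrity requirement is a keyed message authentication code appended to the TLV bytes. The sketch below assumes a pre-shared key, HMAC-SHA-256, and a tag truncated to 16 octets; none of these choices, nor the function names, are specified by this document, and key distribution is out of scope of the example.</t>
      <sourcecode type="python"><![CDATA[
```python
import hashlib
import hmac

TAG_LEN = 16  # truncated tag length in octets; hypothetical choice


def protect(key: bytes, tlv_bytes: bytes) -> bytes:
    """Append a truncated HMAC-SHA-256 tag to the encoded TLVs."""
    tag = hmac.new(key, tlv_bytes, hashlib.sha256).digest()[:TAG_LEN]
    return tlv_bytes + tag


def verify(key: bytes, protected: bytes) -> bytes:
    """Check the tag and return the TLV bytes, or raise ValueError."""
    if len(protected) < TAG_LEN:
        raise ValueError("message shorter than the tag")
    tlv_bytes, tag = protected[:-TAG_LEN], protected[-TAG_LEN:]
    expected = hmac.new(key, tlv_bytes,
                        hashlib.sha256).digest()[:TAG_LEN]
    # Constant-time comparison avoids leaking how many octets match.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return tlv_bytes
```
]]></sourcecode>
      <t>A node that rewrites a tensor type in transit without the key produces a tag mismatch, so the modified TLV is rejected rather than used for queue selection.</t>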
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t>This document requests that IANA allocate new TLV types for AI-native CATS deployments, including but not limited to:</t>
      <ul spacing="normal">
        <li>
          <t>TENSOR_TYPE TLV</t>
        </li>
        <li>
          <t>TRAINING_PHASE TLV</t>
        </li>
        <li>
          <t>COMPUTE_CAPABILITY TLV</t>
        </li>
        <li>
          <t>PATH_PREDICTION TLV</t>
        </li>
      </ul>
    </section>
  </middle>
  <back>
  </back>

</rfc>
