<?xml version="1.0" encoding="US-ASCII"?>
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->

<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="std"
  consensus="true"
  docName="draft-camarillo-rtgwg-lsn-00"
  ipr="trust200902"
  obsoletes=""
  updates=""
  submissionType="IETF"
  xml:lang="en"
  version="3">

  <front>
    <title abbrev="Lightspeed Notification Protocol">Lightspeed Notification Protocol</title>
    
    <author fullname="Pablo Camarillo Garvia" initials="P" role="editor"
            surname="Camarillo">
      <organization>Cisco</organization>
      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>Spain</country>
        </postal>

        <email>pcamaril@cisco.com</email>
      </address>
    </author>

    <author fullname="Clarence Filsfils" initials="C" surname="Filsfils">
      <organization>Cisco</organization>

      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>Belgium</country>
        </postal>

        <email>cf@cisco.com</email>
      </address>
    </author>

    <author fullname="Nadav Chachmon" initials="N" surname="Chachmon">
      <organization>Cisco</organization>

      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>Israel</country>
        </postal>

        <email>nchachmo@cisco.com</email>
      </address>
    </author>

    <author fullname="Ofer Iny" initials="O" surname="Iny">
      <organization>Cisco</organization>

      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>Israel</country>
        </postal>

        <email>oiny@cisco.com</email>
      </address>
    </author>

    <author fullname="Yuanchao Su" initials="Y" surname="Su">
      <organization>Alibaba</organization>

      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>China</country>
        </postal>

        <email>yitai.syc@alibaba-inc.com</email>
      </address>
    </author>

    <author fullname="Roy Jiang" initials="R" surname="Jiang">
      <organization>Alibaba</organization>

      <address>
        <postal>
          <street/>
          <city/>
          <region/>
          <code/>
          <country>China</country>
        </postal>

        <email>royjiang@aliyun-inc.com</email>
      </address>
    </author>

    <date year="2026"/>

    <area>Routing</area>
    <workgroup>RTGWG</workgroup>

    <keyword>Global Adaptive Routing</keyword>

    <abstract>
      <t>This document defines the Lightspeed Notification Protocol (LSN), a hardware-accelerated signaling mechanism designed for sub-100 microsecond network convergence in AI/ML data center fabrics. By operating entirely within the forwarding plane, LSN bypasses traditional CPU-based latencies to propagate link failures and congestion via a hardware-efficient encoding. It serves as a high-speed complement to routing protocols like BGP, providing an immediate hardware "veto" to prune congested/failed paths while maintaining control-plane stability for path recovery.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="Intro" title="Introduction">
      <t>Artificial Intelligence (AI) and Machine Learning (ML) workloads impose stringent demands on data center fabrics, characterized by high-bandwidth, synchronized and bursty collective communication patterns. As outlined in <xref target="draft-clad-rtgwg-ipfrr-aiml" />, these environments require network convergence times in the sub-100 microsecond range to avoid performance degradation caused by packet loss and jitter.</t>

      <t>Traditional routing protocols and convergence mechanisms rely on control-plane intervention, where link state changes are processed by the CPU. This introduces activation latencies in the tens of milliseconds, which is orders of magnitude too slow for AI/ML requirements. Furthermore, existing mechanisms typically operate on a binary "up/down" model, failing to account for capacity degradation or congestion.</t>

      <t>This document defines the Lightspeed Notification Protocol (LSN), a hardware-accelerated notification protocol that operates entirely within the forwarding plane to handle both network failures and congestion events. LSN complements traditional routing protocols (e.g., BGP). The interaction between the routing protocol and this notification mechanism is functionally equivalent to a boolean AND operation: a path is eligible for forwarding only if it is permitted by the routing protocol AND confirmed healthy by the hardware notification.</t>

      <t>This enforces an asymmetric reaction: "bad news" (failure or congestion) acts as an immediate hardware veto, disabling the path in less than a microsecond. Conversely, "good news" (recovery) is gated by the routing protocol, ensuring the control plane has fully converged before traffic resumes. This approach combines sub-microsecond protection with routing coherence and stability.</t>

      <section title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described in BCP
        14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only
        when, they appear in all capitals, as shown here.</t>
      </section>
    </section>

    <section anchor="Problem" title="Problem Statement">
      <t>Current IP Fast Reroute (IPFRR) and resiliency mechanisms face fundamental limitations when applied to the extreme performance requirements of AI/ML fabrics.</t>

      <t><strong>Latency of CPU-Based Processing</strong></t>
      <t>Existing failure detection and propagation mechanisms generally trigger interrupts that must be processed by a line-card or system CPU. As noted in <xref target="draft-clad-rtgwg-ipfrr-aiml" />, this CPU-mediated path introduces an activation delay typically in the range of 10-50 milliseconds. For AI training workloads requiring barrier synchronization, this latency results in unacceptable flow disruption. To achieve the required sub-100 microsecond convergence, the propagation and processing of network state changes must occur with sub-microsecond latency, necessitating a solution implemented directly in hardware (NPU) logic without CPU intervention.</t>

      <t><strong>Inability to Handle Congestion and Partial Failures</strong></t>
      <t>Standard routing protocols and mechanisms like ECMP operate on a binary failure model: a link is either available or unavailable. However, AI fabrics frequently experience "brownout" scenarios, such as partial capacity reduction (e.g., loss of a single lane in a port group) or acute congestion. Current mechanisms lack the granularity to signal these states, causing traffic to be blackholed or hashed onto congested paths.</t>

      <t><strong>Limited Visibility of Remote Network State</strong></t>
      <t>In multi-stage Clos topologies and irregular scale-across networks, a failure or congestion event often occurs multiple hops away from the ingress point. Local protection mechanisms (like classic LFA) implemented at the point of failure are often insufficient or lead to suboptimal "hairpin" routing that increases latency. Current routing protocols do not provide ingress nodes with the real-time visibility into remote link states or congestion levels required to prevent traffic from entering compromised paths or to optimize load balancing across the fabric.</t>
    </section>

    <section anchor="Spec" title="Protocol Specification">
      <section anchor="format" title="Notification Packet Format">
        <t>The packet header is the following:</t>
            <figure align="left">
        <artwork align="left"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   MAC Destination Address (DA)                |
+-------------------------------+-------------------------------+
|      MAC DA (continued)       |    MAC Source Address (SA)    |
+-------------------------------+-------------------------------+
|                   MAC Source Address (continued)              |
+-------------------------------+-------------------------------+
|        EtherType (TBD1)       |         OpCode (TBD2)         |
+-------------------------------+-------------------------------+
|                                                               |
.                         LSN Payload                           .
.                                                               .
|                                                               |
+---------------------------------------------------------------+
|                           CRC (FCS)                           |
+---------------------------------------------------------------+
]]></artwork>
        </figure>

        <t>Notes:</t>
        <ul>
          <li>MAC Destination Address: 01-80-C2-00-00-01 (IEEE MAC-specific control protocols)</li>
          <li>Ethertype = TBD1 (Currently uses 0x8808 <xref target="IEEE802.3" /> MAC Control for interoperable running code)</li>
          <li>Opcode = TBD2 (Currently uses 0x5AA5 for interoperable running code)</li>
        </ul>
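        <t>As a non-normative illustration, the following Python sketch assembles the fixed part of the frame (DA, SA, EtherType, OpCode) and classifies a received frame on the same three fields. It uses the interim EtherType and OpCode values listed above, which stand in for TBD1 and TBD2. The function names are hypothetical and not part of this specification.</t>
        <sourcecode type="python"><![CDATA[
```python
import struct

# Values from the notes above. The EtherType and OpCode are the interim
# values used by running code, not final assignments (TBD1 / TBD2).
LSN_DEST_MAC = bytes.fromhex("0180C2000001")
LSN_ETHERTYPE = 0x8808
LSN_OPCODE = 0x5AA5


def build_lsn_header(src_mac: bytes) -> bytes:
    """Return the 16-byte DA | SA | EtherType | OpCode prefix."""
    if len(src_mac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    return LSN_DEST_MAC + src_mac + struct.pack(
        "!HH", LSN_ETHERTYPE, LSN_OPCODE)


def is_lsn_frame(frame: bytes) -> bool:
    """Classify a received frame on DA, EtherType and OpCode."""
    if len(frame) < 16:
        return False
    ethertype, opcode = struct.unpack("!HH", frame[12:16])
    return (frame[:6] == LSN_DEST_MAC
            and ethertype == LSN_ETHERTYPE
            and opcode == LSN_OPCODE)
```
]]></sourcecode>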

        <t>The LSN Payload is defined as follows:</t>
        <figure align="left">
        <artwork align="left"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type  |R|Msg| Rsv | Dev-Range |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
|                                                               |
|                   Reachable-Devices-Bitmap                    |
|                          (256 bits)                           |
|                                                               |
+                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
        </figure>

        <t>Field Descriptions:</t>
        <ul>
          <li>Header Type (4 bits): Set to 12, which identifies the packet as a topology reachability packet.</li>
          <li>R (1 bit): Reserved. It MUST be set to zero upon transmission and ignored upon receipt.</li>
          <li><t>Msg-Type (2 bits): Indicates the information conveyed by the bitmap, which is either reachability or congestion status. Congestion status is expressed as one of three congestion levels, whose meaning is implementation-specific. The following Msg-Type values are defined:</t>
            <ul>
              <li>0x0: Reachable</li>
              <li>0x1: Reachable and not congested (Congestion Level 1)</li>
              <li>0x2: Reachable and not congested (Congestion Level 2)</li>
              <li>0x3: Reachable and not congested (Congestion Level 3)</li>
            </ul></li>
          <li>Reserved (3 bits): Reserved. It MUST be set to zero upon transmission and ignored upon receipt.</li>
          <li><t>Reachable-Devices-Range (6 bits, shown as Dev-Range in the figure): Selects the range of devices covered by the bitmap. Each packet covers the devices in [Reachable-Devices-Range * 256, Reachable-Devices-Range * 256 + 255].</t>
            <ul>
              <li>&apos;0&apos;: Devices 0-255</li>
              <li>&apos;1&apos;: Devices 256-511</li>
              <li>&apos;2&apos;: Devices 512-767</li>
              <li>... (the 64 possible values cover up to 16,384 devices).</li>
            </ul></li>
          <li>Reachable-Devices-Bitmap (256 bits): One bit per device that indicates if the device is reachable through the device sending this message.</li>
        </ul>
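        <t>The field layout above can be sketched in Python as follows. This is a non-normative illustration under one stated assumption: the text does not define how device offsets map onto bit positions inside the bitmap, so the sketch treats offset 0 as the first bitmap bit on the wire. The function names are hypothetical.</t>
        <sourcecode type="python"><![CDATA[
```python
import struct

LSN_TYPE_REACHABILITY = 12   # Header Type value from the field list
BITMAP_BITS = 256


def pack_payload(msg_type: int, dev_range: int, reachable_offsets) -> bytes:
    """Encode Type|R|Msg|Rsv|Dev-Range followed by the 256-bit bitmap."""
    if not 0 <= msg_type <= 0x3 or not 0 <= dev_range <= 0x3F:
        raise ValueError("field out of range")
    # R and Rsv stay zero, as required on transmission.
    first16 = (LSN_TYPE_REACHABILITY << 12) | (msg_type << 9) | dev_range
    bitmap = 0
    for offset in reachable_offsets:
        if not 0 <= offset < BITMAP_BITS:
            raise ValueError("offset outside bitmap")
        # ASSUMPTION: offset 0 is the first bitmap bit on the wire.
        bitmap |= 1 << (BITMAP_BITS - 1 - offset)
    return struct.pack("!H", first16) + bitmap.to_bytes(32, "big")


def unpack_payload(payload: bytes):
    """Return (msg_type, first_device, set of reachable devices)."""
    if len(payload) < 34:
        raise ValueError("payload too short")
    (first16,) = struct.unpack("!H", payload[:2])
    if first16 >> 12 != LSN_TYPE_REACHABILITY:
        raise ValueError("unexpected Header Type")
    msg_type = (first16 >> 9) & 0x3       # R and Rsv are ignored
    base = (first16 & 0x3F) * BITMAP_BITS
    bitmap = int.from_bytes(payload[2:34], "big")
    devices = {base + i for i in range(BITMAP_BITS)
               if (bitmap >> (BITMAP_BITS - 1 - i)) & 1}
    return msg_type, base, devices
```
]]></sourcecode>
        <t>With these assumptions, a message with Reachable-Devices-Range 1 and offsets 0, 5 and 255 set decodes to devices 256, 261 and 511.</t>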
      </section>

      <section anchor="Proceadures" title="Origination and Processing Procedures">
        <t>This section defines the procedures for assigning node identifiers, generating notification messages, and processing received notifications in hardware.</t>

        <section anchor="Ident" title="Node Identification">
          <t>To facilitate efficient bitmap-based signaling, every leaf node within the fabric is assigned a unique <strong>Global Node ID</strong>. This identifier maps the node to a specific bit position within the notification bitmap.</t>

          <t>The allocation, synchronization, and lifecycle management of these Global Node IDs are performed by an external central controller or management plane. The specific mechanisms for ID allocation are outside the scope of this document.</t>
        </section>

        <section anchor="Origination" title="Node Origination">
          <t>Each node is responsible for periodically generating notification messages that convey the direct reachability or congestion state of potential destinations to the rest of the fabric.</t>

          <t><strong>Message Scope and Fragmentation</strong></t>

          <t>A single notification message supports a payload covering up to 256 destination devices.</t>
          <ul>
            <li>The message header MUST specify the Reachable-Devices-Range (or Base ID), indicating the starting Global Node ID for the bitmap payload.</li>
            <li>If the fabric scale (Radix) exceeds 256 nodes, the originating node MUST generate multiple distinct notification messages. Each message corresponds to a different, non-overlapping Reachable-Devices-Range to cover the full topology.</li>
          </ul>
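          <t>The fragmentation rule above can be sketched as follows. This non-normative Python example groups a set of reachable Global Node IDs into one entry per Reachable-Devices-Range. The function name and the dictionary representation are hypothetical.</t>
          <sourcecode type="python"><![CDATA[
```python
BITMAP_BITS = 256   # devices covered by one message
MAX_RANGES = 64     # values of the 6-bit Reachable-Devices-Range field


def split_into_ranges(reachable_ids, fabric_size):
    """Return {range index: sorted bit offsets}, one entry per message."""
    num_ranges = (fabric_size + BITMAP_BITS - 1) // BITMAP_BITS
    if not 0 < num_ranges <= MAX_RANGES:
        raise ValueError("fabric size not representable")
    # Every range gets a message, even if no device in it is reachable.
    messages = {r: [] for r in range(num_ranges)}
    for node_id in sorted(reachable_ids):
        if not 0 <= node_id < fabric_size:
            raise ValueError("node ID outside fabric")
        messages[node_id // BITMAP_BITS].append(node_id % BITMAP_BITS)
    return messages
```
]]></sourcecode>
          <t>For a 768-node fabric, for example, the sketch yields three messages with non-overlapping ranges 0, 1 and 2.</t>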

          <t><strong>Msg-Type and State Encoding</strong></t>
          <t>The interpretation of the bitmap payload is determined by the Msg-Type field in the header. The specific assignment of Msg-Type values is implementation-specific.</t>

          <ul>
            <li><t><strong>Reachability Information:</strong> When the Msg-Type indicates reachability, the bitmap encodes the binary status of the links:</t>
              <ul>
                <li><strong>Bit value 1:</strong> Indicates the destination node is directly reachable.</li>
                <li><strong>Bit value 0:</strong> Indicates the destination node is not reachable.</li>
              </ul></li>

            <li><strong>Congestion Information:</strong> When the Msg-Type indicates congestion, the bitmap conveys the congestion status of the associated paths. The mapping of queue congestion levels to the various congestion values is implementation-specific, provided it supports the hardware logic required for path pruning or de-prioritization.</li>
          </ul>

          <t>As noted above, notification messages are generated periodically. In addition, a significant change in the quality of a path (a congestion level change or an interface status change) also triggers a notification message.</t>
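          <t>The combination of periodic and triggered origination can be sketched as follows. This non-normative Python example rests on two assumptions that this document does not define: the refresh period is a local configuration parameter, and any change in the advertised state counts as significant. The function name is hypothetical.</t>
          <sourcecode type="python"><![CDATA[
```python
def should_originate(now, last_sent, period, last_state, current_state):
    """Decide whether to send a notification at time `now` (seconds).

    ASSUMPTION: `period` is locally configured, and any difference
    between the last advertised state and the current state is
    treated as a significant change.
    """
    if current_state != last_state:
        return True                      # triggered update
    return now - last_sent >= period     # periodic refresh
```
]]></sourcecode>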
        </section>

        <section anchor="process" title="Hardware Processing and Aggregation">
          <t>Upon receipt of a notification message, the receiving node MUST process the packet directly in the forwarding hardware without punting to the control plane CPU.</t>

          <t><strong>State Aggregation</strong></t>
          <t>The receiving node maintains a global state table representing the health of the fabric. When a message is received:</t>
          <ol>
            <li>The hardware identifies the segment of the global state corresponding to the Reachable-Devices-Range in the packet.</li>
            <li>The received bitmap is overlaid onto the local state, updating the reachability or congestion status for the specific subset of nodes.</li>
          </ol>
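          <t>The overlay step can be sketched as follows. This non-normative Python example makes three assumptions: state is kept per sending neighbor, bit N of each stored integer stands for Global Node ID N, and a neighbor from which nothing has been received is treated as reporting no reachable nodes. The class and method names are hypothetical.</t>
          <sourcecode type="python"><![CDATA[
```python
BITMAP_BITS = 256


class NotificationState:
    """Aggregated reachability state, one wide bitmap per sender."""

    def __init__(self):
        # ASSUMPTION: bit N of each integer stands for Global Node ID N.
        self._by_sender = {}

    def overlay(self, sender, dev_range, bitmap):
        """Replace the 256-bit segment selected by dev_range."""
        if not 0 <= bitmap < (1 << BITMAP_BITS):
            raise ValueError("bitmap must fit in 256 bits")
        shift = dev_range * BITMAP_BITS
        segment_mask = ((1 << BITMAP_BITS) - 1) << shift
        current = self._by_sender.get(sender, 0)
        # Clear the old segment, then write the received bitmap into it.
        self._by_sender[sender] = (current & ~segment_mask) | (bitmap << shift)

    def is_reachable(self, sender, node_id):
        # ASSUMPTION: unknown senders report nothing as reachable.
        return bool((self._by_sender.get(sender, 0) >> node_id) & 1)
```
]]></sourcecode>
          <t>Only the segment named by the Reachable-Devices-Range is rewritten, so a message for range 0 leaves the state learned for range 1 untouched.</t>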

          <t><strong>Forwarding Decision Logic</strong></t>
          <t>To derive the final forwarding decision, the hardware performs a bitwise operation between the standard routing table and the aggregated notification state.</t>
          <ul>
            <li>For reachability, this is functionally a <strong>Bitwise AND</strong> operation: A path is valid if and only if the Routing Protocol has installed it (Logical 1) <strong>AND</strong> the Notification Protocol reports it as reachable (Logical 1).</li>
            <li>If the result of this operation is 0 for a specific next-hop or path, the hardware immediately removes that path from the Equal-Cost Multi-Path (ECMP) set or activates a pre-programmed backup path.</li>
          </ul>
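          <t>The decision logic above can be sketched as follows. In this non-normative Python example, bit i of each mask stands for ECMP member i of one destination. The flow hash and the backup path are placeholders, and the function names are hypothetical.</t>
          <sourcecode type="python"><![CDATA[
```python
def eligible_mask(routing_mask: int, notification_mask: int) -> int:
    """A member is eligible only if both sources report 1."""
    return routing_mask & notification_mask


def select_next_hop(routing_mask, notification_mask, flow_hash, backup=None):
    """Pick an ECMP member for a flow, or fall back to a backup path."""
    mask = eligible_mask(routing_mask, notification_mask)
    members = [i for i in range(mask.bit_length()) if (mask >> i) & 1]
    if not members:
        return backup    # every member was vetoed or withdrawn
    return members[flow_hash % len(members)]
```
]]></sourcecode>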

        </section>

      </section>
    </section>

    <section anchor="DC" title="Illustration: Usage in the DC">
      <t>This section illustrates the operation of the Hardware-Accelerated Notification Protocol within a massive-scale, 2-tier Spine-Leaf data center topology. This example focuses exclusively on <strong>reachability</strong> signaling to demonstrate the bitwise logic used to prune invalid paths.</t>

      <section anchor="topo" title="Topology and ID Assignment">
        <t>Consider a fully connected Clos fabric consisting of <strong>256 Spine switches</strong> and <strong>256 Leaf switches</strong>.</t>
        <ul>
          <li><strong>Spine Naming Terminology:</strong> The Spines are identified as SA, SB, SC, ..., SZ, SAA, SAB, ..., SIU, SIV.</li>
          <li><strong>Leaf Naming &amp; Global IDs:</strong> The Leaves are identified as L0, L1, L2, ... through L255. Each Leaf is assigned a Global Node ID corresponding to its index (e.g., Leaf_5 has ID 5).</li>
          <li><strong>Traffic Flow:</strong> An ingress Leaf (L_Ingress) sends traffic to a destination Leaf_5.</li>
          <li><strong>Routing State:</strong> Under normal operation, BGP advertises that Leaf_5 is reachable via all 256 Spines. The Ingress Leaf maintains an ECMP group for Leaf_5 containing 256 next-hops: {SA, SB, SC, ... SIV}.</li>
        </ul>
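        <t>The spine naming sequence above follows spreadsheet-style column lettering prefixed with "S". The following non-normative Python sketch generates the 256 names and confirms that the sequence ends at SIV. The function name is hypothetical.</t>
        <sourcecode type="python"><![CDATA[
```python
def spine_name(index: int) -> str:
    """Name of the spine at 0-based index: SA, SB, ..., SZ, SAA, ..."""
    n = index + 1
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return "S" + letters


SPINES = [spine_name(i) for i in range(256)]
# Under normal operation the ECMP group for Leaf_5 holds every spine.
ECMP_GROUP_LEAF_5 = list(SPINES)
```
]]></sourcecode>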
      </section>

      <section anchor="failure" title="Failure Scenario">
        <t>A physical link failure occurs between Spine_A and the destination Leaf_5.</t>
        <ol>
          <li><t><strong>Detection and Generation:</strong> Spine_A detects that the port connected to Leaf_5 is down. The hardware at Spine_A immediately generates a notification packet to be transmitted to all connected leaves (including L_Ingress).</t>
            <ul>
              <li><strong>MAC Destination Address:</strong> 01-80-C2-00-00-01 (MAC Control)</li>
              <li><strong>MAC Source Address:</strong> Spine_A</li>
              <li><strong>Reachable-Devices-Range (Base ID):</strong> 0 (Covering Leaves 0-255).</li>
              <li><strong>MSG-Type:</strong> Reachability.</li>
              <li><strong>Bitmap Payload:</strong> A 256-bit string where the bit at index 5 is set to 0 (Unreachable), and all other connected leaves are set to 1 (Reachable).</li>
            </ul>
            <figure align="left">
            <artwork align="left"><![CDATA[
Bitmap from Spine_A: [1, 1, 1, 1, 1, 0, 1, ... 1]
                                     ^
                                  Index 5 (L5) set to 0
]]></artwork>
            </figure>            
          </li>

          <li><t><strong>Hardware Processing at Ingress:</strong> The Ingress Leaf (L_Ingress) receives the notification from Spine_A. The forwarding engine performs a logical <strong>AND</strong> operation between the Routing Table state and the received Notification state specifically for the path via Spine_A.</t>
            <ul>
              <li><t><strong>Path via Spine_A:</strong></t>
                <ul>
                  <li>Routing Table State for Leaf_5: 1 (Control plane has not yet converged; Leaf_5 is still considered reachable).</li>
                  <li>Notification State from Spine_A for Leaf_5: 0 (Hardware Notification reported Down).</li>
                  <li><strong>Result:</strong> 1 AND 0 = 0 (Path Invalid).</li>
                </ul>
              </li>

              <li><t><strong>Path via Spine_B (and others):</strong></t>
                <ul>
                  <li>Routing Table State for Leaf_5: 1.</li>
                  <li>Notification State from Spine_B for Leaf_5: 1 (No failure reported by Spine_B).</li>
                  <li><strong>Result:</strong> 1 AND 1 = 1 (Path Valid).</li>
                </ul>
              </li>
            </ul>
          </li>

          <li><strong>Forwarding Result:</strong> Because the result for Spine_A is 0, the forwarding hardware excludes Spine_A from the ECMP group for destination Leaf_5. Traffic to Leaf_5 is instantly rebalanced across the remaining 255 Spines (B...IV). Balancing of traffic to the other leaves is unchanged.</li>
        </ol>

        <t>This mechanism ensures sub-microsecond protection. When the link Spine_A-Leaf_5 is eventually restored, Spine_A will send a notification message with an updated bitmap where index 5 is set to 1. However, traffic will not resume via Spine_A until BGP also re-installs the route, satisfying the 1 AND 1 condition.</t>
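        <t>The scenario can be replayed end to end with the following non-normative Python sketch, which tracks the size of the ECMP group for Leaf_5 after each event. The BGP withdrawal in step 2 is an assumed intermediate event that the scenario text implies but does not list as a step. The function names are hypothetical.</t>
        <sourcecode type="python"><![CDATA[
```python
NUM_SPINES = 256
NUM_LEAVES = 256
FAILED_SPINE = 0   # Spine_A
DEST_LEAF = 5      # Leaf_5


def ecmp_group(routing, notified, leaf):
    """Spines installed by routing AND reported reachable by LSN."""
    return [s for s in range(NUM_SPINES)
            if routing[s][leaf] and notified[s][leaf]]


def replay_scenario():
    """Return the ECMP group size for Leaf_5 after each event."""
    routing = [[True] * NUM_LEAVES for _ in range(NUM_SPINES)]
    notified = [[True] * NUM_LEAVES for _ in range(NUM_SPINES)]
    sizes = [len(ecmp_group(routing, notified, DEST_LEAF))]

    # 1. Spine_A reports Leaf_5 unreachable: immediate hardware veto.
    notified[FAILED_SPINE][DEST_LEAF] = False
    sizes.append(len(ecmp_group(routing, notified, DEST_LEAF)))

    # 2. ASSUMED step: BGP later withdraws the route via Spine_A.
    routing[FAILED_SPINE][DEST_LEAF] = False
    sizes.append(len(ecmp_group(routing, notified, DEST_LEAF)))

    # 3. Link restored: Spine_A advertises 1, but BGP has not converged.
    notified[FAILED_SPINE][DEST_LEAF] = True
    sizes.append(len(ecmp_group(routing, notified, DEST_LEAF)))

    # 4. BGP re-installs the route: 1 AND 1, so the path is used again.
    routing[FAILED_SPINE][DEST_LEAF] = True
    sizes.append(len(ecmp_group(routing, notified, DEST_LEAF)))
    return sizes
```
]]></sourcecode>
        <t>The group shrinks as soon as the notification arrives and grows back only after both the notification and the routing protocol report the path as usable.</t>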
      </section>
    </section>

    <section anchor="DCI" title="Illustration: Usage across DCs">
      <t>Future revisions of this document will describe how the mechanism defined here can be used for DCI/regional networks.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>The Lightspeed Notification Protocol (LSN) is designed to achieve sub-100 microsecond convergence directly in the forwarding hardware. To achieve this performance, LSN messages do not include cryptographic authentication or integrity checks. Consequently, the protocol introduces several security considerations that must be mitigated by the deployment architecture.</t>

      <t><strong>Spoofing and Denial of Service (DoS):</strong> Because LSN messages are unauthenticated, an attacker capable of injecting forged LSN frames into the fabric could broadcast false congestion or failure notifications. This would cause receiving nodes to prune valid paths from their forwarding tables, potentially leading to forced congestion, suboptimal routing, or a complete denial of service. </t>

      <t><strong>Mitigation via Boundary Filtering:</strong> LSN is intended exclusively for internal use within highly controlled, single-domain AI/ML data center fabrics. To prevent spoofing, boundary nodes and leaf switches MUST implement strict port-level filtering. LSN notification packets MUST be dropped unconditionally on any port facing a server, host, or external network. LSN processing MUST only be enabled on trusted, switch-to-switch infrastructure links.</t>
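      <t>The filtering rule above can be sketched as a per-port ingress decision. This non-normative Python example assumes that each port carries a configured role. The role names and the function name are hypothetical.</t>
      <sourcecode type="python"><![CDATA[
```python
from enum import Enum


class PortRole(Enum):
    # ASSUMPTION: roles are assigned by configuration.
    FABRIC = "fabric"       # trusted switch-to-switch link
    HOST = "host"           # server- or host-facing port
    EXTERNAL = "external"   # port facing an external network


def lsn_ingress_action(role: PortRole, is_lsn_frame: bool) -> str:
    """Return 'process', 'drop' or 'forward' for a received frame."""
    if not is_lsn_frame:
        return "forward"     # not LSN: normal forwarding applies
    if role is PortRole.FABRIC:
        return "process"     # LSN is enabled only on trusted links
    return "drop"            # unconditional drop everywhere else
```
]]></sourcecode>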

      <t><strong>Fail-Safe Routing Logic:</strong> As defined in this document, LSN acts only as a hardware "veto" (a logical AND operation with the routing protocol). While an attacker can maliciously prune a path, they cannot force the network to forward traffic into a blackhole or an invalid topology segment, because reachability requires the routing protocol (e.g., BGP) to also authorize the path. The control plane remains the ultimate source of truth for topology validation.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document requests the following allocations.</t>
      
      <ul>
        <li>
          <strong>Ethertype:</strong> A new Ethertype for the Lightspeed Notification Protocol (LSN).
        </li>
        <li>
          <strong>Opcode:</strong> A new Opcode for LSN Bitmap Reachability.
        </li>
      </ul>
      
      <t>Upon assignment of these values, the RFC Editor is requested to replace all instances of "TBD" for the Ethertype and Opcode with the newly allocated hexadecimal values.</t>
    </section>

    <section anchor="ACK" title="Acknowledgements">
      <t>The authors would like to acknowledge the following people: Kris Michielsen.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
      <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
    </references>

    <references title="Informative References">
      <reference anchor="draft-clad-rtgwg-ipfrr-aiml" target="https://datatracker.ietf.org/doc/draft-clad-rtgwg-ipfrr-aiml/">
        <front>
          <title>IP Fast Reroute for AI/ML Fabrics</title>
          <author fullname="Francois Clad" initials="F." surname="Clad"/>
          <author fullname="Clarence Filsfils" initials="C." surname="Filsfils"/>
          <author fullname="Roy Jiang" initials="R." surname="Jiang"/>
          <author fullname="Dennis Cai" initials="D." surname="Cai"/>
          <date month="March" day="2" year="2026"/>
        </front>
      </reference>

      <reference anchor="IEEE802.3"
                 target="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=9844436">
        <front>
          <title>IEEE Standard for Ethernet; IEEE Std 802.3-2022</title>

          <author>
            <organization>IEEE Computer Society</organization>
          </author>

          <date month="May" day="13" year="2022"/>
        </front>
      </reference>
      
    </references>
  </back>
</rfc>
