<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY RFC2119 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
  <!ENTITY RFC3168 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml">
  <!ENTITY RFC4443 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4443.xml">
  <!ENTITY RFC4884 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.4884.xml">
  <!ENTITY RFC8174 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml">
  <!ENTITY RFC8754 SYSTEM "https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8754.xml">
]>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="std"
     docName="draft-tian-ccwg-long-haul-cnp-00"
     ipr="trust200902"
     submissionType="IETF"
     consensus="true"
     version="3">

  <front>
    <title abbrev="Long-haul CNP">Multi-level Congestion Response
    Framework with Long-haul Congestion Notification for DCI
    Networks</title>

    <seriesInfo name="Internet-Draft"
                value="draft-tian-ccwg-long-haul-cnp-00"/>

    <author fullname="Yuchi Tian" initials="Y." surname="Tian">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>tianyuchi@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Jin Yang" initials="J." surname="Yang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>yangjin@chinamobile.com</email>
      </address>
    </author>

    <author fullname="Weiqiang Cheng" initials="W." surname="Cheng">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>chengweiqiang@chinamobile.com</email>
      </address>
    </author>


    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>wangjj@centec.com</email>
      </address>
    </author>

    <author fullname="Guoying Zhang" initials="G." surname="Zhang">
      <organization>Centec</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>zhanggy@centec.com</email>
      </address>
    </author>

    <author fullname="Kan Zhang" initials="K." surname="Zhang">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>zhangkan@chinamobile.com</email>
      </address>
    </author>

    <date year="2026" month="February" day="27"/>

    <area>Transport</area>
    <workgroup>CCWG</workgroup>

    <keyword>congestion control</keyword>
    <keyword>RDMA</keyword>
    <keyword>RoCEv2</keyword>
    <keyword>Long-haul CNP</keyword>
    <keyword>DCI</keyword>

    <abstract>
      <t>This document specifies a multi-level congestion response
      framework and an associated Long-haul Congestion Notification
      Packet (Long-haul CNP) for Data Center Interconnect (DCI)
      wide-area network scenarios. The framework defines a graduated
      congestion response mechanism: lightweight ECN marking for
      incipient congestion and device-originated Long-haul CNP for
      severe or rapidly worsening congestion. Long-haul CNP packets
      carry explicit control instructions (e.g., rate reduction
      percentage, pause duration) and are sent directly by
      congestion-aware intermediate nodes to the traffic source via
      unicast, reducing feedback latency compared to receiver-mediated
      congestion notification. The document also specifies a
      multi-device collaborative suppression mechanism and
      BDP-adaptive dynamic threshold calculation for long-haul links.
      Two packet encapsulation formats are defined: an ICMPv6
      extension and a RoCEv2 backward-compatible extension.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="introduction" numbered="true" toc="default">
      <name>Introduction</name>

      <t>RDMA over Converged Ethernet v2 (RoCEv2) is widely deployed in
      data center networks for high-performance computing and AI
      training workloads. Within a single data center, ECN-based
      mechanisms such as DCQCN
      <xref target="DCQCN"/> provide
      effective congestion control. However, when RoCEv2 traffic traverses
      Data Center Interconnect (DCI) wide-area networks, the existing
      congestion notification path ("switch marks ECN, receiver
      generates CNP, CNP returns to sender") introduces feedback
      latency proportional to the WAN round-trip time, which can reach
      tens of milliseconds.</t>

      <t>Recent work on Fast CNP
      <xref target="I-D.xiao-rtgwg-rocev2-fast-cnp"/> has addressed
      the fundamental latency issue by enabling switches to generate
      CNP packets directly to the sender. This document builds upon
      that foundation by specifying three complementary mechanisms that
      are particularly relevant for long-haul DCI scenarios:</t>

      <ol>
        <li>A graduated multi-level trigger framework that
        distinguishes between incipient and severe congestion, avoiding
        unnecessary control packet generation for transient queue
        buildup.</li>
        <li>A BDP-adaptive dynamic threshold calculation that
        automatically adjusts congestion detection sensitivity based on
        link characteristics (bandwidth, distance, RTT).</li>
        <li>A multi-device collaborative suppression mechanism that
        coordinates congestion responses across multiple intermediate
        nodes on a data flow path, preventing redundant or conflicting
        control instructions.</li>
      </ol>

      <t>Additionally, this document defines an extended packet format
      (Long-haul CNP) that carries explicit control instructions with
      quantified congestion metrics, enabling the source to perform
      precise rate adjustments rather than relying on generic rate
      reduction heuristics.</t>

      <section anchor="requirements-language" numbered="true"
               toc="default">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
        "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
        RECOMMENDED", "MAY", and "OPTIONAL" in this document are to
        be interpreted as described in BCP 14 <xref target="RFC2119"/>
        <xref target="RFC8174"/> when, and only when, they appear in
        all capitals, as shown here.</t>
      </section>
    </section>

    <section anchor="terminology" numbered="true" toc="default">
      <name>Terminology</name>
      <dl>
        <dt>RDMA:</dt>
        <dd>Remote Direct Memory Access.</dd>
        <dt>RoCEv2:</dt>
        <dd>RDMA over Converged Ethernet version 2
        <xref target="RoCEv2"/>.</dd>
        <dt>CNP:</dt>
        <dd>Congestion Notification Packet, as defined in the RoCEv2
        specification. The standard CNP uses BTH Opcode 0x81.</dd>
        <dt>Fast CNP:</dt>
        <dd>Fast Congestion Notification Packet, a switch-originated
        CNP as defined in
        <xref target="I-D.xiao-rtgwg-rocev2-fast-cnp"/>.</dd>
        <dt>Long-haul CNP:</dt>
        <dd>Long-haul Congestion Notification Packet, the extended
        packet type defined in this document, carrying explicit
        control instructions and congestion metrics.</dd>
        <dt>ECN:</dt>
        <dd>Explicit Congestion Notification
        <xref target="RFC3168"/>.</dd>
        <dt>QP:</dt>
        <dd>Queue Pair, a fundamental RDMA communication
        abstraction.</dd>
        <dt>BTH:</dt>
        <dd>Base Transport Header, the common header in all RoCEv2
        packets.</dd>
        <dt>RTT:</dt>
        <dd>Round-Trip Time.</dd>
        <dt>PFC:</dt>
        <dd>Priority-based Flow Control.</dd>
        <dt>DCI:</dt>
        <dd>Data Center Interconnect.</dd>
        <dt>BDP:</dt>
        <dd>Bandwidth-Delay Product.</dd>
        <dt>Congestion-Aware Intermediate Node:</dt>
        <dd>A network device deployed on the DCI path that implements
        the multi-level congestion monitoring and Long-haul CNP
        generation functions specified in this document. This may be a
        router, switch, or dedicated DCI gateway device.</dd>
      </dl>
    </section>

    <section anchor="related-work" numbered="true" toc="default">
      <name>Related Work and Positioning</name>

      <t>This section describes the relationship between this document
      and existing congestion notification mechanisms for RoCEv2
      networks.</t>

      <t><xref target="I-D.xiao-rtgwg-rocev2-fast-cnp"/> defines the
      Fast CNP mechanism, which enables a switch to generate a CNP
      packet directly to the sender when it detects congestion, without
      waiting for the receiver to generate the CNP. Fast CNP provides
      the foundational mechanism for switch-originated congestion
      notification.</t>

      <t>This document extends the Fast CNP concept in three
      directions:</t>

      <dl>
        <dt>Graduated Response:</dt>
        <dd>While Fast CNP triggers upon detecting congestion, this
        document defines a two-level response where lightweight ECN
        marking handles incipient congestion and Long-haul CNP
        generation is reserved for severe or rapidly worsening
        conditions. This reduces control packet overhead in mildly
        congested scenarios.</dd>

        <dt>Instructional Control:</dt>
        <dd>While Fast CNP advises the sender to reduce its rate, the
        Long-haul CNP packet format defined here carries explicit
        action codes (notify, pause, rate reduce, resume) with
        quantified parameters (e.g., reduction percentage, pause
        duration) and congestion metrics, enabling more precise
        source-side rate adjustment.</dd>

        <dt>Multi-device Coordination:</dt>
        <dd>In DCI scenarios where a flow traverses multiple
        congestion-aware nodes, this document defines coordination
        rules to prevent duplicate or conflicting control instructions
        from reaching the same source.</dd>
      </dl>

      <t>The mechanisms defined in this document are complementary to
      SRv6-based congestion control approaches such as those described
      in <xref target="I-D.liu-spring-srv6-cc"/> and
      <xref target="I-D.hu-rtgwg-rocev2-fcn"/>. When used within
      SRv6-based DCI networks <xref target="RFC8754"/>, the
      Long-haul CNP can be encapsulated within the applicable SRv6
      transport framework.</t>

      <t>RTT-based congestion control approaches such as TIMELY
      <xref target="TIMELY"/> provide an alternative signal (delay)
      for inferring congestion severity; the Congestion Metric field
      defined in this document can optionally carry RTT-derived
      information to complement queue-based metrics.</t>
    </section>

    <section anchor="applicability" numbered="true" toc="default">
      <name>Applicability Statement</name>

      <t>The mechanisms specified in this document are primarily
      designed for Data Center Interconnect (DCI) scenarios where
      RoCEv2 traffic traverses wide-area network paths with
      non-trivial propagation delay (typically RTT greater than 1 ms).
      In such environments, the receiver-mediated CNP feedback path
      introduces significant latency, and the BDP-adaptive threshold
      mechanism provides meaningful dynamic range.</t>

      <t>For intra-data-center deployments where RTT is sub-millisecond
      and paths traverse few hops, the standard ECN/CNP or Fast CNP
      mechanisms are typically sufficient, and the additional complexity
      of the multi-level framework may not be warranted.</t>

      <t>The multi-device collaborative suppression mechanism is most
      beneficial when data flows traverse two or more congestion-aware
      intermediate nodes, which is common in multi-hop DCI
      topologies.</t>

      <t>Regarding IP version applicability: the ICMPv6 packet format
      defined in <xref target="icmpv6-format"/> is applicable only to
      IPv6 network deployments. For DCI environments that operate over
      IPv4, implementations MUST use the RoCEv2 backward-compatible
      extension format defined in <xref target="rocev2-format"/>. A
      future document may define an ICMPv4-based format if there is
      sufficient demand for ICMP-based Long-haul CNP in IPv4-only
      deployments.</t>
    </section>

    <section anchor="protocol-specification" numbered="true"
             toc="default">
      <name>Protocol Specification</name>

      <section anchor="architecture-overview" numbered="true"
               toc="default">
        <name>Architecture Overview</name>

        <t>The following diagram illustrates a typical DCI topology
        where this mechanism operates:</t>

        <figure anchor="fig-arch">
          <name>DCI Network Topology with Congestion-Aware Nodes</name>
          <artwork type="ascii-art"><![CDATA[
  +------+     +------+                    +------+     +------+
  |Source|---->| Node |======WAN Path=====>| Node |---->| Dest |
  | NIC  |     |  N1  |                    |  N2  |     | NIC  |
  +------+     +------+                    +------+     +------+
     ^            |                           |
     |            | Long-haul CNP             |
     +------------+ (unicast to source)       |
     |                                        |
     +----------------------------------------+
       Long-haul CNP (if N1 control insufficient)
]]></artwork>
        </figure>

        <t>The mechanism operates as follows:</t>

        <ol>
          <li>Congestion-aware intermediate nodes (N1, N2) learn
          flow state by inspecting traversing RoCEv2 data
          packets.</li>
          <li>Each node continuously monitors egress queue status
          using multiple congestion indicators.</li>
          <li>When incipient congestion is detected (queue depth
          exceeds K_min), the node applies ECN marking to traversing
          data packets (first-level response).</li>
          <li>When severe congestion is detected (any second-level
          condition is met), the node generates a Long-haul CNP
          packet and sends it via unicast to the traffic source
          (second-level response).</li>
          <li>The source parses the Long-haul CNP and adjusts the
          sending rate of the indicated QP according to the
          control instruction.</li>
          <li>If multiple nodes detect congestion for the same flow,
          the collaborative suppression mechanism coordinates their
          responses.</li>
        </ol>
      </section>

      <section anchor="flow-table" numbered="true" toc="default">
        <name>Flow Table Learning and Maintenance</name>

        <t>A congestion-aware intermediate node MUST parse the IP
        header and the Base Transport Header (BTH) of traversing
        RoCEv2 data packets and extract the following flow
        identification information: Source IP Address, Destination IP
        Address, Source QP Number, and Destination QP Number.</t>

        <t>The node SHOULD maintain a flow table with one entry per
        unique four-tuple (Source IP, Destination IP, Source QP,
        Destination QP). Flow table entries MUST be updated upon
        each matching packet observation. Entries MAY be subject to
        an aging timer; when no matching packets are observed within
        the configured aging period, the entry SHOULD be removed.</t>

        <t>The flow table is used to construct Long-haul CNP packets
        with the correct addressing information when congestion is
        detected.</t>
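
        <t>The learning and aging behavior described above can be
        sketched as follows. This is a minimal illustration, not a
        normative implementation: the class and method names, the
        byte counter, and the 1-second default aging period are
        assumptions introduced for this example only.</t>

        <sourcecode type="python"><![CDATA[
```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Flow key: (source IP, destination IP, source QP, destination QP).
FlowKey = Tuple[str, str, int, int]


@dataclass
class FlowEntry:
    last_seen: float      # time of the most recent matching packet
    bytes_seen: int = 0   # illustrative counter, not required by the text


class FlowTable:
    """Hypothetical flow table with an optional aging period."""

    def __init__(self, aging_period_s: float = 1.0) -> None:
        # The 1-second default is an assumption, not a recommended value.
        self.aging_period_s = aging_period_s
        self.entries: Dict[FlowKey, FlowEntry] = {}

    def observe(self, key: FlowKey, length: int,
                now: Optional[float] = None) -> None:
        """Create or refresh the entry for each matching packet."""
        now = time.monotonic() if now is None else now
        entry = self.entries.setdefault(key, FlowEntry(last_seen=now))
        entry.last_seen = now
        entry.bytes_seen += length

    def expire(self, now: Optional[float] = None) -> int:
        """Remove entries idle longer than the aging period."""
        now = time.monotonic() if now is None else now
        stale = [k for k, e in self.entries.items()
                 if now - e.last_seen > self.aging_period_s]
        for k in stale:
            del self.entries[k]
        return len(stale)
```
]]></sourcecode>

        <t>In this sketch, observe() would be called for every
        traversing RoCEv2 data packet and expire() from a periodic
        housekeeping task.</t>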
      </section>

      <section anchor="congestion-monitoring" numbered="true"
               toc="default">
        <name>Multi-level Congestion Monitoring</name>

        <section anchor="monitoring-metrics" numbered="true"
                 toc="default">
          <name>Monitoring Metrics</name>

          <t>Congestion-aware intermediate nodes MUST continuously
          monitor the following metrics on each egress port:</t>

          <dl>
            <dt>Queue Depth (QD):</dt>
            <dd>The instantaneous volume of data buffered in the
            egress queue, measured in bytes or cells.</dd>

            <dt>ECN Marking Rate (EMR):</dt>
            <dd>The fraction of data packets forwarded on the egress
            port during a measurement interval that are marked with
            the ECN CE codepoint.</dd>

            <dt>Queue Growth Rate (QGR):</dt>
            <dd>The rate of change of queue depth over a
            configurable measurement interval, indicating whether
            congestion is building or subsiding.</dd>
          </dl>

          <t>The node SHOULD also maintain an estimate of the
          link round-trip time (RTT_est) for each egress port, which
          MAY be obtained through control plane configuration,
          active probing, or receiver reporting.</t>
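
          <t>As an illustration, QGR and EMR can be derived from two
          consecutive samples of per-port counters. The sketch below
          assumes cumulative packet counters and a byte-based queue
          depth; the PortSample structure and function names are
          hypothetical and not defined by this document.</t>

          <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class PortSample:
    queue_bytes: int      # instantaneous queue depth (QD)
    packets_total: int    # cumulative data packets sent on the port
    packets_marked: int   # cumulative packets marked CE on the port


def queue_growth_rate(prev: PortSample, cur: PortSample,
                      interval_s: float) -> float:
    """QGR in bytes per second; negative when the queue drains."""
    return (cur.queue_bytes - prev.queue_bytes) / interval_s


def ecn_marking_rate(prev: PortSample, cur: PortSample) -> float:
    """EMR as the fraction of packets marked during the interval."""
    sent = cur.packets_total - prev.packets_total
    marked = cur.packets_marked - prev.packets_marked
    return marked / sent if sent > 0 else 0.0
```
]]></sourcecode>

          <t>A positive QGR indicates that congestion is building,
          and a negative value indicates that it is subsiding.</t>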
        </section>

        <section anchor="dynamic-threshold" numbered="true"
                 toc="default">
          <name>BDP-Adaptive Dynamic Threshold Calculation</name>

          <t>To account for the wide variation in link
          characteristics across DCI paths, the upper queue depth
          threshold K_max MUST be dynamically calculated based on
          the Bandwidth-Delay Product (BDP) of the link:</t>

          <artwork><![CDATA[
   K_max = max(K_base, alpha * R_port * RTT_est / 8)
]]></artwork>

          <dl>
            <dt>R_port:</dt>
            <dd>The egress port rate in bits per second.</dd>
            <dt>RTT_est:</dt>
            <dd>The estimated round-trip time in seconds.</dd>
            <dt>K_base:</dt>
            <dd>A baseline queue threshold providing a minimum
            sensitivity floor for short-distance or low-speed
            links. The value is implementation-specific but SHOULD
            correspond to at least one maximum-sized RoCEv2
            frame.</dd>
            <dt>alpha:</dt>
            <dd>An adjustment coefficient. The RECOMMENDED default
            value is 1.0. Implementations MAY allow this to be
            configured per-port or per-link.</dd>
          </dl>

          <t>The minimum threshold K_min, used for first-level ECN
          marking, MUST be configured to a value less than K_max.
          A RECOMMENDED default is K_min = K_max / 2.</t>

          <t>Implementations SHOULD recalculate K_max periodically
          or upon RTT_est changes, to adapt to evolving link
          conditions.</t>
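
          <t>As a worked example of the formula above, a 100 Gbit/s
          port with RTT_est = 10 ms and alpha = 1.0 yields
          100e9 * 0.010 / 8 = 125,000,000 bytes, so K_max is about
          125 MB and the default K_min is about 62.5 MB. The sketch
          below computes the same values; the function names and the
          K_base value of 4096 bytes are assumptions chosen for
          illustration.</t>

          <sourcecode type="python"><![CDATA[
```python
def k_max_bytes(r_port_bps: float, rtt_est_s: float,
                k_base_bytes: float, alpha: float = 1.0) -> float:
    """K_max = max(K_base, alpha * R_port * RTT_est / 8), in bytes."""
    return max(k_base_bytes, alpha * r_port_bps * rtt_est_s / 8)


def k_min_bytes(k_max: float) -> float:
    """Default first-level threshold: K_min = K_max / 2."""
    return k_max / 2


# Long-haul link: the BDP term dominates.
long_haul = k_max_bytes(100e9, 0.010, k_base_bytes=4096)
# Very short link: the BDP term is 1250 bytes, so K_base applies.
short_link = k_max_bytes(10e9, 1e-6, k_base_bytes=4096)
```
]]></sourcecode>

          <t>The second call shows the role of K_base: on a short
          link the BDP term falls below the floor and K_base is used
          instead.</t>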
        </section>
      </section>

      <section anchor="trigger-response" numbered="true"
               toc="default">
        <name>Multi-level Trigger Response</name>

        <t>This document defines two levels of congestion
        response:</t>

        <dl>
          <dt>First-level response (ECN marking):</dt>
          <dd>When the queue depth QD exceeds K_min, the node MUST
          apply ECN marking (setting the CE codepoint per
          <xref target="RFC3168"/>) to traversing data packets.
          This activates the standard end-to-end ECN/CNP feedback
          loop and does not generate any additional control
          packets. First-level response operates as a lightweight,
          low-overhead mechanism for handling transient or mild
          congestion.</dd>

          <dt>Second-level response (Long-haul CNP generation):</dt>
          <dd><t>The node MUST trigger Long-haul CNP generation when
          any of the following conditions is satisfied:</t>
          <ol type="a">
            <li>Queue depth QD exceeds K_max.</li>
            <li>ECN marking rate EMR exceeds a configured threshold
            V_ecn.</li>
            <li>Queue growth rate QGR exceeds a configured threshold
            V_growth.</li>
          </ol>
          <t>Second-level response is intended for severe or rapidly
          worsening congestion where the first-level ECN feedback
          loop cannot respond quickly enough due to WAN path
          latency.</t></dd>
        </dl>
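
        <t>The two response levels can be expressed as two
        independent checks on the monitored metrics. The following
        sketch is illustrative only; the Decision type and the
        evaluate() function are hypothetical names, and the
        threshold values are supplied by the caller.</t>

        <sourcecode type="python"><![CDATA[
```python
from typing import NamedTuple


class Decision(NamedTuple):
    mark_ecn: bool             # first-level response
    send_long_haul_cnp: bool   # second-level response


def evaluate(qd: float, emr: float, qgr: float, *,
             k_min: float, k_max: float,
             v_ecn: float, v_growth: float) -> Decision:
    """Evaluate both response levels for one egress queue."""
    mark = qd > k_min
    # Any single second-level condition is sufficient.
    cnp = qd > k_max or emr > v_ecn or qgr > v_growth
    return Decision(mark, cnp)
```
]]></sourcecode>

        <t>Note that a fast-growing queue can trigger the
        second-level response while QD is still below K_min, which
        is the case the QGR condition is meant to catch.</t>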
      </section>

      <section anchor="cnp-generation" numbered="true" toc="default">
        <name>Long-haul CNP Generation and Transmission</name>

        <t>When a second-level trigger condition is met, the
        congestion-aware intermediate node MUST perform the
        following procedure:</t>

        <ol>
          <li>Identify the data flow(s) contributing to the
          congested queue. The node SHOULD select the flow(s) with
          the highest contribution to the queue occupancy when
          multiple flows share the queue.</li>
          <li>For each selected flow, look up the corresponding
          flow table entry to obtain the source IP address and
          source QP number.</li>
          <li>Determine the appropriate action instruction (Action
          Flags) and parameter values based on the current
          congestion severity (see
          <xref target="action-determination"/>).</li>
          <li>Construct a Long-haul CNP packet as specified in
          <xref target="packet-formats"/>, populating all required
          fields.</li>
          <li>Send the packet via unicast to the source IP address
          of the identified flow.</li>
        </ol>

        <t>The node MUST rate-limit Long-haul CNP generation to
        prevent excessive control traffic. The RECOMMENDED minimum
        interval between consecutive Long-haul CNP packets for the
        same flow is one estimated RTT (RTT_est).</t>
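
        <t>Flow selection and per-flow rate limiting can be sketched
        as follows. This is a simplified illustration: the per-flow
        queue occupancy map is assumed to be available from the
        forwarding plane, and the function and class names are
        hypothetical.</t>

        <sourcecode type="python"><![CDATA[
```python
from typing import Dict, Hashable, List, Optional


def top_contributors(queue_bytes_by_flow: Dict[Hashable, int],
                     count: int = 1) -> List[Hashable]:
    """Pick the flows occupying the most bytes in the congested queue."""
    ranked = sorted(queue_bytes_by_flow,
                    key=queue_bytes_by_flow.get, reverse=True)
    return ranked[:count]


class CnpRateLimiter:
    """Allow at most one Long-haul CNP per flow per RTT_est."""

    def __init__(self, rtt_est_s: float) -> None:
        self.min_interval_s = rtt_est_s
        self.last_sent: Dict[Hashable, float] = {}

    def allow(self, flow: Hashable, now: float) -> bool:
        last: Optional[float] = self.last_sent.get(flow)
        if last is not None and now - last < self.min_interval_s:
            return False          # too soon after the previous CNP
        self.last_sent[flow] = now
        return True
```
]]></sourcecode>

        <t>A node would call allow() immediately before constructing
        a Long-haul CNP and skip generation when it returns
        False.</t>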

        <section anchor="action-determination" numbered="true"
                 toc="default">
          <name>Action Instruction Determination</name>

          <t>The Action Flags field in the Long-haul CNP packet
          encodes the specific control action requested of the
          source. The following guidelines apply:</t>

          <dl>
            <dt>Notify (00):</dt>
            <dd>Informs the source that congestion exists. The
            source SHOULD apply its default congestion response
            algorithm. This action is used when congestion is
            detected but not yet severe.</dd>
            <dt>Pause (01):</dt>
            <dd>Instructs the source to temporarily halt
            transmission on the indicated QP for the duration
            specified in the Parameter field (in microseconds).
            This action SHOULD be used only for severe congestion
            where rate reduction alone is insufficient.</dd>
            <dt>Rate Reduce (10):</dt>
            <dd>Instructs the source to reduce its sending rate on
            the indicated QP by the percentage specified in the
            Parameter field. For example, a Parameter value of 30
            indicates a 30% rate reduction.</dd>
            <dt>Resume (11):</dt>
            <dd>Informs the source that congestion has subsided
            and that the source MAY restore its sending rate on the
            indicated QP. The Parameter field specifies the
            permitted rate recovery as a percentage of the
            reduction applied by the last congestion action. For
            example, a Parameter value of 50 indicates that the
            source MAY recover 50% of the reduction that was
            previously applied. A Parameter value of 0 indicates
            unconditional resume to the original rate. A
            congestion-aware intermediate node generates this
            action when it observes that queue depth has remained
            below K_min for a sustained period (RECOMMENDED: at
            least 1 * RTT_est).</dd>
          </dl>
        </section>
      </section>

      <section anchor="source-behavior" numbered="true"
               toc="default">
        <name>Source Behavior upon Receiving Long-haul CNP</name>

        <t>Upon receiving a Long-haul CNP packet, the source
        MUST:</t>

        <ol>
          <li>Validate the packet by checking that the Source QP
          Number matches a locally active QP and that the source
          IP of the Long-haul CNP belongs to a known intermediate
          node (see <xref target="security"/>).</li>
          <li>Parse the Action Flags and Parameter fields.</li>
          <li>Apply the indicated action to the corresponding QP's
          sending rate or transmission state.</li>
        </ol>

        <t>The source SHOULD maintain a per-QP timer. If no new
        Long-haul CNP packet is received for the same QP within a
        configurable recovery interval (RECOMMENDED: 2 * RTT_est),
        the source SHOULD gradually increase its sending rate using
        an additive-increase algorithm until normal rate is
        restored or a new Long-haul CNP is received.</t>

        <t>Upon receiving a Resume action, the source SHOULD
        increase its sending rate by the percentage indicated in the
        Parameter field. The source MAY combine the timer-based
        recovery mechanism with explicit Resume actions: when a
        Resume is received, the source applies the indicated rate
        increase immediately rather than waiting for the recovery
        timer.</t>
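
        <t>The source-side handling of actions and the timer-based
        recovery can be sketched as below. This is one possible
        reading, not a normative algorithm: the QpRateState
        structure, the additive step size, and the interpretation of
        Resume as recovering a share of the gap between the current
        rate and a baseline rate are assumptions made for this
        example. Notify is left to the source's default congestion
        response and only refreshes the timer here.</t>

        <sourcecode type="python"><![CDATA[
```python
from dataclasses import dataclass

NOTIFY, PAUSE, RATE_REDUCE, RESUME = 0b00, 0b01, 0b10, 0b11


@dataclass
class QpRateState:
    baseline_rate: float       # rate before congestion actions
    rate: float                # current sending rate
    paused_until: float = 0.0  # time at which a Pause expires
    last_cnp: float = 0.0      # arrival time of the latest CNP


def apply_action(qp: QpRateState, action: int, parameter: int,
                 now: float) -> None:
    """Apply one Long-haul CNP action to the indicated QP."""
    qp.last_cnp = now
    if action == PAUSE:
        qp.paused_until = now + parameter / 1_000_000  # microseconds
    elif action == RATE_REDUCE:
        qp.rate = qp.rate * (100 - parameter) / 100
    elif action == RESUME:
        if parameter == 0:
            qp.rate = qp.baseline_rate   # unconditional resume
        else:
            gap = qp.baseline_rate - qp.rate
            qp.rate += gap * parameter / 100


def timer_recovery(qp: QpRateState, now: float, rtt_est_s: float,
                   step: float) -> None:
    """Additive increase once no CNP has arrived for 2 * RTT_est."""
    if now - qp.last_cnp >= 2 * rtt_est_s:
        qp.rate = min(qp.baseline_rate, qp.rate + step)
```
]]></sourcecode>

        <t>In this sketch, timer_recovery() would be invoked
        periodically per QP; it has no effect while Long-haul CNP
        packets keep arriving within the recovery interval.</t>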

        <t>If a source receives a Long-haul CNP but does not
        support the Long-haul CNP format, it MUST silently discard
        the packet (ICMPv6 format) or process it as a standard CNP
        (RoCEv2 format). This ensures backward compatibility with
        sources that only support standard CNP. See
        <xref target="backward-compat"/> for details.</t>
      </section>

      <section anchor="multi-device" numbered="true" toc="default">
        <name>Multi-device Collaborative Congestion Suppression</name>

        <t>When a data flow traverses multiple congestion-aware
        intermediate nodes, uncoordinated Long-haul CNP generation
        can result in duplicate or conflicting control instructions
        reaching the source. This section specifies coordination
        rules to mitigate this issue.</t>

        <dl>
          <dt>Upstream Priority:</dt>
          <dd>When congestion occurs, the node closest to the
          congestion point (in the upstream direction toward the
          source) SHOULD respond first. Because this node's
          Long-haul CNP has the shortest path to the source, it
          achieves the fastest feedback.</dd>

          <dt>Downstream Deferral:</dt>
          <dd><t>A downstream congestion-aware node that detects
          congestion for a given flow SHOULD check whether a
          Long-haul CNP for the same flow has recently been
          generated by an upstream node. This can be inferred by
          observing a reduction in the flow's arrival rate within
          a configurable observation window (RECOMMENDED: 1 *
          RTT_est). If such a reduction is observed:</t>
          <ol type="a">
            <li>The downstream node SHOULD temporarily suppress
            Long-haul CNP generation for that flow, to avoid
            sending duplicate instructions.</li>
            <li>If the downstream node's congestion metrics
            continue to worsen after the observation window has
            elapsed (i.e., the upstream control has not been
            effective), the downstream node MUST generate a
            Long-haul CNP with a higher Congestion Level value and a
            stricter action instruction (e.g., escalating from Rate
            Reduce to Pause).</li>
          </ol></dd>
        </dl>
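
        <t>The Downstream Deferral rule can be sketched as a decision
        taken once per observation window. The example below is
        illustrative only: the 10% arrival-rate drop used to infer
        an upstream action (drop_fraction) and all names are
        assumptions, since this document does not specify how large
        a reduction must be.</t>

        <sourcecode type="python"><![CDATA[
```python
from enum import Enum


class Deferral(Enum):
    SEND = "send"            # no upstream action inferred
    SUPPRESS = "suppress"    # upstream action inferred, wait
    ESCALATE = "escalate"    # upstream action inferred, not effective


def downstream_decision(rate_before: float, rate_now: float,
                        qd_before: float, qd_now: float,
                        drop_fraction: float = 0.1) -> Deferral:
    """Decide how a downstream node reacts after one observation
    window (RECOMMENDED: 1 * RTT_est) for a congested flow."""
    upstream_acted = rate_now <= rate_before * (1 - drop_fraction)
    if not upstream_acted:
        return Deferral.SEND
    if qd_now > qd_before:
        # Congestion kept worsening despite the upstream control.
        return Deferral.ESCALATE
    return Deferral.SUPPRESS
```
]]></sourcecode>

        <t>On ESCALATE, the node would generate a Long-haul CNP with
        a higher Congestion Level and a stricter action, for example
        Pause instead of Rate Reduce.</t>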
      </section>

      <section anchor="dynamic-param" numbered="true" toc="default">
        <name>Dynamic Parameter Adjustment</name>

        <t>Congestion-aware intermediate nodes MAY dynamically
        adjust the threshold parameters (K_min, K_max, V_ecn,
        V_growth) and rate-limiting intervals based on observed
        traffic characteristics such as average queue occupancy,
        traffic burstiness patterns, and link utilization history.
        The specific algorithms for such adjustment are
        implementation-dependent and outside the scope of this
        document.</t>
      </section>
    </section>

    <section anchor="packet-formats" numbered="true" toc="default">
      <name>Packet Formats</name>

      <t>This document defines two Long-haul CNP packet formats.
      Implementations MUST support at least one format and SHOULD
      indicate the supported format(s) through out-of-band
      configuration or capability exchange.</t>

      <section anchor="icmpv6-format" numbered="true" toc="default">
        <name>ICMPv6 Extension Format</name>

        <t>This format encapsulates the Long-haul CNP as a new
        ICMPv6 informational message type
        <xref target="RFC4443"/>. Because the Long-haul CNP is a
        new ICMPv6 message type with a fully defined fixed-length
        body (no variable-length "original datagram" field), the
        length ambiguity problem addressed by
        <xref target="RFC4884"/> does not apply. However, the
        Optional Extension Objects defined in this section adopt
        the extension structure format from
        <xref target="RFC4884"/> (Extension Header with Version
        and Checksum, and Extension Objects with Class-Num, C-Type,
        Length) for consistency with IETF ICMP extension
        conventions and to enable reuse of existing ICMP extension
        parsing implementations.</t>

        <figure anchor="fig-icmpv6">
          <name>Long-haul CNP in ICMPv6 Format</name>
          <artwork type="ascii-art"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     Code      |           Checksum            |
+---------------+---------------+-------------------------------+
|  Cong. Level  | Action Flags  |          Parameter            |
+---------------+---------------+-------------------------------+
|                      Source QP Number                         |
+---------------------------------------------------------------+
|  Metric Type  |         Congestion Metric Value (24 bits)     |
+---------------+-----------------------------------------------+
|          Extension Header (optional, see below)               |
+---------------------------------------------------------------+
|          Extension Object(s) (optional, variable)             |
~                                                               ~
+---------------------------------------------------------------+
]]></artwork>
        </figure>

        <t>The fixed-length body of the Long-haul CNP ICMPv6
        message is 12 octets (3 x 32-bit words), comprising the
        fields from Congestion Level through Congestion Metric.
        This fixed length is known to all implementations, so the
        boundary between the fixed body and any extension structure
        is unambiguous.</t>
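
        <t>The 12-octet fixed body shown in the figure can be encoded
        and decoded as in the following sketch. It covers only the
        fields from Congestion Level through Congestion Metric Value
        in network byte order; the ICMPv6 Type, Code, and Checksum
        are omitted because the Type value is not yet assigned. The
        function names are hypothetical.</t>

        <sourcecode type="python"><![CDATA[
```python
import struct

ACTIONS = {"notify": 0b00, "pause": 0b01,
           "rate_reduce": 0b10, "resume": 0b11}
BODY = struct.Struct("!BBHII")   # 1 + 1 + 2 + 4 + 4 = 12 octets


def pack_body(level: int, action: str, parameter: int,
              source_qp: int, metric_type: int, metric: int) -> bytes:
    """Pack Congestion Level through Congestion Metric Value."""
    if not 0 <= metric < (1 << 24):
        raise ValueError("metric value must fit in 24 bits")
    flags = ACTIONS[action] << 6          # lower 6 bits stay zero
    word3 = (metric_type << 24) | metric  # 8-bit type + 24-bit value
    return BODY.pack(level, flags, parameter, source_qp, word3)


def unpack_body(body: bytes) -> dict:
    """Decode the fixed body; reserved flag bits are ignored."""
    level, flags, parameter, source_qp, word3 = BODY.unpack(body)
    return {"level": level, "action": flags >> 6,
            "parameter": parameter, "source_qp": source_qp,
            "metric_type": word3 >> 24,
            "metric": word3 & 0xFFFFFF}


# Rate Reduce by 30% for QP 0x12, queue depth metric of 512 KB.
example = pack_body(200, "rate_reduce", 30, 0x12, 1, 512)
```
]]></sourcecode>

        <t>Any octets following the first 12 of the message body
        would belong to the optional extension structure.</t>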

        <dl>
          <dt>Type (8 bits):</dt>
          <dd>ICMPv6 message type. A new value is to be assigned
          by IANA from the informational message range (128-255).
          See <xref target="iana"/>.</dd>

          <dt>Code (8 bits):</dt>
          <dd>Message sub-type. Value 0 indicates a flow-level
          Long-haul CNP. Other values are reserved for future
          definition. See <xref target="iana-code"/> for the
          registration policy.</dd>

          <dt>Checksum (16 bits):</dt>
          <dd>Standard ICMPv6 checksum as specified in
          <xref target="RFC4443"/>.</dd>

          <dt>Congestion Level (8 bits):</dt>
          <dd>An unsigned integer indicating the severity of
          congestion, where 0 indicates no congestion and 255
          indicates maximum severity. This value is determined by
          the generating node based on its local congestion
          assessment.</dd>

          <dt>Action Flags (8 bits):</dt>
          <dd>The upper 2 bits encode the primary action:
          00 = Notify, 01 = Pause, 10 = Rate Reduce,
          11 = Resume. The lower 6 bits are reserved and MUST be
          set to zero by senders. Receivers MUST ignore the
          reserved bits.</dd>

          <dt>Parameter (16 bits):</dt>
          <dd>Semantics depend on Action Flags. For Rate Reduce:
          rate reduction percentage (0-100). For Pause: pause
          duration in microseconds; the 16-bit field bounds this
          at 65,535 microseconds (about 65.5 ms). For Resume: rate
          recovery percentage (0-100), where 0 indicates
          unconditional resume to the original rate. For Notify:
          unused, MUST be set to zero.</dd>

          <dt>Source QP Number (32 bits):</dt>
          <dd>The QP number at the traffic source that should
          apply the indicated action.</dd>

          <dt>Metric Type (8 bits):</dt>
          <dd>Identifies the type of the Congestion Metric Value
          field. Defined values are: 0 = Unspecified
          (implementation-defined semantics, for backward
          compatibility), 1 = Queue Depth in kilobytes,
          2 = Queue Growth Rate in kilobytes per millisecond,
          3 = ECN Marking Rate as a percentage (0-100),
          4 = RTT-based metric in microseconds, 5-253 =
          Unassigned (see <xref target="iana-metric-type"/>),
          254-255 = Experimental. When Metric Type is 0, receivers
          SHOULD treat the Congestion Metric Value as opaque
          context.</dd>

          <dt>Congestion Metric Value (24 bits):</dt>
          <dd>An unsigned integer whose semantics are determined by
          the Metric Type field. This field provides additional
          context for source-side decision-making. When the
          generating node does not wish to disclose queue state
          information, both Metric Type and Congestion Metric Value
          MUST be set to zero.</dd>
        </dl>
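
        <t>As a non-normative illustration, the 12-octet fixed
        body described above could be encoded and decoded as in
        the following Python sketch. The class name CnpBody, its
        attribute names, and the action constants are assumptions
        made for this example and are not defined by this
        document. The ICMPv6 header and any extension structure
        are out of scope here.</t>

        <sourcecode type="python"><![CDATA[
```python
import struct
from dataclasses import dataclass

# Primary actions carried in the upper 2 bits of Action Flags.
NOTIFY, PAUSE, RATE_REDUCE, RESUME = 0, 1, 2, 3

# One 32-bit word each: level/flags/parameter, QP, metric.
_BODY = struct.Struct("!BBHII")


@dataclass
class CnpBody:  # hypothetical name, not defined by this document
    level: int         # Congestion Level, 0-255
    action: int        # primary action, 0-3
    parameter: int     # 16 bits, meaning depends on action
    source_qp: int     # 32 bits
    metric_type: int   # 8 bits
    metric_value: int  # 24 bits

    def pack(self) -> bytes:
        if not 0 <= self.action <= 3:
            raise ValueError("action must fit in 2 bits")
        if not 0 <= self.metric_value < (1 << 24):
            raise ValueError("metric value must fit in 24 bits")
        flags = self.action << 6  # lower 6 bits reserved, zero
        metric = (self.metric_type << 24) | self.metric_value
        return _BODY.pack(self.level, flags, self.parameter,
                          self.source_qp, metric)

    @classmethod
    def unpack(cls, data: bytes) -> "CnpBody":
        level, flags, param, qp, metric = _BODY.unpack(data[:12])
        # Receivers ignore the reserved lower 6 bits of the flags.
        return cls(level, flags >> 6, param, qp,
                   metric >> 24, metric & 0xFFFFFF)
```
]]></sourcecode>

        <t>For instance, CnpBody(180, RATE_REDUCE, 30, 100, 1,
        130000).pack() produces the 12 octets for the Rate Reduce
        instruction used in <xref target="example"/>.</t>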

        <section anchor="icmpv6-extension-objects" numbered="true"
                 toc="default">
          <name>Optional Extension Objects</name>

          <t>Zero or more extension objects MAY follow the
          fixed-length body. When extension objects are present,
          they MUST be preceded by an Extension Header as defined
          in Section 7 of <xref target="RFC4884"/>, formatted
          as follows:</t>

          <figure anchor="fig-ext-header">
            <name>Extension Header Format</name>
            <artwork type="ascii-art"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|       Reserved        |           Checksum            |
+-------+-----------------------+-------------------------------+
]]></artwork>
          </figure>

          <dl>
            <dt>Version (4 bits):</dt>
            <dd>MUST be set to 2, as specified in Section 7 of
            <xref target="RFC4884"/>.</dd>
            <dt>Reserved (12 bits):</dt>
            <dd>MUST be set to zero.</dd>
            <dt>Checksum (16 bits):</dt>
            <dd>The one's complement of the one's complement sum
            of the Extension Header and all Extension Objects,
            with the Checksum field itself taken as zero during
            the computation, as specified in Section 7 of
            <xref target="RFC4884"/>.</dd>
          </dl>
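
          <t>The checksum rule above can be sketched in Python
          as follows. This is an illustrative sketch rather than
          a reference implementation: the function names are
          assumptions of this example, and the objects argument
          is expected to hold already padded Extension
          Objects.</t>

          <sourcecode type="python"><![CDATA[
```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum with end-around carry."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input for summing only
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total


def build_extension(objects: bytes) -> bytes:
    """Prepend a Version 2 Extension Header with its checksum."""
    # Version occupies the top 4 bits; the 12 reserved bits are 0.
    header = struct.pack("!HH", 2 << 12, 0)
    checksum = ~ones_complement_sum(header + objects) & 0xFFFF
    return struct.pack("!HH", 2 << 12, checksum) + objects


def extension_checksum_ok(ext: bytes) -> bool:
    """A valid structure sums to 0xFFFF including the checksum."""
    return ones_complement_sum(ext) == 0xFFFF
```
]]></sourcecode>

          <t>A receiver would call extension_checksum_ok() on the
          octets from the Extension Header to the end of the
          message before interpreting any object.</t>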

          <t>Each Extension Object following the Extension Header
          uses the object header format defined in Section 8 of
          <xref target="RFC4884"/>:</t>

          <figure anchor="fig-ext-object">
            <name>Extension Object Format</name>
            <artwork type="ascii-art"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Length             |  Class-Num    |    C-Type     |
+-------------------------------+---------------+---------------+
|                  Object Payload (variable)                    |
~                                                               ~
+---------------------------------------------------------------+
]]></artwork>
          </figure>

          <dl>
            <dt>Length (16 bits):</dt>
            <dd>Total length of the Extension Object in octets,
            including the 4-octet object header. As specified in
            <xref target="RFC4884"/>, each Extension Object MUST
            be zero-padded to a 4-octet boundary. The Length field
            indicates the actual (unpadded) length; receivers MUST
            use the padded length when advancing to the next
            Extension Object.</dd>
            <dt>Class-Num (8 bits):</dt>
            <dd>Identifies the class of the Extension Object.
            A new Class-Num is to be assigned by IANA (see
            <xref target="iana-classnum"/>).</dd>
            <dt>C-Type (8 bits):</dt>
            <dd>Identifies the sub-type within the class.
            Defined C-Types include: 0 = Reserved, 1 = Timestamp
            (8-octet NTP timestamp), 2 = Device Identifier
            (variable-length UTF-8 string, padded to 4-octet
            boundary), 3 = Path Identifier (variable-length opaque
            value, padded to 4-octet boundary).</dd>
          </dl>

          <t>When no extension objects are present, the Extension
          Header MUST be omitted entirely. Receivers determine the
          presence of extension objects by checking whether the
          ICMPv6 message length exceeds 16 octets, that is, the
          4-octet ICMPv6 header (Type, Code, Checksum) plus the
          12-octet fixed body.</t>
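
          <t>The presence check and the object walk described in
          this section might look like the following Python
          sketch. The function and constant names are assumptions
          of this example. The message argument is assumed to
          start at the ICMPv6 Type field, and checksum
          verification is omitted for brevity.</t>

          <sourcecode type="python"><![CDATA[
```python
import struct

ICMPV6_HEADER_LEN = 4  # Type, Code, Checksum
FIXED_BODY_LEN = 12    # Congestion Level through Metric Value
EXT_HEADER_LEN = 4


def iter_extension_objects(message: bytes):
    """Yield (class_num, c_type, payload) for each object."""
    offset = ICMPV6_HEADER_LEN + FIXED_BODY_LEN
    if len(message) <= offset:
        return  # nothing beyond the fixed body: no extensions
    if len(message) < offset + EXT_HEADER_LEN:
        raise ValueError("truncated Extension Header")
    if message[offset] >> 4 != 2:
        raise ValueError("unsupported extension version")
    offset += EXT_HEADER_LEN
    while offset < len(message):
        if len(message) - offset < 4:
            raise ValueError("truncated object header")
        length, class_num, c_type = struct.unpack_from(
            "!HBB", message, offset)
        if length < 4 or offset + length > len(message):
            raise ValueError("bad object length")
        yield class_num, c_type, message[offset + 4:offset + length]
        # Length is unpadded; step over the zero padding as well.
        offset += (length + 3) & ~3
```
]]></sourcecode>

          <t>The last line reflects the rule that Length carries
          the unpadded size while the next object starts on the
          following 4-octet boundary.</t>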
        </section>
      </section>

      <section anchor="rocev2-format" numbered="true" toc="default">
        <name>RoCEv2 Backward-Compatible Extension Format</name>

        <t>This format achieves backward compatibility by reusing
        the standard CNP BTH Opcode (0x81) and signaling the
        extension through a reserved bit in the BTH. This approach
        avoids the need for the IETF to request a new Opcode from
        the InfiniBand Trade Association (IBTA). It is intended to
        let legacy endpoints that do not support Long-haul CNP
        process the packet as a standard CNP and apply their
        default rate reduction behavior; see
        <xref target="backward-compat"/> for the conditions under
        which this holds.</t>

        <section anchor="rocev2-ebit" numbered="true"
                 toc="default">
          <name>Extension Present (E) Bit Definition</name>

          <t>In the standard RoCEv2 BTH, the 6-bit field
          immediately following the BECN (Backward Explicit
          Congestion Notification) bit is reserved and MUST be
          set to zero per the current RoCEv2 specification.
          This document proposes to the IBTA the definition of
          the most significant bit of this 6-bit reserved field
          as the Extension Present (E) bit:</t>

          <dl>
            <dt>E = 0:</dt>
            <dd>Standard CNP. No extension fields follow the
            BTH. The packet is processed as a conventional
            RoCEv2 CNP.</dd>
            <dt>E = 1:</dt>
            <dd>Long-haul CNP. Extension fields carrying
            congestion control instructions follow the BTH.
            The packet retains all standard CNP BTH field
            values (Opcode=0x81, BECN=1) and is fully parseable
            as a standard CNP by legacy endpoints.</dd>
          </dl>

          <t>The E bit definition is a proposal for IBTA
          consideration. This bit resides within a reserved field
          that is under IBTA governance, and formal allocation
          requires IBTA approval. Prior to such approval,
          implementations MUST NOT deploy this format in
          environments where non-participating endpoints or
          intermediate nodes may be present, as legacy devices
          that validate the reserved field as zero may reject
          packets with E=1. See <xref target="iana-rocev2"/>
          for further coordination details.</t>
        </section>

        <section anchor="rocev2-packet-layout" numbered="true"
                 toc="default">
          <name>Packet Layout</name>

          <figure anchor="fig-rocev2">
            <name>Long-haul CNP in RoCEv2 Extension Format</name>
            <artwork type="ascii-art"><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    MAC Header (112 bits)                      |
+---------------------------------------------------------------+
|              IPv4/IPv6 Header (160/320 bits)                  |
+---------------------------------------------------------------+
|               UDP Header (64 bits, DstPort=4791)              |
+---------------------------------------------------------------+
 BTH (Base Transport Header, 96 bits):
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| OpCode (0x81) |S|M|Pad| TVer  |       Partition Key           |
+---------------+-+-+---+-------+-------------------------------+
|F|B|E| RSV(5b) |              DestQP (24 bits)                 |
+-+-+-+-+-+-+-+-+-----------------------------------------------+
|A|  RSV (7b)   |              PSN (24 bits)                    |
+-+-------------+-----------------------------------------------+
 Long-haul CNP Extension Fields (present when E=1):
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Cong. Level   | Action Flags  |         Parameter             |
+---------------+---------------+-------------------------------+
|                  Source QP Number (32 bits)                   |
+---------------------------------------------------------------+
|  Metric Type  |     Congestion Metric Value (24 bits)         |
+---------------+-----------------------------------------------+
|             Optional Extension Objects (variable)             |
~                                                               ~
+---------------------------------------------------------------+
|                      ICRC (32 bits)                           |
+---------------------------------------------------------------+
|                      FCS  (32 bits)                           |
+---------------------------------------------------------------+
]]></artwork>
          </figure>
        </section>

        <section anchor="rocev2-bth-fields" numbered="true"
                 toc="default">
          <name>BTH Field Values for Long-haul CNP</name>

          <t>When generating a Long-haul CNP in RoCEv2 format,
          the congestion-aware intermediate node MUST set the BTH
          fields as follows:</t>

          <dl>
            <dt>OpCode (8 bits):</dt>
            <dd>MUST be set to 0x81 (binary 10000001), the
            standard CNP opcode as defined in the RoCEv2
            specification <xref target="RoCEv2"/>.</dd>

            <dt>Solicited Event - SE (1 bit):</dt>
            <dd>MUST be set to 0.</dd>

            <dt>MigReq - M (1 bit):</dt>
            <dd>MUST be set to 0.</dd>

            <dt>Pad Count - PadCnt (2 bits):</dt>
            <dd>MUST be set to 0.</dd>

            <dt>Transport Header Version - TVer (4 bits):</dt>
            <dd>MUST be set to 0x0.</dd>

            <dt>Partition Key - P_KEY (16 bits):</dt>
            <dd>MUST be set to 0xFFFF, or the partition key
            value of the target flow if partition-aware
            operation is required.</dd>

            <dt>FECN - F (1 bit):</dt>
            <dd>MUST be set to 0.</dd>

            <dt>BECN - B (1 bit):</dt>
            <dd>MUST be set to 1, indicating backward congestion
            notification. This is the standard CNP BECN
            setting.</dd>

            <dt>Extension Present - E (1 bit):</dt>
            <dd>MUST be set to 1 for Long-haul CNP. This bit
            indicates that Long-haul CNP extension fields follow
            the BTH.</dd>

            <dt>Reserved (5 bits):</dt>
            <dd>MUST be set to 0.</dd>

            <dt>DestQP (24 bits):</dt>
            <dd>The destination QP number at the source to be
            controlled. In standard CNP semantics, this identifies
            the QP that should reduce its sending rate.</dd>

            <dt>Acknowledge Request - A (1 bit):</dt>
            <dd>MUST be set to 0.</dd>

            <dt>Reserved (7 bits):</dt>
            <dd>MUST be set to 0.</dd>

            <dt>PSN (24 bits):</dt>
            <dd>MUST be set to 0, consistent with standard CNP
            behavior.</dd>
          </dl>
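
          <t>The field values listed above could be assembled
          into the 12-octet BTH as in the following Python
          sketch. The function name build_bth and its parameters
          are assumptions of this example, and the E bit position
          follows the proposal in
          <xref target="rocev2-ebit"/>, which remains subject to
          IBTA approval.</t>

          <sourcecode type="python"><![CDATA[
```python
import struct

CNP_OPCODE = 0x81
BECN_BIT = 0x40  # second most significant bit of the octet
E_BIT = 0x20     # proposed: MSB of the 6-bit reserved field


def build_bth(dest_qp: int, pkey: int = 0xFFFF,
              long_haul: bool = True) -> bytes:
    """Return the 12-octet BTH for a (Long-haul) CNP."""
    if not 0 <= dest_qp < (1 << 24):
        raise ValueError("DestQP must fit in 24 bits")
    flags = BECN_BIT | (E_BIT if long_haul else 0)  # FECN = 0
    return struct.pack(
        "!BBHII",
        CNP_OPCODE,
        0,                        # SE=0, M=0, PadCnt=0, TVer=0
        pkey,
        (flags << 24) | dest_qp,  # F, B, E, reserved, DestQP
        0,                        # A=0, reserved=0, PSN=0
    )
```
]]></sourcecode>

          <t>With DestQP=100 and the default partition key, the
          result is the octet string 81 00 FF FF 60 00 00 64
          00 00 00 00.</t>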
        </section>

        <section anchor="rocev2-ext-fields" numbered="true"
                 toc="default">
          <name>Extension Fields</name>

          <t>The extension fields immediately follow the BTH when
          E=1. Their encoding is consistent with the ICMPv6
          format:</t>

          <dl>
            <dt>Congestion Level (8 bits):</dt>
            <dd>Encoding consistent with
            <xref target="icmpv6-format"/>.</dd>

            <dt>Action Flags (8 bits):</dt>
            <dd>Encoding consistent with
            <xref target="icmpv6-format"/>.</dd>

            <dt>Parameter (16 bits):</dt>
            <dd>Encoding consistent with
            <xref target="icmpv6-format"/>.</dd>

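            <dt>Source QP Number (32 bits):</dt>
            <dd>The QP number at the traffic source that should
            apply the indicated action, encoded as in
            <xref target="icmpv6-format"/>. In this format it
            identifies the same QP as the 24-bit DestQP field in
            the BTH (SourceQP=100 and DestQP=100 in
            <xref target="example"/>).</dd>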

            <dt>Metric Type (8 bits):</dt>
            <dd>Encoding consistent with
            <xref target="icmpv6-format"/>.</dd>

            <dt>Congestion Metric Value (24 bits):</dt>
            <dd>Encoding consistent with
            <xref target="icmpv6-format"/>.</dd>

            <dt>Optional Extension Objects (variable):</dt>
            <dd>Zero or more Extension Objects, using the same
            Extension Header and Extension Object format as
            defined in <xref target="icmpv6-extension-objects"/>.
            When present, the Extension Header MUST precede the
            first Extension Object.</dd>

            <dt>ICRC (32 bits):</dt>
            <dd>Invariant CRC as specified in the RoCEv2
            specification. The ICRC computation MUST include
            the extension fields and any Extension Objects.
            See <xref target="rocev2-icrc-compat"/> for ICRC
            compatibility analysis.</dd>

            <dt>FCS (32 bits):</dt>
            <dd>Ethernet Frame Check Sequence.</dd>
          </dl>
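
          <t>On the receiving side, a supporting endpoint might
          separate standard CNPs from Long-haul CNPs roughly as
          in the following Python sketch. The function name, the
          dictionary keys, and the assumption that the input is
          the UDP payload from the BTH through the ICRC are all
          choices made for this example. ICRC verification and
          Optional Extension Objects are not handled.</t>

          <sourcecode type="python"><![CDATA[
```python
import struct

BTH_LEN = 12
EXT_LEN = 12
ICRC_LEN = 4
E_BIT = 0x20  # proposed bit; subject to IBTA approval


def parse_cnp(udp_payload: bytes) -> dict:
    """Parse a CNP given the octets from BTH through ICRC."""
    opcode, _, _, word1, _ = struct.unpack_from(
        "!BBHII", udp_payload, 0)
    if opcode != 0x81:
        raise ValueError("not a CNP")
    result = {"dest_qp": word1 & 0xFFFFFF, "long_haul": False}
    has_ext = bool((word1 >> 24) & E_BIT)
    if not has_ext:
        return result  # standard CNP: default rate reduction
    if len(udp_payload) < BTH_LEN + EXT_LEN + ICRC_LEN:
        raise ValueError("E=1 but extension fields are missing")
    level, flags, param, src_qp, metric = struct.unpack_from(
        "!BBHII", udp_payload, BTH_LEN)
    result.update(
        long_haul=True, level=level, action=flags >> 6,
        parameter=param, source_qp=src_qp,
        metric_type=metric >> 24, metric_value=metric & 0xFFFFFF)
    return result
```
]]></sourcecode>

          <t>An endpoint that does not implement this
          specification effectively takes the E=0 branch for
          every CNP, which is the fallback behavior this format
          relies on.</t>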
        </section>

        <section anchor="rocev2-icrc-compat" numbered="true"
                 toc="default">
          <name>ICRC Compatibility Analysis</name>

          <t>In the RoCEv2 specification, the ICRC is computed
          over the invariant content of the packet, from the IP
          header through the byte immediately preceding the ICRC
          field itself, with header fields that may change in
          transit replaced by defined values. When a Long-haul
          CNP is constructed with E=1, the extension fields and
          any Optional Extension Objects are placed between the
          BTH and the ICRC field. Therefore, the ICRC computation
          naturally covers the extension data.</t>

          <t>A legacy RNIC that receives a Long-haul CNP computes
          the ICRC over the same byte range. Because the ICRC
          always occupies the four octets immediately before the
          FCS, the legacy RNIC includes the extension fields in
          its ICRC computation even though it does not parse
          them. Consequently, ICRC verification succeeds on
          legacy endpoints, and the presence of extension fields
          does not cause an ICRC mismatch.</t>
        </section>
      </section>
    </section>

    <section anchor="example" numbered="true" toc="default">
      <name>Operational Example</name>

      <t>Consider a DCI path: Source (DC-A) -> N1 -> N2 -> Dest
      (DC-B), where N1 and N2 are congestion-aware intermediate
      nodes, and the WAN RTT is 10 ms.</t>

      <figure anchor="fig-example">
        <name>Operational Example Topology</name>
        <artwork type="ascii-art"><![CDATA[
  +--------+    +----+    +----+    +--------+
  | Source |--->| N1 |--->| N2 |--->|  Dest  |
  |  DC-A  |    |    |    |    |    |  DC-B  |
  +--------+    +----+    +----+    +--------+
   10.0.0.1                          10.0.0.4
   QP=100                            QP=200
]]></artwork>
      </figure>

      <ol>
        <li>N1 learns the flow: {Src=10.0.0.1, Dst=10.0.0.4,
        SrcQP=100, DstQP=200}.</li>

        <li>N1 calculates K_max for its egress port (100 Gbps link):
        K_max = max(64KB, 1.0 * 100e9 * 0.010 / 8) = max(64KB,
        125MB) = 125 MB. K_min = 62.5 MB.</li>

        <li>Queue depth at N1 reaches 70 MB (exceeds K_min). N1
        applies ECN marking to traversing packets (first-level
        response).</li>

        <li>Queue depth continues to grow to 130 MB (exceeds K_max).
        N1 generates a Long-haul CNP: {Action=Rate Reduce,
        Parameter=30, SourceQP=100, CongLevel=180,
        MetricType=1 (Queue Depth in KB), MetricValue=130000}
        and unicasts it to 10.0.0.1. In RoCEv2 format, the BTH
        uses Opcode=0x81, BECN=1, E=1, DestQP=100.</li>

        <li>The source receives the Long-haul CNP after at most
        about 5 ms (half of the 10 ms RTT), because N1 lies on the
        forward path and sends the notification directly to the
        source. If the source supports Long-haul CNP, it
        checks E=1 and parses the extension fields, reducing QP
        100's sending rate by 30%. If the source is a legacy RNIC
        that does not recognize the E bit, it processes the packet
        as a standard CNP and applies its default rate reduction
        algorithm (see <xref target="backward-compat"/> for
        frame length considerations).</li>

        <li>N2 also detects mild congestion but observes that the
        flow arrival rate from N1 has decreased. N2 defers
        Long-haul CNP generation.</li>

        <li>After 20 ms without a new Long-haul CNP, the source
        begins additive rate recovery on QP 100.</li>

        <li>N1 observes queue depth has fallen below K_min for
        more than 10 ms (1 * RTT_est). N1 generates a Long-haul
        CNP: {Action=Resume, Parameter=50, SourceQP=100,
        CongLevel=20, MetricType=1, MetricValue=30000}. The source
        increases QP 100's rate by 50% of the previously applied
        reduction.</li>
      </ol>
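
      <t>The threshold arithmetic in steps 2 through 4 can be
      reproduced with the short Python sketch below. The function
      names and return labels are assumptions of this example, the
      64 KB floor is taken as 64,000 octets, and K_min is assumed
      to be half of K_max, which matches the 62.5 MB figure in
      step 2.</t>

      <sourcecode type="python"><![CDATA[
```python
def k_max_bytes(link_bps: float, rtt_s: float,
                alpha: float = 1.0,
                floor_bytes: float = 64e3) -> float:
    """K_max = max(floor, alpha * capacity * RTT / 8)."""
    return max(floor_bytes, alpha * link_bps * rtt_s / 8)


def response_level(queue_bytes: float, k_min: float,
                   k_max: float) -> str:
    """Map queue depth to the response used in the example."""
    if queue_bytes > k_max:
        return "long-haul-cnp"  # second-level response
    if queue_bytes > k_min:
        return "ecn-mark"       # first-level response
    return "none"


k_max = k_max_bytes(100e9, 0.010)  # 100 Gbps link, 10 ms RTT
k_min = k_max / 2  # assumption: matches 62.5 MB in step 2
```
]]></sourcecode>

      <t>With these values, a 70 MB queue maps to ECN marking and
      a 130 MB queue maps to Long-haul CNP generation, as in
      steps 3 and 4.</t>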
    </section>

    <section anchor="backward-compat" numbered="true" toc="default">
      <name>Backward Compatibility</name>

      <t>The Long-haul CNP mechanism is designed for incremental
      deployment:</t>

      <dl>
        <dt>Non-supporting intermediate nodes:</dt>
        <dd>Intermediate nodes that do not implement this
        specification simply forward data packets without generating
        Long-haul CNP. The standard ECN/CNP feedback loop continues
        to operate as the baseline congestion control
        mechanism.</dd>

        <dt>Non-supporting sources (ICMPv6 format):</dt>
        <dd>If a source receives an ICMPv6 Long-haul CNP but does
        not recognize the Type value, it MUST process it according
        to the standard ICMPv6 rules for unknown message types.
        For informational messages (Type values 128-255),
        Section 2.4 of <xref target="RFC4443"/> requires that a
        message of unknown type be silently discarded.</dd>

        <dt>Non-supporting sources (RoCEv2 format):</dt>
        <dd><t>Because the Long-haul CNP in RoCEv2 format uses the
        standard CNP Opcode (0x81) with all mandatory BTH fields
        set to their standard CNP values, a legacy RNIC that does
        not recognize the E bit will process the packet as a
        standard CNP. The legacy RNIC will ignore the extension
        fields (which appear after the expected CNP boundary) and
        apply its default rate reduction behavior. This provides a
        graceful degradation path: precise control for supporting
        endpoints, and standard CNP rate reduction for legacy
        endpoints.</t>
        <t>However, the Long-haul CNP frame is larger than a
        standard CNP frame due to the extension fields (at least
        12 additional octets for the base extension, plus any
        Optional Extension Objects). Legacy RNIC implementations
        that perform strict frame length validation against the
        expected standard CNP size may reject the Long-haul CNP
        packet. Deployments SHOULD verify that legacy endpoints
        in the network tolerate CNP frames with additional
        trailing data beyond the standard BTH before enabling
        the RoCEv2 Long-haul CNP format. In environments where
        legacy endpoints are known to perform strict length
        validation, the ICMPv6 format SHOULD be used instead,
        or all endpoints should be upgraded to support the
        Long-haul CNP extension.</t></dd>

        <dt>Mixed deployment:</dt>
        <dd>In networks where only some intermediate nodes support
        this specification, the supporting nodes generate Long-haul
        CNP while non-supporting nodes rely on ECN marking alone.
        The multi-level framework degrades gracefully to standard
        ECN/CNP behavior in portions of the path without
        Long-haul CNP capability.</dd>
      </dl>
    </section>

    <section anchor="security" numbered="true" toc="default">
      <name>Security Considerations</name>

      <t>Long-haul CNP packets carry control instructions that
      directly affect the source's sending behavior. The following
      security considerations apply:</t>

      <dl>
        <dt>Packet Forgery:</dt>
        <dd>A malicious entity could forge Long-haul CNP packets to
        cause a source to reduce its rate or pause transmission,
        resulting in denial of service. To mitigate this, sources
        SHOULD validate that the IP source address of received
        Long-haul CNP packets belongs to a configured set of known
        congestion-aware intermediate node addresses. For the ICMPv6
        format, IPsec Authentication Header (AH) MAY be used to
        provide packet authentication. For the RoCEv2 format, the
        ICRC detects accidental corruption but is an unkeyed
        checksum that any sender can compute, so it provides no
        authentication; additional authentication mechanisms are
        RECOMMENDED in security-sensitive deployments.</dd>

        <dt>Amplification:</dt>
        <dd>A single congestion event at a queue shared by many
        flows could trigger Long-haul CNP generation toward many
        sources at once. Congestion-aware intermediate nodes MUST
        rate-limit Long-haul CNP generation on a per-flow basis
        (RECOMMENDED minimum interval: RTT_est per flow) and MUST
        impose a global rate limit on total Long-haul CNP output
        per port.</dd>

        <dt>Information Disclosure:</dt>
        <dd>The Congestion Metric Value field reveals internal queue
        state information. In deployments where this is considered
        sensitive, both the Metric Type and Congestion Metric Value
        fields MUST be set to zero while still providing actionable
        control via the Action Flags and Parameter fields.</dd>

        <dt>Reserved Bit Manipulation (RoCEv2 format):</dt>
        <dd>A malicious entity that can modify packets in transit
        could set the E bit on standard CNP packets and append
        forged extension fields. The ICRC detects accidental
        corruption of RoCEv2 packets, but an on-path attacker can
        recompute it after modifying a packet, so it does not by
        itself prevent this attack. Sources SHOULD validate that
        the combination of extension field values is consistent
        before applying control actions. Deployments that need
        protection against on-path modification require the
        additional authentication mechanisms recommended under
        Packet Forgery above.</dd>

        <dt>Resume Action Abuse:</dt>
        <dd>A malicious entity could forge Long-haul CNP packets
        with the Resume action to cause a source to prematurely
        increase its sending rate during actual congestion. The
        source address validation described under Packet Forgery
        above mitigates this risk. Additionally, sources SHOULD
        apply a maximum rate increase cap per Resume action to
        limit the impact of any single forged Resume
        instruction.</dd>
      </dl>
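
      <t>Two of the mitigations above, per-flow rate limiting at
      the intermediate node and source address validation at the
      source, can be sketched in Python as follows. The class and
      function names, the flow tuple, and the example prefix are
      assumptions of this illustration. The global per-port limit
      and any cryptographic authentication are not shown.</t>

      <sourcecode type="python"><![CDATA[
```python
import ipaddress


class CnpRateLimiter:  # hypothetical helper, intermediate node
    """Allow at most one Long-haul CNP per flow per interval."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s  # e.g. RTT_est
        self._last_sent = {}

    def allow(self, flow, now_s: float) -> bool:
        last = self._last_sent.get(flow)
        if last is not None and now_s - last < self.min_interval_s:
            return False  # too soon after the previous CNP
        self._last_sent[flow] = now_s
        return True


def from_trusted_node(src: str, trusted: list) -> bool:
    """Source-side check against configured node prefixes."""
    addr = ipaddress.ip_address(src)
    return any(addr in ipaddress.ip_network(p) for p in trusted)
```
]]></sourcecode>

      <t>A source would drop any Long-haul CNP for which
      from_trusted_node() returns False before acting on its
      Action Flags, including Resume.</t>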
    </section>

    <section anchor="iana" numbered="true" toc="default">
      <name>IANA Considerations</name>

      <section anchor="iana-icmpv6" numbered="true" toc="default">
        <name>ICMPv6 Type Allocation</name>

        <t>This document requests IANA to allocate a new value from
        the "ICMPv6 'type' Numbers" registry for the Long-haul CNP
        message type. The value SHOULD be allocated from the
        informational message range (128-255).</t>

        <table>
          <thead>
            <tr>
              <th>Type</th>
              <th>Name</th>
              <th>Reference</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>TBD1</td>
              <td>Long-haul Congestion Notification</td>
              <td>[This Document]</td>
            </tr>
          </tbody>
        </table>
      </section>

      <section anchor="iana-code" numbered="true" toc="default">
        <name>ICMPv6 Code Values</name>

        <t>This document requests IANA to create a sub-registry
        titled "Long-haul Congestion Notification Code Values"
        under the "ICMPv6 'type' Numbers" registry, for Code
        values associated with the ICMPv6 Type allocated in
        <xref target="iana-icmpv6"/>. The initial contents of
        this sub-registry are:</t>

        <table>
          <thead>
            <tr>
              <th>Code</th>
              <th>Name</th>
              <th>Reference</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>0</td>
              <td>Flow-level Long-haul CNP</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>1-253</td>
              <td>Unassigned</td>
              <td></td>
            </tr>
            <tr>
              <td>254-255</td>
              <td>Experimental</td>
              <td>[This Document]</td>
            </tr>
          </tbody>
        </table>

        <t>New Code values in the range 1-253 are to be assigned
        via Specification Required policy
        <xref target="RFC8126"/>.</t>
      </section>

      <section anchor="iana-classnum" numbered="true"
               toc="default">
        <name>ICMP Extension Object Class-Num</name>

        <t>This document requests IANA to allocate a new Class-Num
        value from the "ICMP Extension Object Classes and Class
        Sub-types" registry established by
        <xref target="RFC4884"/>.</t>

        <table>
          <thead>
            <tr>
              <th>Class-Num</th>
              <th>Class Name</th>
              <th>Reference</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>TBD2</td>
              <td>Long-haul CNP Extension</td>
              <td>[This Document]</td>
            </tr>
          </tbody>
        </table>

        <t>Within this Class-Num, the following C-Type values are
        defined:</t>

        <table>
          <thead>
            <tr>
              <th>C-Type</th>
              <th>Name</th>
              <th>Reference</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>0</td>
              <td>Reserved</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>1</td>
              <td>Timestamp</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>2</td>
              <td>Device Identifier</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>3</td>
              <td>Path Identifier</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>4-253</td>
              <td>Unassigned</td>
              <td></td>
            </tr>
            <tr>
              <td>254-255</td>
              <td>Experimental</td>
              <td>[This Document]</td>
            </tr>
          </tbody>
        </table>

        <t>New C-Type values in the range 4-253 are to be assigned
        via Specification Required policy
        <xref target="RFC8126"/>.</t>
      </section>

      <section anchor="iana-metric-type" numbered="true"
               toc="default">
        <name>Congestion Metric Type Values</name>

        <t>This document requests IANA to create a new registry
        titled "Long-haul CNP Congestion Metric Type Values". The
        initial contents of this registry are:</t>

        <table>
          <thead>
            <tr>
              <th>Value</th>
              <th>Name</th>
              <th>Unit</th>
              <th>Reference</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>0</td>
              <td>Unspecified</td>
              <td>N/A</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>1</td>
              <td>Queue Depth</td>
              <td>Kilobytes</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>2</td>
              <td>Queue Growth Rate</td>
              <td>Kilobytes/ms</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>3</td>
              <td>ECN Marking Rate</td>
              <td>Percentage (0-100)</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>4</td>
              <td>RTT-based Metric</td>
              <td>Microseconds</td>
              <td>[This Document]</td>
            </tr>
            <tr>
              <td>5-253</td>
              <td>Unassigned</td>
              <td></td>
              <td></td>
            </tr>
            <tr>
              <td>254-255</td>
              <td>Experimental</td>
              <td></td>
              <td>[This Document]</td>
            </tr>
          </tbody>
        </table>

        <t>New values in the range 5-253 are to be assigned via
        Specification Required policy
        <xref target="RFC8126"/>.</t>
      </section>

      <section anchor="iana-rocev2" numbered="true" toc="default">
        <name>RoCEv2 Reserved Bit Coordination</name>

        <t>The RoCEv2 Long-haul CNP format defined in this document
        proposes the use of the most significant bit of the 6-bit
        reserved field following the BECN bit in the BTH as an
        Extension Present (E) bit. The RoCEv2 BTH format is defined
        by the InfiniBand Trade Association (IBTA), and the reserved
        field is under IBTA governance.</t>

        <t>This document respectfully requests that the IBTA
        consider allocating this bit as the "Long-haul Extension
        Present" indicator for CNP packets (Opcode 0x81). The E bit
        definition specified in this document is a proposal intended
        to facilitate IBTA review; it does not constitute a
        unilateral allocation by the IETF of IBTA-governed
        protocol space.</t>

        <t>Until the IBTA formally approves this allocation,
        implementations of the RoCEv2 format defined in this
        document are considered experimental and MUST be deployed
        only in controlled environments where all endpoints
        and intermediate nodes are known to support this
        extension. Specifically, implementations MUST NOT send
        Long-haul CNP packets in RoCEv2 format to an endpoint
        unless that endpoint has been explicitly configured to
        accept them or has negotiated support for them.</t>
      </section>
    </section>
  </middle>

  <back>
    <references>
      <name>References</name>

      <references>
        <name>Normative References</name>
        &RFC2119;
        &RFC3168;
        &RFC4443;
        &RFC4884;
        &RFC8174;

        <reference anchor="RFC8126">
          <front>
            <title>Guidelines for Writing an IANA Considerations
            Section in RFCs</title>
            <author surname="Cotton" initials="M."
                    fullname="Michelle Cotton"/>
            <author surname="Leiba" initials="B."
                    fullname="Barry Leiba"/>
            <author surname="Narten" initials="T."
                    fullname="Thomas Narten"/>
            <date year="2017" month="June"/>
          </front>
          <seriesInfo name="BCP" value="26"/>
          <seriesInfo name="RFC" value="8126"/>
        </reference>
      </references>

      <references>
        <name>Informative References</name>

        <reference anchor="I-D.xiao-rtgwg-rocev2-fast-cnp">
          <front>
            <title>Fast Congestion Notification Packet (CNP) in
            RoCEv2 Networks</title>
            <author surname="Min" initials="X."
                    fullname="Xiao Min"/>
            <author surname="Li" initials="H."
                    fullname="Hesong Li"/>
            <author surname="Zhang" initials="K."
                    fullname="Ke Zhang"/>
            <author surname="Cheng" initials="W."
                    fullname="Weiqiang Cheng"/>
            <author surname="Yang" initials="J."
                    fullname="Jin Yang"/>
            <author surname="Zhang" initials="K."
                    fullname="Kan Zhang"/>
            <date year="2025" month="December"/>
          </front>
          <seriesInfo name="Internet-Draft"
                      value="draft-xiao-rtgwg-rocev2-fast-cnp-04"/>
        </reference>

        <reference anchor="I-D.liu-spring-srv6-cc">
          <front>
            <title>Congestion Control Based on SRv6 Path</title>
            <author surname="Liu" initials="Y."
                    fullname="Yisong Liu"/>
            <author surname="Shi" initials="H."
                    fullname="Hang Shi"/>
            <date year="2025" month="July"/>
          </front>
          <seriesInfo name="Internet-Draft"
                      value="draft-liu-spring-srv6-cc-01"/>
        </reference>

        <reference anchor="I-D.hu-rtgwg-rocev2-fcn">
          <front>
            <title>Fast congestion notification for distributed
            RoCEv2 network based on SRv6</title>
            <author surname="Hu" initials="Z."
                    fullname="Zehua Hu"/>
            <author surname="Zhu" initials="Y."
                    fullname="Yongqing Zhu"/>
            <author surname="Geng" initials="X."
                    fullname="Xuesong Geng"/>
            <date year="2025" month="March"/>
          </front>
          <seriesInfo name="Internet-Draft"
                      value="draft-hu-rtgwg-rocev2-fcn-00"/>
        </reference>

        <reference anchor="DCQCN">
          <front>
            <title>Congestion Control for Large-Scale RDMA
            Deployments</title>
            <author surname="Zhu" initials="Y."
                    fullname="Yibo Zhu"/>
            <date year="2015"/>
          </front>
          <seriesInfo name="ACM" value="SIGCOMM"/>
        </reference>

        <reference anchor="TIMELY">
          <front>
            <title>TIMELY: RTT-based Congestion Control for the
            Datacenter</title>
            <author surname="Mittal" initials="R."
                    fullname="Radhika Mittal"/>
            <date year="2015"/>
          </front>
          <seriesInfo name="ACM" value="SIGCOMM"/>
        </reference>

        <reference anchor="RoCEv2">
          <front>
            <title>Supplement to InfiniBand Architecture
            Specification Volume 1 Release 1.2.1 - Annex A17:
            RoCEv2</title>
            <author>
              <organization>InfiniBand Trade
              Association</organization>
            </author>
            <date year="2014"/>
          </front>
        </reference>

        &RFC8754;
      </references>
    </references>
  </back>
</rfc>
