<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>  <!-- Required for schema validation and schema-aware editing -->
<!-- <?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?> -->
<!-- This third-party XSLT can be enabled for direct transformations in XML processors, including most browsers -->


<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<!-- If further character entities are required then they should be added to the DOCTYPE above.
     Use of an external entity file is not recommended. -->

<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="info"
  docName="draft-hu-rtgwg-pre-ecn-wan-01"
  ipr="trust200902"
  obsoletes=""
  updates=""
  submissionType="IETF"
  xml:lang="en"
  version="3">

  <front>
    <title abbrev="Precise ECN in WAN">Precise ECN in WAN</title>

    <seriesInfo name="Internet-Draft" value="draft-hu-rtgwg-pre-ecn-wan-01"/>
   
    <author fullname="Jiayuan Hu" initials="J." role="editor" surname="Hu">
      <organization>China Telecom</organization>
      <address>
        <postal>
          <street>109, West Zhongshan Road, Tianhe District</street>
          <city>Guangzhou</city>
          <region>Guangzhou</region>
          <code>510000</code>
          <country>CN</country>
        </postal>
        <email>hujy5@chinatelecom.cn</email>
      </address>
    </author>

    <date year="2026"/>

    <area>Routing</area>
    <workgroup>Routing Area Working Group</workgroup>
    <!-- "Internet Engineering Task Force" is fine for individual submissions.  If this element is 
          not present, the default is "Network Working Group", which is used by the RFC Editor as 
          a nod to the history of the RFC Series. -->

    <keyword>ECN</keyword>
    <keyword>RoCEv2</keyword>
    <keyword>WAN</keyword>

    <abstract>
      <t>This document defines the precise use of Explicit Congestion Notification (ECN) in wide area networks
          (WANs). With the growing demand for AI computing power, the computational capacity of a single
          Artificial Intelligence Data Center (AIDC) can no longer meet the requirements of large-scale model
          training. This has led to the emergence of cross-AIDC distributed model training, which in turn requires
          RoCEv2 packets to be transmitted over WANs. AI training is highly sensitive to packet loss: even
          minimal loss can significantly degrade training efficiency. In addition, elephant flows and extreme
          concurrent traffic place higher demands on network performance.</t>
      <t>
          ECN provides active feedback about network congestion by setting the ECN bits in the IP header, which
          makes it an effective traffic control method. RFC 6040 specifies how the ECN field is handled across
          IP-in-IP tunnels, which are widely used in WANs. However, because end-to-end delay is much higher in a
          WAN than in a data center (DC), and instantaneous traffic bursts are frequent in a WAN, ECN is easily
          triggered at the wrong time. This document focuses on the precise use of ECN in WANs by defining
          different ECN reactions for different WAN transmission scenarios.</t>
    </abstract>
 
  </front>

  <middle>
    
    <section>
      <name>Introduction</name>
      <t>
        The rapid growth in demand for AI computing power, particularly for large-scale model training, has made
          distributed training across multiple Artificial Intelligence Data Centers (AIDCs) necessary. This shift
          increases the demand for reliable, high-performance transmission of RoCEv2 (RDMA over Converged Ethernet
          version 2) traffic over the WAN. However, AI workloads are highly sensitive to network congestion and
          packet loss; even minor packet drops can significantly degrade training efficiency. Because WAN links are
          long and end-to-end latency is high, feedback from traditional congestion control mechanisms may arrive
          too late to be effective. These mechanisms are reactive and cannot guarantee zero packet loss, which
          makes them insufficient for AI workloads.
      </t>
      <t>
        To address these challenges, this draft explores the precise utilization of Explicit Congestion Notification (ECN)
          in WAN environments, particularly for RoCEv2 over IP tunnels. ECN enables proactive congestion signaling by
          marking packets instead of dropping them, allowing endpoints to adjust transmission rates before congestion
          escalates. However, traditional ECN implementations face challenges in WAN scenarios, including inconsistent
          ECN propagation across tunnel boundaries and inefficient congestion response mechanisms. This work focuses on
          optimizing ECN for lossless RoCEv2 transmission in WANs by:
      </t>
      <ol>
        <li>
          Ensuring Accurate ECN Propagation: Defining rules for consistent ECN field handling across IP-in-IP
          tunnels to prevent packet loss.
        </li>
        <li>
          Enhancing Congestion Feedback: Adjusting the sending rate within a local segment of the WAN to reduce
          the impact of end-to-end latency on the congestion response.
        </li>
        <li>
          Supporting Multi-Level Congestion Signaling: Extending ECN to differentiate between congestion
          severities, improving responsiveness for AI traffic.
        </li>
      </ol>
      <t>
        By refining ECN mechanisms for WAN environments, this approach enhances network efficiency for distributed AI
        training while maintaining backward compatibility with existing protocols. The proposed framework provides a
        scalable and reliable solution for future large-scale distributed computing applications.
      </t>
    </section>
      
    <section>
      <name>Conventions Used in This Document</name>
      <section>
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
          RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119"/>
          <xref target="RFC8174"/> when, and only when, they appear in
          all capitals, as shown here.</t>
      </section>

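      <section>
        <name>Abbreviations</name>
        <dl>
          <dt>AIDC</dt><dd>Artificial Intelligence Data Center</dd>
          <dt>RoCEv2</dt><dd>RDMA over Converged Ethernet version 2</dd>
          <dt>ECN</dt><dd>Explicit Congestion Notification</dd>
          <dt>CNP</dt><dd>Congestion Notification Packet</dd>
          <dt>DC</dt><dd>Data Center</dd>
          <dt>PE</dt><dd>Provider Edge</dd>
          <dt>WAN</dt><dd>Wide Area Network</dd>
        </dl>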
      </section>
    </section>

    <section>
      <name>ECN for WAN</name>
      <section>
        <name>ECN Mechanism for WANs</name>
        <t>
          In WANs, tunneling is a fundamental technique used to encapsulate and transport data
          packets across different network domains while maintaining security, performance, and compatibility. Tunneling
          works by embedding an original packet (the inner payload) within a new packet (the outer header), allowing it
          to traverse intermediate networks that may not natively support the original protocol.
        </t>
        <t>
          ECN, a long-established congestion notification mechanism, is also used across WAN tunnels.
          <xref target="RFC6040"/> specifies how the ECN field is handled in IP-in-IP tunnels, separately for the
          tunnel ingress and the tunnel egress. The tunnel ingress has two encapsulation modes: a "compatibility
          mode", which provides backward compatibility with tunnel decapsulators that do not understand ECN, and a
          REQUIRED "normal mode". The ingress behavior is shown below:
        </t>
        <figure>
          <name>New IP in IP Encapsulation Behaviours</name>
          <artwork align="center"><![CDATA[
+-----------------+------------------------------+
| Incoming Header |    Departing Outer Header    |
| (also equal to  +---------------+--------------+
| departing Inner | Compatibility |    Normal    |
|      Header)    |       Mode    |     Mode     |
+-----------------+---------------+--------------+
|     Not-ECT     |      Not-ECT  |    Not-ECT   |
|      ECT(0)     |      Not-ECT  |     ECT(0)   |
|      ECT(1)     |      Not-ECT  |     ECT(1)   |
|       CE        |      Not-ECT  |      CE      |
+-----------------+---------------+--------------+
            ]]>
          </artwork>
        </figure>
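        <t>
          The ingress table above reduces to a single rule per mode. The following Python sketch is illustrative
          only: the names Ecn and encapsulate are hypothetical and are not defined by <xref target="RFC6040"/>.
          The codepoint values are the 2-bit ECN field encodings.
        </t>
        <figure>
          <name>Illustrative Sketch of Ingress Encapsulation</name>
          <sourcecode type="python"><![CDATA[
from enum import Enum


class Ecn(Enum):
    """The four codepoints of the 2-bit ECN field."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def encapsulate(inner: Ecn, compatibility_mode: bool = False) -> Ecn:
    """Return the ECN codepoint for the departing outer header.

    Compatibility mode always sets the outer field to Not-ECT;
    normal mode copies the incoming codepoint unchanged.
    The inner header is left as it arrived in both modes.
    """
    if compatibility_mode:
        return Ecn.NOT_ECT
    return inner
]]></sourcecode>
        </figure>
        <t>
          For example, encapsulate(Ecn.CE) yields CE in normal mode, whereas the same call with
          compatibility_mode=True yields Not-ECT, which hides the congestion mark from the tunnel path.
        </t>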
        <t>
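          The decapsulation behavior at the tunnel egress is shown below. Each row is an arriving inner header,
          each column is an arriving outer header, and the cell gives the ECN marking of the forwarded packet.
          Combinations marked "(!)" or "(!!!)" are currently unused; <xref target="RFC6040"/> treats those marked
          "(!!!)" as potentially dangerous.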
        </t>
          <figure>
            <name>New IP in IP Decapsulation Behaviour</name>
            <artwork align="center"><![CDATA[
+---------+----------------------------------------------+
|Arriving |              Arriving Outer Header           |
| Inner   +---------+------------+------------+----------+
| Header  | Not-ECT |   ECT(0)   |   ECT(1)   |    CE    |
+---------+---------+------------+------------+----------+
| Not-ECT | Not-ECT |Not-ECT(!!!)|Not-ECT(!!!)|drop (!!!)|
|  ECT(0) | ECT(0)  |  ECT(0)    |  light CE  |    CE    |
|  ECT(1) |  ECT(1) | ECT(1) (!) |  light CE  |    CE    |
|    CE   |    CE   |     CE     |    CE(!!!) |    CE    |
+---------+---------+------------+------------+----------+
          ]]>
            </artwork>
          </figure>
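          <t>
            The decapsulation table can be read as a lookup keyed by the arriving inner and outer codepoints. The
            following Python sketch transcribes it, with rows for the arriving inner headers Not-ECT, ECT(0),
            ECT(1), and CE. The names DECAP and decapsulate are hypothetical. "light CE" is kept as the label used
            in the table; how it is encoded on the wire is not specified here, so it is represented as a plain
            string.
          </t>
          <figure>
            <name>Illustrative Sketch of the Decapsulation Lookup</name>
            <sourcecode type="python"><![CDATA[
NOT_ECT, ECT0, ECT1, CE = "Not-ECT", "ECT(0)", "ECT(1)", "CE"
LIGHT_CE, DROP = "light CE", "drop"

# DECAP[inner][outer] -> marking of the forwarded packet.
DECAP = {
    NOT_ECT: {NOT_ECT: NOT_ECT, ECT0: NOT_ECT, ECT1: NOT_ECT, CE: DROP},
    ECT0: {NOT_ECT: ECT0, ECT0: ECT0, ECT1: LIGHT_CE, CE: CE},
    ECT1: {NOT_ECT: ECT1, ECT0: ECT1, ECT1: LIGHT_CE, CE: CE},
    CE: {NOT_ECT: CE, ECT0: CE, ECT1: CE, CE: CE},
}


def decapsulate(inner: str, outer: str) -> str:
    """Look up the outgoing marking for one decapsulated packet."""
    return DECAP[inner][outer]
]]></sourcecode>
          </figure>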
        <t>
            Under these rules, ECT(0) and ECT(1) can both carry the same meaning (for example, "not congestion
            marked"). However, the rules also leave room for future schemes in which ECT(1) represents a
            different condition in WAN scenarios.
          </t>
      </section>
        <section>
          <name>Two-Threshold ECN Mechanism for WAN</name>
          <t>
            The high latency and bursty nature of WAN links introduce significant challenges for timely and accurate
            congestion management. Traditional single-threshold ECN or packet-drop mechanisms often result in delayed
            feedback, causing over-correction (global synchronization) or under-correction (persistent congestion). To
            address this, this draft proposes a Two-Threshold ECN Mechanism specifically designed for WAN environments
            carrying latency-sensitive, loss-averse traffic like RoCEv2 for AI training.
          </t>
          <section>
            <name>Mechanism Overview</name>
            <t>
              This mechanism redefines the use of the ECT(1) codepoint within a controlled domain (e.g., a provider's
              WAN core), leveraging the flexibility permitted by RFC 8311. Network devices (e.g., routers, switches)
              supporting this mechanism are configured with two queue occupancy thresholds: a Lower Threshold (T1) and
              a Higher Threshold (T2).
            </t>
            <t>
              ECT(1) as "Pre-Congestion" or "Early Warning" Signal: When the average queue length exceeds T1, the
              device interprets this as incipient or light congestion. Packets whose outer IP header carries ECT(0)
              are re-marked to ECT(1) with a probability that increases linearly with the queue length; packets
              that already carry ECT(1) keep that marking. Within this WAN domain, the ECT(1) marking is defined as
              a pre-congestion notification (PCN). Its purpose is to signal impending congestion before queues
              build to a level that would increase latency or cause loss.
            </t>
            <t>
              CE as "Severe Congestion" Signal: When the average queue length exceeds T2, the device interprets this as
              severe congestion requiring immediate action. Packets are marked with the CE codepoint following a
              standard RED-like algorithm. This signal mandates a direct and measurable reduction in the data sender's
              transmission rate.
            </t>
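            <t>
              The two-threshold marking decision can be sketched as follows. This is a minimal illustration under
              stated assumptions, not a normative algorithm: the ECT(1) probability is assumed to ramp linearly
              from T1 to T2, the CE probability is assumed to ramp linearly from T2 to a hypothetical queue limit
              q_max as a simplified stand-in for a RED-like algorithm, and the names ramp, mark_outer, q_max, and
              p_max are introduced here for illustration only.
            </t>
            <figure>
              <name>Illustrative Sketch of Two-Threshold Marking</name>
              <sourcecode type="python"><![CDATA[
import random

NOT_ECT, ECT0, ECT1, CE = "Not-ECT", "ECT(0)", "ECT(1)", "CE"


def ramp(avg_queue, low, high, p_max):
    """Linear probability: 0 at or below `low`, p_max at or above `high`."""
    if avg_queue <= low:
        return 0.0
    if avg_queue >= high:
        return p_max
    return p_max * (avg_queue - low) / (high - low)


def mark_outer(ecn, avg_queue, t1, t2, q_max, p_max=1.0,
               rng=random.random):
    """Return the outer ECN codepoint after two-threshold marking."""
    if ecn in (NOT_ECT, CE):
        # Not-ECT packets are left unmarked here (drop handling is out
        # of scope for this sketch); an existing CE mark is never cleared.
        return ecn
    # Above T2: severe congestion, mark CE with the assumed CE ramp.
    if avg_queue > t2 and rng() < ramp(avg_queue, t2, q_max, p_max):
        return CE
    # Above T1: early warning, mark ECT(1) with the assumed linear ramp.
    if avg_queue > t1 and rng() < ramp(avg_queue, t1, t2, p_max):
        return ECT1
    return ecn
]]></sourcecode>
            </figure>
            <t>
              The rng parameter exists only so that the random draw can be replaced with a fixed value when the
              sketch is exercised; a device would use its own source of randomness.
            </t>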
          </section>
          <section>
            <name>Congestion Notification Packet (CNP) and Ingress PE Action</name>
            <t>
              A critical component of this mechanism is the generation and processing of a Congestion Notification
              Packet (CNP). This is a control packet generated by the congested device (or by a network controller
              monitoring it) and sent to the tunnel ingress Provider Edge (PE) device, which is the point where
              the traffic entered the WAN domain.
            </t>
            <section>
              <name>Upon Reaching T1 (ECT(1) Marking)</name>
              <t>
                The congested device generates and sends a CNP to the ingress PE. This CNP identifies the affected flow
                (e.g., via 5-tuple) and indicates a light congestion event.
              </t>
              <t>
                Ingress PE Action: Upon receiving this CNP, the ingress PE MAY take proactive measures to alleviate
                the impending congestion without involving the end host. These measures can include:
              </t>
              <ul>
                <li>
                  Local Rate Adjustment: Slightly reducing the transmission rate for the identified flow into the
                  tunnel.
                </li>
                <li>
                  Traffic Rerouting: Dynamically steering the flow to an alternative, less congested path within
                  the WAN, if one is available (e.g., using an SRv6 policy).
                </li>
              </ul>
              <t>
                ECN Propagation: Crucially, at this stage the outer ECT(1) marking is NOT copied to the inner IP
                header when the packet is decapsulated at the tunnel egress (following a modified "pipe model"
                logic for this codepoint). The end host remains unaware of this early warning, which prevents an
                over-reaction from a distant sender whose feedback loop is delayed by the WAN RTT.
              </t>
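              <t>
                The local rate adjustment at the ingress PE can be sketched as follows. The CNP layout, the class
                and method names, and the back-off factor are all hypothetical; this document does not define a
                CNP format or a specific reduction step. The sketch only shows that a light-congestion CNP lowers
                the per-flow rate limit at the ingress PE, with no signal sent to the end host.
              </t>
              <figure>
                <name>Illustrative Sketch of Ingress PE Handling of a Light-Congestion CNP</name>
                <sourcecode type="python"><![CDATA[
from dataclasses import dataclass

LIGHT, SEVERE = "light", "severe"


@dataclass(frozen=True)
class Cnp:
    """Hypothetical CNP contents: a flow 5-tuple and a congestion level."""
    flow: tuple  # (src_ip, dst_ip, protocol, src_port, dst_port)
    level: str   # LIGHT or SEVERE


class IngressPe:
    """Keeps a per-flow rate limit (bit/s) for traffic entering the tunnel."""

    def __init__(self, light_backoff=0.95):
        # Assumed multiplicative reduction applied per light CNP.
        self.light_backoff = light_backoff
        self.rate_limit = {}

    def admit(self, flow, rate_bps):
        """Install the initial rate limit for a flow."""
        self.rate_limit[flow] = rate_bps

    def on_cnp(self, cnp):
        """Apply the local rate adjustment and return the new limit.

        Only light-congestion CNPs change the limit in this sketch;
        severe congestion is left to end-host congestion control.
        """
        current = self.rate_limit[cnp.flow]
        if cnp.level == LIGHT:
            current *= self.light_backoff
            self.rate_limit[cnp.flow] = current
        return current
]]></sourcecode>
              </figure>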
            </section>
            <section>
              <name>Upon Reaching T2 (CE Marking)</name>
              <t>
                The congested device generates and sends a CNP to the ingress PE, now indicating a severe congestion event.
              </t>
              <t>
                PE Action: The PEs MUST ensure that the end host's congestion control is engaged. Because
                decapsulation takes place at the tunnel egress, the egress PE performs standard RFC 6040
                decapsulation: the CE codepoint from the outer header is copied to the inner IP header.
              </t>
              <t>
                The packet, now with CE set in the inner header, is forwarded to the receiver. The receiver's transport
                (e.g., RoCEv2) then feeds this congestion signal back to the original sender, which MUST reduce its
                transmission rate according to its congestion control algorithm.
              </t>
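              <t>
                The different treatment of the two signals at decapsulation can be sketched as follows. The sketch
                assumes that "not copying" an outer ECT(1) means forwarding the inner header unchanged, and that a
                CE-marked outer header with a Not-ECT inner header leads to a drop, as in the decapsulation table.
                The function name is hypothetical.
              </t>
              <figure>
                <name>Illustrative Sketch of Decapsulation Within the Two-Threshold Domain</name>
                <sourcecode type="python"><![CDATA[
NOT_ECT, ECT0, ECT1, CE = "Not-ECT", "ECT(0)", "ECT(1)", "CE"


def decapsulate_in_domain(inner, outer):
    """Return the forwarded inner codepoint, or None to drop the packet.

    Outer ECT(1) is the domain-local early warning and is not exposed
    to the end host; outer CE is propagated to ECN-capable packets.
    """
    if outer == CE:
        # A Not-ECT inner header cannot carry CE, so the packet is dropped.
        return None if inner == NOT_ECT else CE
    # Outer Not-ECT, ECT(0) or ECT(1): the inner header is left unchanged.
    return inner
]]></sourcecode>
              </figure>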
            </section>
          </section>
          <section>
            <name>Deployment and Compatibility Considerations</name>
            <t>
              Backward Compatibility: Devices not implementing this two-threshold mechanism will treat ECT(1) as
              equivalent to ECT(0) per <xref target="RFC3168"/>, and will process CE normally. This ensures safe co-existence and
              incremental deployment.
            </t>
            <t>
              Domain of Application: This mechanism is intended for deployment within a managed WAN domain (e.g., a
              single provider's core). The redefinition of the ECT(1) semantics is a local policy. At the egress PE
              leaving this domain, standard RFC 6040 rules apply when forwarding packets into external networks.
            </t>
            <t>
              Threshold Tuning: The values of T1 and T2 are critical and should be set based on link capacity, typical
              traffic profiles, and the desired latency-loss trade-off for the target applications (e.g., AI training).
              T1 should be set low enough to provide meaningful early warning but high enough to avoid triggering on
              transient micro-bursts.
            </t>
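            <t>
              One possible starting point for tuning is to express each threshold as the queue depth that drains
              in a target queuing delay at line rate. The following Python sketch shows that arithmetic. The link
              speed and the delay targets are hypothetical values chosen for illustration, not recommendations,
              and the function name is introduced here only for the example.
            </t>
            <figure>
              <name>Illustrative Threshold Arithmetic</name>
              <sourcecode type="python"><![CDATA[
def threshold_bytes(link_bps: int, target_delay_us: int) -> int:
    """Queue depth in bytes that drains in `target_delay_us` at line rate.

    bits = link_bps * delay_us / 1e6, and bytes = bits / 8, computed
    with integer arithmetic to avoid floating-point rounding.
    """
    return link_bps * target_delay_us // 8_000_000


# Hypothetical example: a 100 Gbit/s link, with T1 at 50 microseconds
# and T2 at 200 microseconds of queuing delay.
LINK_BPS = 100_000_000_000
T1 = threshold_bytes(LINK_BPS, 50)
T2 = threshold_bytes(LINK_BPS, 200)
print(T1, T2)  # prints: 625000 2500000
]]></sourcecode>
            </figure>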
          </section>
        </section>
      </section>
    
    <section anchor="IANA">
    <!-- All drafts are required to have an IANA considerations section. See RFC 8126 for a guide.-->
      <name>IANA Considerations</name>
      <t>TBC</t>
    </section>
    
    <section anchor="Security">
      <!-- All drafts are required to have a security considerations section. See RFC 3552 for a guide. -->
      <name>Security Considerations</name>
      <t>
        The proposed enhancements introduce new mechanisms that must be evaluated for potential security
        vulnerabilities. This section expands upon the security considerations of <xref target="RFC3168"/> and <xref target="RFC6040"/> within the
        context of this draft.
      </t>
      <section>
        <name>Threats Related to the Two-Threshold Mechanism</name>
        <t>
          CNP Spoofing and Forgery: An attacker could generate malicious CNPs and send them to an ingress PE, falsely
          indicating congestion. This could trigger unnecessary rate reduction or rerouting, leading to denial-of-service
          (performance degradation) for legitimate flows or manipulation of traffic paths for interception.
        </t>
        <t>
          Threshold Manipulation: An attacker who gains access to network device configuration could alter the T1 or
          T2 thresholds. Lowering T1 excessively would cause frequent ECT(1) marking and CNP generation, leading to
          under-utilization of the link. Raising T2 excessively could suppress legitimate CE signals, leading to
          bufferbloat and packet loss.
        </t>
        <t>
          ECN Field Tampering within the Tunnel: As noted in <xref target="RFC3168"/> and <xref target="RFC6040"/>, the outer ECN field is mutable. An
          attacker within the WAN could erase CE marks to hide congestion from the sender, or could set false CE/ECT(1)
          marks to artificially throttle flows. The two-threshold mechanism's use of ECT(1) as a significant signal
          creates a new vector for manipulation.
        </t>
      </section>
      <section>
        <name>Covert Channel Considerations</name>
        <t>
          <xref target="RFC6040"/> explicitly relaxed earlier restrictions on the covert channel bandwidth across tunnels, deeming a
          2-bit per packet channel manageable. This mechanism does not alter that fundamental assessment. However, the
          specific semantics, in which ECT(1) is not propagated beyond the domain while the ingress PE acts on a CNP,
          could theoretically be exploited by colluding ingress and egress points to encode information. This is
          considered a manageable risk within a single administrative domain.
        </t>
      </section>
    </section>
    
  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6040.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.3168.xml"/>
        <!-- The recommended and simplest way to include a well known reference -->
        
      </references>
    </references>
    
    <section anchor="Contributors" numbered="false">
      <name>Contributors</name>
      <t>Thanks to all the contributors.</t>
    </section>
    
 </back>
</rfc>
