<?xml version="1.0" encoding="US-ASCII"?>
<!-- This template is for creating an Internet Draft using xml2rfc,
     which is available here: http://xml.resource.org. -->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
     There has to be one entity for each item to be referenced.
     An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC6241 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.6241.xml">
<!ENTITY RFC7950 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7950.xml">
<!ENTITY RFC7149 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7149.xml">
<!ENTITY RFC7426 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.7426.xml">
<!ENTITY RFC8299 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8299.xml">
<!ENTITY RFC8309 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8309.xml">
<!ENTITY RFC8340 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8340.xml">
<!ENTITY RFC8453 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8453.xml">
<!ENTITY RFC8174 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml">
<!ENTITY RFC8345 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.8345.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
     please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
     (Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
     (using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-he-rtgwg-wan-pfc-00" ipr="trust200902">
  <front>
    <title>Transparent Forwarding of PFC PAUSE Frames in Wide Area
    Networks</title>

    <author fullname="Xiaoming He" initials="X." surname="He">
      <organization>China Telecom</organization>

      <address>
        <email>hexm4@chinatelecom.cn</email>
      </address>
    </author>

    <author fullname="Lijie Deng" initials="L." surname="Deng">
      <organization>China Telecom</organization>

      <address>
        <email>denglj4@chinatelecom.cn</email>
      </address>
    </author>

    <date year="2026"/>

    <area>RTGWG</area>

    <workgroup>RTGWG Working Group</workgroup>

    <keyword>PFC PAUSE Frame in WAN</keyword>

    <abstract>
      <t>This document describes a solution for transparent forwarding of PFC
      PAUSE frames in wide area networks, which does not require the nodes in
      wide area networks to support PFC flow control capabilities.</t>
    </abstract>
  </front>

  <middle>
    <section anchor="Introduction" title="Introduction">
      <t>Remote Direct Memory Access (RDMA) allows one system to access the
      memory of a remote system without interrupting the Central Processing
      Unit (CPU) of that system. RDMA lowers network latency, raises
      throughput, and reduces CPU utilization on servers and storage
      systems. RoCEv2 (RDMA over Converged Ethernet version 2) is now widely
      deployed in the lossless networks of intelligent computing centers,
      where it provides loss-free data transmission for high-performance
      computing (HPC) and for AI model training and inference.</t>

      <t>With the rapid growth in demand for computing and storage resources
      driven by large AI models and distributed storage, intelligent
      computing centers are being interconnected through wide area networks
      (WANs) so that multiple DCs can collaborate and compensate for the
      limited computing and storage resources of a single DC. The
      interconnection of Artificial Intelligence Data Centers (AIDCs)
      through WANs is gradually being accepted by the industry as a new
      network structure that provides wide-area lossless transmission for
      emerging application scenarios. Priority-based Flow Control (PFC)
      [IEEE8021Q-2022] is widely deployed in RoCEv2 networks to avoid packet
      loss caused by congestion. However, deploying PFC in WANs may lead to
      head-of-line blocking, deadlocks, and congestion spreading over a
      wider area, all of which degrade network performance. In addition,
      WANs need to provide differentiated services for various applications,
      and WAN nodes differ in buffering capacity while WAN links differ in
      delay, so PFC parameters would have to be configured differently on
      each node, which complicates network operation and maintenance.
      Therefore, the PFC mechanism is not suitable for large-scale
      deployment in WANs.</t>

      <t>This document describes a solution for transparent forwarding of PFC
      PAUSE frames in WANs, which does not require the nodes in WANs to
      support PFC flow control capabilities. As a result, end-to-end flow
      control between AIDCs interconnected through WANs can be realized with
      minimal impact on network performance.</t>
    </section>

    <section title="Conventions">
      <section title="Requirements Language">
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described in BCP
        14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only
        when, they appear in all capitals, as shown here.</t>
      </section>

      <section title="Terminology">
        <t>Abbreviations used in this document:</t>

        <t>AI: Artificial Intelligence</t>

        <t>AIDC: Artificial Intelligence Data Center</t>

        <t>DC: Data Center</t>

        <t>MAC: Media Access Control</t>

        <t>P: Provider</t>

        <t>PE: Provider Edge</t>

        <t>PFC: Priority-based Flow Control</t>

        <t>RDMA: Remote Direct Memory Access</t>

        <t>RoCEv2: RDMA over Converged Ethernet version 2</t>

        <t>SR-MPLS: Segment Routing over Multiprotocol Label Switching</t>

        <t>SRv6: Segment Routing over IPv6</t>

        <t>VXLAN: Virtual Extensible Local Area Network</t>

        <t>WAN: Wide Area Network</t>
      </section>
    </section>

    <section title="Transparent Forwarding of PFC PAUSE Frames in WANs">
      <section title="Flow Control Mechanism For PFC Frame">
        <t>PFC is a classical hop-by-hop backpressure mechanism that uses a
        dedicated Ethernet PAUSE frame. It is widely deployed in RoCEv2
        networks to avoid packet loss caused by congestion. The PFC PAUSE
        frame format is shown in Figure 1.<figure
            title="PFC PAUSE Frame Format">
            <artwork>

        +--------------------------+
 6Bytes |   DMAC(0180-C200-0001)   |
        +--------------------------+
 6Bytes |  SMAC(Sender Port MAC)   |
        +--------------------------+
 2Bytes |    Ethertype(0x8808)     |
        +--------------------------+
 2Bytes |      Opcode(0x0101)      |             +----------------------------------+
        +--------------------------+   high 8bit |              0x00                |
 2Bytes |    Class enable vector   | -->         +----------------------------------+
        +--------------------------+    low 8bit | e[7]e[6]e[5]e[4]e[3]e[2]e[1]e[0] |
 2Bytes |      PAUSE Time[0]       |             +----------------------------------+
        +--------------------------+      e[n] corresponds to priority class n
 2Bytes |      PAUSE Time[1]       |      e[n]=1, PAUSE Time[n] valid
        +--------------------------+      e[n]=0, PAUSE Time[n] invalid
 2Bytes |           ...            |
        +--------------------------+
 2Bytes |      PAUSE Time[7]       |
        +--------------------------+
 26Bytes|           Pad            |
        +--------------------------+
 4Bytes |           CRC            |
        +--------------------------+
          </artwork>
          </figure></t>
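
        <t>As an illustration only, the frame layout of Figure 1 can be
        sketched in Python as follows. The function names build_pfc_frame
        and parse_pfc_frame are hypothetical and are not defined by this
        document or by [IEEE8021Q-2022]. The sketch assumes that the 4-byte
        CRC is appended by the sending hardware, so only the first 60 bytes
        are built, and that each PAUSE Time value is expressed in pause
        quanta.</t>

        <figure>
          <artwork><![CDATA[
```python
import struct

PFC_DMAC = bytes.fromhex("0180c2000001")  # 0180-C200-0001
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101


def build_pfc_frame(smac, pause_times):
    """Build the first 60 bytes of a PFC PAUSE frame (no CRC).

    pause_times maps a priority (0-7) to a pause time (0-65535).
    """
    if len(smac) != 6:
        raise ValueError("source MAC must be 6 bytes")
    vector = 0
    times = [0] * 8
    for prio, quanta in pause_times.items():
        if not 0 <= prio <= 7:
            raise ValueError("priority must be in 0..7")
        vector |= 1 << prio  # e[prio] = 1: PAUSE Time[prio] valid
        times[prio] = quanta
    header = PFC_DMAC + smac + struct.pack(
        "!HHH", MAC_CONTROL_ETHERTYPE, PFC_OPCODE, vector)
    # Eight 2-byte PAUSE Time fields, then 26 bytes of padding.
    return header + struct.pack("!8H", *times) + bytes(26)


def parse_pfc_frame(frame):
    """Decode the fields of Figure 1 from a raw frame."""
    ethertype, opcode, vector = struct.unpack_from("!HHH", frame, 12)
    times = struct.unpack_from("!8H", frame, 18)
    return {
        "dmac": frame[0:6],
        "smac": frame[6:12],
        "ethertype": ethertype,
        "opcode": opcode,
        # Only priorities whose enable bit is set carry a valid time.
        "paused": {p: times[p] for p in range(8) if vector & (1 << p)},
    }
```
]]></artwork>
        </figure>

        <t>For example, build_pfc_frame(smac, {3: 0xFFFF}) sets e[3] in the
        class enable vector and requests the maximum pause time for
        priority 3 only.</t>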

        <t>With this flow control mechanism, a congested node sends a PFC
        frame, a dedicated Ethernet PAUSE frame, to its directly connected
        upstream node to ask it to pause the data traffic of the affected
        priority. If that upstream node becomes congested in turn, it sends
        a PFC frame to its own directly connected upstream node, and so on
        hop by hop, until the most upstream network node asks the directly
        connected traffic sender to pause. [IEEE8021Q-2022] details how this
        flow control mechanism works.</t>

        <t>Typically, when two AIDCs are interconnected through WANs, VPN
        tunnels (e.g., SR-MPLS, SRv6, VXLAN) are established between the
        ingress PE and egress PE to carry massive RDMA traffic between DCs, as
        shown in Figure 2. <figure title="AIDCs Interconnected Through WANs">
            <artwork><![CDATA[  
   +----------+                                          +----------+
   |  AIDC 1  |                                          |  AIDC 2  |
   |          |                                          |          | 
   +----------+                                          +----------+ 
       ^                                                        |        
       |PFC Frame                                      PFC Frame|    
       |                                                        v       
   +-------+       +----+      +--------+       +----+      +--------+     
   | DC1 GW|  -->  |PE1 |  --> |P1...Pn |  -->  |PE2 |  --> | DC2 GW |                                                   
   +-------+       +----+      +--------+       +----+      +--------+
      |               |                            |               |
      |<--------------|<---------------------------|<--------------|  
         PFC Frame        PFC Frame Forwarding           PFC Frame
                                                                                                                                                                                                                                                    
        ]]></artwork>
          </figure></t>
      </section>

      <section title="PFC PAUSE Frame Processing">
        <t>When congestion occurs in the destination AIDC, PFC frames are
        sent hop by hop towards the destination DC gateway. When congestion
        then occurs at the receiving port of the destination DC gateway, the
        gateway in turn asks its directly connected upstream node, the
        egress PE, to pause the data traffic by sending it a PFC frame. In
        Figure 2, AIDC 2 sends PFC frames to the DC2 gateway, and the DC2
        gateway sends PFC frames to PE2 when congestion occurs at its
        receiving port.</t>

        <t>When the egress PE node of the WAN receives a PFC frame, it needs
        to parse the frame and determine whether it is a valid PFC frame.
        Besides having the correct frame format, a valid PFC frame must have
        the multicast destination MAC address 0180-C200-0001 and a source
        MAC address equal to the port MAC address of the directly connected
        downstream DC gateway (some vendors also use the device system MAC
        address). Otherwise, the egress PE node must discard the invalid PFC
        frame.</t>
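
        <t>The validity check described above could look roughly like the
        following Python sketch. The function name is_valid_pfc_frame and
        the parameter allowed_smacs are hypothetical. allowed_smacs is
        assumed to hold the port MAC address of the directly connected DC
        gateway and, where the vendor uses it, the device system MAC
        address. Treating a non-zero high byte of the class enable vector
        as a format error is also an assumption, based on Figure 1.</t>

        <figure>
          <artwork><![CDATA[
```python
import struct

PFC_DMAC = bytes.fromhex("0180c2000001")  # 0180-C200-0001
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101
MIN_PFC_LEN = 60  # frame of Figure 1 without the 4-byte CRC


def is_valid_pfc_frame(frame, allowed_smacs):
    """Return True if the frame passes the checks in the text above."""
    if len(frame) < MIN_PFC_LEN:
        return False
    ethertype, opcode, vector = struct.unpack_from("!HHH", frame, 12)
    if ethertype != MAC_CONTROL_ETHERTYPE or opcode != PFC_OPCODE:
        return False
    if vector >> 8:  # high 8 bits are shown as 0x00 in Figure 1
        return False
    if frame[0:6] != PFC_DMAC:
        return False
    # The source must be the directly connected peer.
    return frame[6:12] in allowed_smacs
```
]]></artwork>
        </figure>

        <t>A PE implementing this check would drop any frame for which the
        function returns False instead of encapsulating it.</t>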

        <t>The egress PE node encapsulates the PFC frame according to the
        tunnel encapsulation protocol and forwards it to the next transit
        node, which in turn forwards it transparently towards the upstream
        node, until it reaches the ingress PE node.</t>

        <t>The ingress PE node decapsulates the PFC frame and replaces the
        source MAC address in the original PFC frame with the MAC address of
        its port directly connected to the source DC gateway, then forwards it
        to the source DC gateway.</t>
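
        <t>The source MAC replacement at the ingress PE amounts to
        overwriting bytes 6 to 11 of the decapsulated frame. A minimal
        Python sketch follows; the function name rewrite_source_mac is
        hypothetical, and the sketch assumes that the frame is handled
        without its CRC, which the sending port recomputes after the
        rewrite.</t>

        <figure>
          <artwork><![CDATA[
```python
def rewrite_source_mac(pfc_frame, ingress_port_mac):
    """Replace the SMAC field of a decapsulated PFC frame.

    ingress_port_mac is the MAC address of the ingress PE port that
    faces the source DC gateway.
    """
    if len(ingress_port_mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    if len(pfc_frame) < 12:
        raise ValueError("frame too short to hold DMAC and SMAC")
    # Bytes 0-5 are the DMAC, bytes 6-11 the SMAC (see Figure 1).
    return pfc_frame[:6] + ingress_port_mac + pfc_frame[12:]
```
]]></artwork>
        </figure>

        <t>All other fields, including the class enable vector and the
        PAUSE Time values, are left unchanged, so the source DC gateway
        sees the pause request as it was issued by the destination DC
        gateway.</t>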

        <t>To ensure that the PFC frames reach the ingress PE quickly and
        are not discarded in case of network congestion, it is preferable to
        assign the highest priority to the encapsulated PFC frames.</t>

        <t>Similarly, the source DC gateway needs to parse the forwarded PFC
        frame and determine whether it is a valid PFC frame. Besides having
        the correct frame format, a valid PFC frame must have the multicast
        destination MAC address 0180-C200-0001 and a source MAC address
        equal to the port MAC address of the directly connected ingress PE
        (some vendors also use the device system MAC address). Otherwise,
        the source DC gateway must discard the invalid PFC frame.</t>

        <t>The source DC gateway then sends PFC frames to the source AIDC
        (AIDC 1 in Figure 2) when congestion occurs at its receiving port.
        Consequently, end-to-end flow control between AIDCs can be realized
        across WANs.</t>

        <t>As an example, consider two AIDCs interconnected through an SRv6
        tunnel across the WAN. The encapsulated PFC frame format is as
        follows:</t>

        <figure>
          <artwork>
  +-------------------------------+
  |          IPv6 Header          |
  +-------------------------------+
  |  IPv6 Extension Header (SRH)  |
  +-------------------------------+
  |     Original PFC Frame        |
  +-------------------------------+
          </artwork>
        </figure>
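
        <t>The following Python sketch shows one way an egress PE might
        build such a packet. It is a sketch under stated assumptions, not a
        definition of the encapsulation: the function name srv6_encapsulate
        and its parameters are hypothetical, the SRH Next Header value 143
        (Ethernet) is assumed because the payload is a complete Ethernet
        frame, no SRH TLVs are carried, and the default DSCP of 48 is only
        a placeholder for whatever the operator treats as its highest
        priority.</t>

        <figure>
          <artwork><![CDATA[
```python
import ipaddress
import struct

IPPROTO_ROUTING = 43    # IPv6 Routing header
IPPROTO_ETHERNET = 143  # assumption: payload is an Ethernet frame
SRH_ROUTING_TYPE = 4    # Segment Routing Header


def srv6_encapsulate(pfc_frame, src, segments, dscp=48, hop_limit=64):
    """Prepend an IPv6 header and an SRH to a PFC frame.

    segments lists the SIDs in the order they are visited; the last
    one is assumed to identify the ingress PE.
    """
    sids = [ipaddress.IPv6Address(s).packed for s in segments]
    if not sids:
        raise ValueError("at least one segment is required")
    last = len(sids) - 1
    # Hdr Ext Len counts 8-byte units after the first 8 bytes.
    srh = struct.pack("!BBBBBBH", IPPROTO_ETHERNET, 2 * len(sids),
                      SRH_ROUTING_TYPE, last, last, 0, 0)
    srh += b"".join(reversed(sids))  # Segment List[0] is the final SID
    # Version 6, traffic class from the DSCP, flow label 0.
    first_word = (6 << 28) | ((dscp << 2) << 20)
    payload_len = len(srh) + len(pfc_frame)
    ipv6 = struct.pack("!IHBB", first_word, payload_len,
                       IPPROTO_ROUTING, hop_limit)
    # The destination address is the first SID to visit.
    ipv6 += ipaddress.IPv6Address(src).packed + sids[0]
    return ipv6 + srh + pfc_frame
```
]]></artwork>
        </figure>

        <t>Transit nodes forward the resulting packet as ordinary IPv6
        traffic and never interpret the inner PFC frame, which is why they
        do not need to support PFC.</t>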

        <t>Because transmission distances in WANs are much longer than
        inside a DC, PFC frames forwarded from the egress PE to the ingress
        PE experience a significant transmission delay. The destination DC
        gateway keeps receiving the data traffic sent by the source DC
        gateway until the source DC gateway receives the PFC frames and
        pauses the data traffic of the corresponding priority. The amount of
        data received by the destination DC gateway during this interval is
        positively correlated with the transmission delay of the PFC frame.
        To avoid packet loss caused by overflow of the receiving port queue,
        the destination DC gateway needs to reserve more buffer for the
        corresponding priority queue of the receiving port, based on the WAN
        transmission delay of the PFC frame.</t>

        <t>The buffer reserved for the priority queue of the receiving port
        at the destination DC gateway is required to meet the following
        condition:</t>

        <t>reserved buffer size of the priority queue at the receiving port
        &gt; (average receiving rate of the corresponding priority flow at
        the receiving port - average sending rate of the corresponding
        priority flow at the sending port) * forwarding delay of the PFC
        frame from the destination DC gateway to the source DC gateway.</t>
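
        <t>As a worked illustration of this condition, the Python sketch
        below computes the bound that the reserved buffer has to exceed.
        The function name buffer_lower_bound_bytes and the choice of units
        (bit/s for rates, microseconds for the delay, bytes for the
        result) are assumptions made for the example; the rates and the
        delay used in the example are made-up values, not
        measurements.</t>

        <figure>
          <artwork><![CDATA[
```python
def buffer_lower_bound_bytes(recv_rate_bps, send_rate_bps, delay_us):
    """Return the value the reserved buffer must exceed, in bytes.

    recv_rate_bps: average receiving rate of the priority flow
    send_rate_bps: average sending rate of the priority flow
    delay_us:      PFC frame forwarding delay from the destination
                   DC gateway to the source DC gateway
    """
    excess_bps = max(recv_rate_bps - send_rate_bps, 0)
    bits = excess_bps * delay_us          # bit-microseconds
    # Convert to bytes, rounding up (integer ceiling division).
    return -(-bits // (8 * 1_000_000))


# Example with made-up values: 100 Gbit/s arriving, 60 Gbit/s
# leaving, and a 5 ms PFC forwarding delay.
bound = buffer_lower_bound_bytes(100 * 10**9, 60 * 10**9, 5000)
print(bound)  # 25000000, so more than 25 MB must be reserved
```
]]></artwork>
        </figure>

        <t>Because the bound grows linearly with the forwarding delay,
        doubling the WAN distance roughly doubles the buffer that has to be
        reserved for the same rate difference.</t>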
      </section>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document has no IANA actions.</t>
    </section>

    <section anchor="scecurity" title="Security Considerations">
      <t>This document does not introduce any new security considerations.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119.xml"?>

      <?rfc include="reference.RFC.8126.xml"?>

      <?rfc include="reference.RFC.8174.xml"?>
    </references>

    <references title="Informative References">
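      <reference anchor="IEEE8021Q-2022">
        <front>
          <title>IEEE Standard for Local and Metropolitan Area
          Networks--Bridges and Bridged Networks</title>

          <author>
            <organization>IEEE</organization>
          </author>

          <date year="2022"/>
        </front>

        <seriesInfo name="IEEE Std" value="802.1Q-2022"/>
      </reference>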
    </references>
  </back>
</rfc>