<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
     category="info"
     docName="draft-li-rtgwg-congestion-aware-flowset-switching-00"
     ipr="trust200902"
     submissionType="IETF"
     consensus="true"
     xml:lang="en"
     version="3">

  <front>
    <title abbrev="Congestion-Aware Flow Switching">Congestion-Aware Adaptive Flow Table Switching for ECMP</title>

    <seriesInfo name="Internet-Draft" value="draft-li-rtgwg-congestion-aware-flowset-switching-00"/>

    <author fullname="Zhiqiang Li" initials="Z." surname="Li">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street>32 Xuanwumen West Street</street>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>lizhiqiangyjy@chinamobile.com</email>
      </address>
    </author>
    <author fullname="Zongpeng Du" initials="Z." surname="Du">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street>32 Xuanwumen West Street</street>
          <city>Beijing</city>
          <code>100053</code>
          <country>China</country>
        </postal>
        <email>duzongpeng@chinamobile.com</email>
      </address>
    </author>
    <author fullname="Wei Cheng" initials="W." surname="Cheng">
      <organization>Centec Networks</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>chengw@centec.com</email>
      </address>
    </author>
    <author fullname="Junjie Wang" initials="J." surname="Wang">
      <organization>Centec Networks</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>wangjj@centec.com</email>
      </address>
    </author>
    <author fullname="Guoying Zhang" initials="G." surname="Zhang">
      <organization>Centec Networks</organization>
      <address>
        <postal>
          <city>Suzhou</city>
          <code>215000</code>
          <country>China</country>
        </postal>
        <email>zhanggy@centec.com</email>
      </address>
    </author>

    <date year="2026" month="February" day="28"/>

    <area>Routing</area>
    <workgroup>RTGWG</workgroup>

    <keyword>ECMP</keyword>
    <keyword>Load Balancing</keyword>
    <keyword>Congestion</keyword>
    <keyword>Flow Table</keyword>
    <keyword>Adaptive</keyword>

    <abstract>
      <t>This document defines a congestion-aware adaptive flow table
      switching mechanism for Equal-Cost Multi-Path (ECMP) routing. The
      mechanism periodically assesses the congestion state of egress ports
      and progressively adjusts flow table mappings based on quantified
      congestion levels. This addresses the port congestion issues that
      occur in traditional ECMP load balancing when traffic patterns
      change suddenly or multicast traffic is present, while maintaining
      packet ordering within flows.</t>
    </abstract>
  </front>

  <middle>
    <section numbered="true" toc="default">
      <name>Introduction</name>
      <t>Equal-Cost Multi-Path (ECMP) routing is a widely deployed load
      balancing technology in data center networks <xref target="RFC2991"/>.
      Traditional ECMP distributes traffic across multiple equal-cost paths
      by hashing packet header fields, typically the five-tuple. To preserve
      packet ordering within a flow, the mapping between a flow and its
      egress port normally remains unchanged throughout the flow's
      lifetime.</t>

      <t>However, this static mapping has significant limitations in the
      following scenarios:</t>

      <t>Traffic Surge Scenario: Network traffic is highly dynamic, and the
      load on individual ports can increase suddenly. Because the
      flow-to-port mapping is static, it cannot be adjusted in time to
      alleviate the resulting congestion.</t>

      <t>Multicast Traffic Scenario: Because multicast traffic is
      replicated, it may concentrate on a small number of ports,
      exacerbating load imbalance.</t>

      <t>Existing congestion response strategies typically take one of two
      extreme approaches: no switching (keeping the original mapping until
      the flow ages out) or full switching (migrating all flows on a
      congested port at once). The former cannot respond to congestion in
      a timely manner, while the latter may simply shift the congestion to
      the target port and cause load to oscillate between ports.</t>

      <t>This document defines a congestion-aware adaptive flow table
      switching mechanism that quantifies port congestion levels and
      progressively adjusts flow table mappings to achieve dynamic
      optimization of load balancing while preserving packet ordering.</t>
    </section>

    <section numbered="true" toc="default">
      <name>Terminology and Conventions</name>

      <section numbered="true" toc="default">
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
        "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
        "OPTIONAL" in this document are to be interpreted as described in
        BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and
        only when, they appear in all capitals, as shown here.</t>
      </section>

      <section numbered="true" toc="default">
        <name>Definitions</name>
        <dl>
          <dt>ECMP (Equal-Cost Multi-Path):</dt>
          <dd>A routing strategy that distributes traffic across multiple
          paths of equal cost.</dd>

          <dt>Flow Table:</dt>
          <dd>A data structure that stores the mapping between flow
          identifiers and egress ports, ensuring that packets of the same
          flow are forwarded from the same port.</dd>

          <dt>Congestion Quantification Index (CQI):</dt>
          <dd>An integer representing the degree of port congestion,
          ranging from 0 to a configured maximum (CQI_MAX). A CQI of 0
          indicates no congestion.</dd>

          <dt>Assessment Interval:</dt>
          <dd>The time interval for port congestion state assessment.</dd>

          <dt>Flow Table Migration:</dt>
          <dd>The operation of remapping a flow table entry from one egress
          port to another.</dd>
        </dl>
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>Problem Statement</name>

      <section numbered="true" toc="default">
        <name>Limitations of Traditional ECMP</name>
        <t>Traditional ECMP load balancing uses static hash mapping. Once a
        flow is assigned to a port, the mapping remains unchanged throughout
        the flow's lifetime. This design has the following deficiencies:</t>

        <t>Delayed Response: When a port becomes congested, flows already
        mapped to that port cannot be migrated in time, causing congestion
        to persist.</t>

        <t>Load Imbalance: The randomness of traffic and the presence of
        elephant flows may cause severe load imbalance between ports.</t>
      </section>

      <section numbered="true" toc="default">
        <name>Inadequacy of Existing Solutions</name>
        <t>Flowlet Switching: This mechanism switches based on inter-packet
        gaps within a flow and relies on manually configured time thresholds.
        If the threshold is too large, it degrades to traditional ECMP; if
        too small, it may cause packet reordering.</t>

        <t>Full-Switch Strategy: Migrating all relevant flows simultaneously
        when congestion is detected may cause the target port to be instantly
        overloaded, resulting in congestion transfer.</t>
      </section>

      <section numbered="true" toc="default">
        <name>Requirements Summary</name>
        <t>A mechanism is needed that can:</t>
        <ol>
          <li>Perceive port congestion state in real time</li>
          <li>Progressively adjust flow table mappings based on congestion
          level</li>
          <li>Avoid congestion transfer and resource fluctuation</li>
          <li>Preserve packet ordering</li>
        </ol>
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>Solution Overview</name>
      <t>This mechanism defines two core functional components:</t>

      <t>Port Congestion Assessment: Periodically assesses the congestion
      state of each egress port and generates a Congestion Quantification
      Index (CQI).</t>

      <t>Adaptive Flow Table Migration: Progressively migrates flow table
      entries from congested ports to less loaded ports based on the CQI
      value.</t>

      <t>The fundamental design principle is that the higher a port's CQI
      value, the more flow table entries may be migrated away from that
      port in the current assessment interval. Each migration decrements
      the port's CQI by 1, and migration stops when the CQI reaches zero
      or no more entries need migration.</t>
    </section>

    <section numbered="true" toc="default">
      <name>Protocol Specification</name>

      <section numbered="true" toc="default">
        <name>Port Congestion Assessment</name>

        <section numbered="true" toc="default">
          <name>Assessment Interval</name>
          <t>Implementations MUST support a configurable assessment interval.
          The default value is RECOMMENDED to be in the range of 10 ms to
          100 ms.</t>

          <t>Implementations MAY adaptively adjust the assessment interval
          based on overall traffic levels: shortening the interval during
          high traffic to improve responsiveness, and lengthening it during
          low traffic to reduce overhead.</t>
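
          <t>As a non-normative illustration of the optional adaptive
          adjustment, the following Python sketch interpolates linearly
          between two bounds. The function name, the utilization input (a
          fraction between 0 and 1), and the linear mapping are assumptions
          made for this example; this document does not define them.</t>

          <sourcecode type="python"><![CDATA[
def next_interval_ms(utilization, min_ms=10.0, max_ms=100.0):
    """Shorter interval at high traffic, longer at low traffic."""
    # Clamp the (hypothetical) utilization input to [0, 1].
    u = min(1.0, max(0.0, utilization))
    # Linear interpolation: u = 0 gives max_ms, u = 1 gives min_ms.
    return max_ms - u * (max_ms - min_ms)
          ]]></sourcecode>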
        </section>

        <section numbered="true" toc="default">
          <name>Congestion Quantification Index Calculation</name>
          <t>CQI calculation SHOULD be based on one or more of the following
          metrics: port egress queue depth, port buffer utilization, and
          port packet drop counter increment.</t>

          <t>CQI is an integer in the range 0 to CQI_MAX, inclusive. The
          RECOMMENDED value for CQI_MAX is 16.</t>

          <t>The recommended CQI calculation method is:</t>

          <artwork type="pseudocode"><![CDATA[
CQI = min(CQI_MAX, floor(queue_depth / congestion_threshold))
          ]]></artwork>

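          <t>where queue_depth is the port's current egress queue depth and
          congestion_threshold is the congestion determination threshold,
          expressed in the same unit as queue_depth and RECOMMENDED to be
          10% of queue capacity.</t>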
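
          <t>The calculation above can be sketched in Python as follows.
          This is a non-normative illustration; the function name, the
          parameter names, and the use of integer arithmetic are
          assumptions of the example.</t>

          <sourcecode type="python"><![CDATA[
def compute_cqi(queue_depth, queue_capacity,
                cqi_max=16, threshold_pct=10):
    """Apply the CQI formula using integer arithmetic."""
    # congestion_threshold is a percentage of the queue capacity;
    # it is kept at 1 or more to avoid division by zero.
    threshold = max(1, queue_capacity * threshold_pct // 100)
    # Integer division implements floor() for non-negative depths.
    return min(cqi_max, queue_depth // threshold)
          ]]></sourcecode>

          <t>For a queue capacity of 1000 units, a depth of 450 yields a CQI
          of 4. With the recommended values, a completely full queue yields
          a CQI of 10, so under this formula CQI_MAX = 16 would be reached
          only if congestion_threshold were configured at or below 1/16
          (6.25%) of the queue capacity.</t>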
        </section>

        <section numbered="true" toc="default">
          <name>State Advertisement</name>
          <t>At the end of each assessment interval, the Port Congestion
          Assessment component MUST synchronize each port's CQI value to
          the Adaptive Flow Table Migration component.</t>
        </section>
      </section>

      <section numbered="true" toc="default">
        <name>Adaptive Flow Table Migration</name>

        <section numbered="true" toc="default">
          <name>Migration Decision</name>
          <t>When a packet arrives, implementations MUST process it according
          to the following rules:</t>

          <t>Rule 1 (Flow Table Entry Does Not Exist): Perform normal flow
          table learning and select the port with the lightest current
          load.</t>

          <t>Rule 2 (Port Failure): If a flow table entry exists but the
          corresponding port is unavailable, a new port MUST be selected.</t>

          <t>Rule 3 (No Congestion): If a flow table entry exists and the
          corresponding port's CQI is 0, the implementation MUST continue
          using the current port and MUST NOT perform migration.</t>

          <t>Rule 4 (Congestion Exists): If a flow table entry exists and
          the corresponding port's CQI is greater than 0, the implementation
          SHOULD perform flow table migration.</t>
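
          <t>The four rules can be read as an ordered decision, as in the
          following non-normative Python sketch. The Action names and the
          dictionary-based port state are assumptions of the example.
          Because Rule 4 is a SHOULD, the MIGRATE result marks the flow as
          a candidate for migration rather than mandating it.</t>

          <sourcecode type="python"><![CDATA[
from enum import Enum


class Action(Enum):
    LEARN = "learn"        # Rule 1: no entry for this flow
    RESELECT = "reselect"  # Rule 2: entry exists, port unavailable
    KEEP = "keep"          # Rule 3: entry exists, CQI is 0
    MIGRATE = "migrate"    # Rule 4: entry exists, CQI > 0


def decide(entry_port, port_up, cqi):
    """entry_port is None when the flow has no flow table entry."""
    if entry_port is None:
        return Action.LEARN
    if not port_up.get(entry_port, False):
        return Action.RESELECT
    if cqi.get(entry_port, 0) == 0:
        return Action.KEEP
    return Action.MIGRATE
          ]]></sourcecode>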
        </section>

        <section numbered="true" toc="default">
          <name>Migration Operation</name>
          <t>When migration is triggered, implementations MUST perform the
          following steps:</t>

          <t>Step 1: Select the port with the smallest CQI from all available
          ports as the target. If multiple candidate ports have the same CQI,
          implementations MAY use random selection or round-robin.</t>

          <t>Step 2: Update the flow table entry's egress port to the target
          port.</t>

          <t>Step 3: Decrement the original port's CQI by 1.</t>
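
          <t>The three steps can be sketched in Python as follows. This is
          a non-normative illustration. The function and variable names are
          assumptions, as is the choice to leave the entry untouched when
          no other available port has a strictly smaller CQI than the
          current port; this document does not specify that case.</t>

          <sourcecode type="python"><![CDATA[
import random


def migrate(flow_table, flow_id, cqi, port_up, rng=random):
    """Apply Steps 1-3 to one entry and return its egress port."""
    src = flow_table[flow_id]
    # Step 1: candidates are the available ports other than src.
    candidates = [p for p, up in port_up.items() if up and p != src]
    if not candidates:
        return src
    lowest = min(cqi[p] for p in candidates)
    if lowest >= cqi[src]:
        return src  # assumption: no less congested target exists
    # Ties are broken by random selection, as permitted in Step 1.
    target = rng.choice([p for p in candidates if cqi[p] == lowest])
    # Step 2: update the entry's egress port.
    flow_table[flow_id] = target
    # Step 3: decrement the original port's CQI by 1.
    cqi[src] -= 1
    return target
          ]]></sourcecode>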
        </section>

        <section numbered="true" toc="default">
          <name>Migration Quantity Control</name>
          <t>A key property of this mechanism is that the number of
          migrations scales with the congestion level. When the CQI value
          is high, more flow table entries may be migrated within a single
          assessment interval. When the CQI value is low, fewer entries are
          migrated.</t>

          <t>Implementations MUST ensure that within a single assessment
          interval, the number of flow table entries migrated from a port
          does not exceed that port's initial CQI value.</t>
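
          <t>Because each migration decrements the port's CQI by 1, the
          remaining CQI can serve as a per-interval migration budget. The
          following non-normative Python sketch shows this bound for a
          sequence of flows whose packets arrive on one congested port
          within a single interval. The function name and the list-based
          representation are assumptions of the example.</t>

          <sourcecode type="python"><![CDATA[
def migrate_within_interval(flows_on_port, initial_cqi):
    """Split flows into (migrated, kept) for one interval."""
    budget = initial_cqi  # remaining CQI for this interval
    migrated, kept = [], []
    for flow_id in flows_on_port:
        if budget > 0:
            migrated.append(flow_id)
            budget -= 1   # one migration consumes one CQI unit
        else:
            kept.append(flow_id)  # CQI is 0: entry stays in place
    return migrated, kept
          ]]></sourcecode>

          <t>For example, with ten flows and an initial CQI of 3, exactly
          three entries are migrated and the other seven keep their
          current port until the next assessment.</t>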
        </section>

        <section numbered="true" toc="default">
          <name>Continuous Migration</name>
          <t>If the CQI does not drop to 0 within an assessment interval,
          the next assessment recalculates it from the port's current
          state. If congestion persists, migration continues; if
          congestion is alleviated, migration decreases or stops.</t>
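
          <t>The behavior across several intervals can be illustrated with
          a deliberately simplified toy model, shown below in Python. The
          model assumes that every flow contributes the same fixed amount
          to the queue depth and that a migrated flow stops contributing
          immediately. These assumptions, and all names in the sketch,
          are illustrative only and do not describe real traffic.</t>

          <sourcecode type="python"><![CDATA[
def simulate(flows, depth_per_flow, capacity, intervals,
             cqi_max=16):
    """Return one (cqi, migrated, flows_left) tuple per interval."""
    threshold = max(1, capacity // 10)  # 10% of queue capacity
    history = []
    for _ in range(intervals):
        # Toy model: queue depth grows linearly with flow count.
        depth = min(capacity, flows * depth_per_flow)
        cqi = min(cqi_max, depth // threshold)
        moved = min(cqi, flows)  # at most CQI migrations
        flows -= moved
        history.append((cqi, moved, flows))
    return history
          ]]></sourcecode>

          <t>Under this model, 12 flows of 50 units each on a queue of
          capacity 1000 give CQI values of 6, 3, 1, 1, and 0 over five
          intervals: migration slows as congestion eases and then
          stops.</t>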
        </section>
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>Data Structures</name>

      <section numbered="true" toc="default">
        <name>Flow Table Entry</name>
        <t>A flow table entry MUST contain the following fields: flow
        identifier (obtained through hash calculation), egress port
        identifier, valid bit, and timestamp (for aging).</t>
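
        <t>One possible in-memory representation of such an entry is
        sketched below in Python. It is non-normative. The field names,
        the field types, and the aging helper with its timeout parameter
        are assumptions of the example.</t>

        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass


@dataclass
class FlowEntry:
    flow_id: int       # hash of the flow's header fields
    egress_port: int   # egress port identifier
    valid: bool        # valid bit
    last_seen: float   # timestamp in seconds, used for aging

    def is_expired(self, now, aging_timeout):
        """True if the entry is invalid or has aged out."""
        if not self.valid:
            return True
        return (now - self.last_seen) > aging_timeout
        ]]></sourcecode>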
      </section>

      <section numbered="true" toc="default">
        <name>Port Status Table</name>
        <t>The port status table MUST contain the following fields: port
        identifier, port status (UP/DOWN), current CQI value, and queue
        depth.</t>
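
        <t>A port status table with these fields could be modeled as in
        the following non-normative Python sketch. The field names and
        the lookup helper are assumptions of the example.</t>

        <sourcecode type="python"><![CDATA[
from dataclasses import dataclass


@dataclass
class PortStatus:
    port_id: int
    up: bool = True        # port status: True is UP, False is DOWN
    cqi: int = 0           # current CQI value
    queue_depth: int = 0   # last sampled egress queue depth


def least_congested(table):
    """Return the UP port with the smallest CQI, or None."""
    up_ports = [p for p in table.values() if p.up]
    return min(up_ports, key=lambda p: p.cqi, default=None)
        ]]></sourcecode>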
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>Operational Procedures</name>

      <section numbered="true" toc="default">
        <name>Initialization</name>
        <t>Implementations MUST perform the following at startup:</t>
        <ol>
          <li>Clear the flow table</li>
          <li>Initialize all ports' CQI to 0</li>
          <li>Start the periodic assessment task</li>
        </ol>
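
        <t>The startup steps can be sketched in Python as follows. This is
        a non-normative illustration. The class name, the injected
        start_periodic callback (which stands in for whatever timer
        facility a platform provides), and the 50 ms default are
        assumptions of the example.</t>

        <sourcecode type="python"><![CDATA[
class Forwarder:
    def __init__(self, port_ids, start_periodic, interval_ms=50):
        # 1. Clear the flow table.
        self.flow_table = {}
        # 2. Initialize every port's CQI to 0.
        self.cqi = {p: 0 for p in port_ids}
        # 3. Start the periodic assessment task.
        start_periodic(interval_ms, self.assess)

    def assess(self):
        """Placeholder for the per-interval CQI recalculation."""
        ]]></sourcecode>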
      </section>

      <section numbered="true" toc="default">
        <name>Packet Processing</name>
        <t>The packet processing flow is as follows:</t>
        <ol>
          <li>Packet arrives</li>
          <li>Calculate flow identifier</li>
          <li>Query flow table</li>
          <li>If no flow table entry exists: learn a new entry and select
          the most lightly loaded port</li>
          <li>If a flow table entry exists: check port status and CQI, and
          perform migration if needed</li>
          <li>Forward packet</li>
        </ol>
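
        <t>The processing flow above can be sketched end to end in Python
        as follows. This is a non-normative illustration under several
        assumptions: CRC-32 over the five-tuple serves as the flow
        identifier, a per-port packet counter serves as the load
        estimate, a failed port is replaced by the most lightly loaded
        available port, and a congested flow is moved only when another
        available port has a strictly smaller CQI. None of these choices
        is mandated by this document.</t>

        <sourcecode type="python"><![CDATA[
import zlib


def flow_id(five_tuple):
    """Hash the five-tuple into a flow identifier (CRC-32 here)."""
    return zlib.crc32(repr(five_tuple).encode())


def process_packet(five_tuple, flow_table, cqi, port_up, load):
    """Return the egress port chosen for one packet."""
    fid = flow_id(five_tuple)
    # Assumes at least one port is UP.
    up_ports = [p for p in port_up if port_up[p]]
    port = flow_table.get(fid)
    if port is None or not port_up[port]:
        # No entry, or port unavailable: pick the lightest load.
        port = min(up_ports, key=lambda p: load[p])
    elif cqi[port] > 0:
        # Congested: move to the least congested UP port, if better.
        target = min(up_ports, key=lambda p: cqi[p])
        if cqi[target] < cqi[port]:
            cqi[port] -= 1
            port = target
    flow_table[fid] = port
    load[port] += 1  # packet counter used as the load estimate
    return port
        ]]></sourcecode>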
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>Relationship with Existing Mechanisms</name>

      <section numbered="true" toc="default">
        <name>Relationship with ECMP</name>
        <t>This mechanism extends traditional ECMP by adding congestion
        awareness and adaptive migration capabilities. Implementations MAY
        overlay this mechanism on existing ECMP implementations.</t>
      </section>

      <section numbered="true" toc="default">
        <name>Relationship with Flowlet</name>
        <t>This mechanism MAY be used in conjunction with flowlet switching.
        Flowlet switching is triggered by inter-packet gaps within a flow,
        while this mechanism is triggered by port congestion state. The
        two can be complementary.</t>
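
        <t>One possible way to combine the two triggers, shown here only
        as a non-normative Python sketch, is to migrate a flow only when
        its current port is congested and the flow has been idle for at
        least the flowlet gap. The function name, its parameters, and the
        combination rule itself are assumptions of the example; this
        document does not define how the two mechanisms interact.</t>

        <sourcecode type="python"><![CDATA[
def may_migrate(now, last_seen, flowlet_gap, port_cqi):
    """Congestion trigger gated by a flowlet-style idle gap."""
    idle = now - last_seen  # time since the flow's previous packet
    return port_cqi > 0 and idle >= flowlet_gap
        ]]></sourcecode>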
      </section>

      <section numbered="true" toc="default">
        <name>Relationship with Congestion Control</name>
        <t>This mechanism operates at the forwarding layer and is orthogonal
        to end-to-end congestion control mechanisms such as ECN and DCQCN.
        Implementations SHOULD consider coordination with congestion control
        mechanisms.</t>
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>Security Considerations</name>

      <section numbered="true" toc="default">
        <name>Denial of Service Risk</name>
        <t>Attackers may induce frequent migration by injecting crafted
        traffic, consuming device resources.</t>

        <t>Mitigation Measures: Implementations SHOULD set a maximum number
        of migrations per unit time. Implementations SHOULD use smoothing
        algorithms for CQI calculation to avoid overreaction to instantaneous
        fluctuations.</t>
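
        <t>Both mitigations can be sketched in Python as follows. This is
        a non-normative illustration. The exponentially weighted moving
        average, the fixed-window limiter, the class names, and the
        default weight of 0.25 are assumptions of the example; other
        smoothing and rate limiting techniques would serve equally
        well.</t>

        <sourcecode type="python"><![CDATA[
class SmoothedDepth:
    """Exponentially weighted moving average of queue depth."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha   # weight of the newest sample
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = float(sample)
        else:
            a = self.alpha
            self.value = a * sample + (1 - a) * self.value
        return self.value


class MigrationLimiter:
    """Allow at most max_per_window migrations per time window."""

    def __init__(self, max_per_window, window_s):
        self.max_per_window = max_per_window
        self.window_s = window_s
        self.window_start = float("-inf")
        self.count = 0

    def allow(self, now):
        if now - self.window_start >= self.window_s:
            self.window_start = now   # start a new window
            self.count = 0
        if self.count < self.max_per_window:
            self.count += 1
            return True
        return False
        ]]></sourcecode>

        <t>With a weight of 0.25, a single sample of 800 after an empty
        queue raises the smoothed depth to only 200, so a momentary burst
        produces a smaller CQI than the raw sample would.</t>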
      </section>

      <section numbered="true" toc="default">
        <name>Information Disclosure Risk</name>
        <t>CQI values and migration decisions may reveal network topology or
        traffic pattern information.</t>

        <t>Mitigation Measures: Implementations MUST implement access control
        for related data. Inter-module communication SHOULD use security
        mechanisms.</t>
      </section>

      <section numbered="true" toc="default">
        <name>Configuration Integrity</name>
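        <t>Unauthorized or erroneous changes to configuration parameters
        such as the assessment interval, CQI_MAX, and congestion_threshold
        may suppress migration or trigger excessive migration.</t>

        <t>Mitigation Measures: Implementations MUST ensure configuration
        parameter integrity. Implementations SHOULD log configuration
        changes.</t>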
      </section>
    </section>

    <section numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>This document does not require IANA to allocate any resources.</t>
    </section>
  </middle>

  <back>
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        <reference anchor="RFC2119" target="https://www.rfc-editor.org/info/rfc2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>

        <reference anchor="RFC8174" target="https://www.rfc-editor.org/info/rfc8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>

      <references>
        <name>Informative References</name>
        <reference anchor="RFC2991" target="https://www.rfc-editor.org/info/rfc2991">
          <front>
            <title>Multipath Issues in Unicast and Multicast Next-Hop Selection</title>
            <author fullname="D. Thaler" initials="D." surname="Thaler"/>
            <author fullname="C. Hopps" initials="C." surname="Hopps"/>
            <date month="November" year="2000"/>
          </front>
          <seriesInfo name="RFC" value="2991"/>
          <seriesInfo name="DOI" value="10.17487/RFC2991"/>
        </reference>

        <reference anchor="RFC6438" target="https://www.rfc-editor.org/info/rfc6438">
          <front>
            <title>Using the IPv6 Flow Label for Equal Cost Multipath Routing and Link Aggregation in Tunnels</title>
            <author fullname="B. Carpenter" initials="B." surname="Carpenter"/>
            <author fullname="S. Amante" initials="S." surname="Amante"/>
            <date month="November" year="2011"/>
          </front>
          <seriesInfo name="RFC" value="6438"/>
          <seriesInfo name="DOI" value="10.17487/RFC6438"/>
        </reference>

        <reference anchor="RFC7098" target="https://www.rfc-editor.org/info/rfc7098">
          <front>
            <title>Using the IPv6 Flow Label for Load Balancing in Server Farms</title>
            <author fullname="B. Carpenter" initials="B." surname="Carpenter"/>
            <author fullname="S. Jiang" initials="S." surname="Jiang"/>
            <author fullname="W. Tarreau" initials="W." surname="Tarreau"/>
            <date month="January" year="2014"/>
          </front>
          <seriesInfo name="RFC" value="7098"/>
          <seriesInfo name="DOI" value="10.17487/RFC7098"/>
        </reference>
      </references>
    </references>
  </back>
</rfc>