<?xml version='1.0' encoding='utf-8'?>
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" submissionType="IETF" 
docName="draft-lll-srv6ops-qp-aware-srv6-lb-00"
category="info" ipr="trust200902" obsoletes="" updates="" xml:lang="en"
symRefs="true" sortRefs="false" tocInclude="true" version="3">
  <front>
    <title abbrev="QP-based SRv6 LB">QP-based SRv6 Load Balancing Deployment</title>
    <seriesInfo name="Internet-Draft" value="draft-lll-srv6ops-qp-aware-srv6-lb-00"/>
    <author initials="Y." surname="Liu" fullname="Yisong Liu">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street>China</street>
        </postal>
        <email>liuyisong@chinamobile.com</email>
      </address>
    </author>
	<author initials="C." surname="Lin" fullname="Changwang Lin">
      <organization>New H3C Technologies</organization>
      <address>
        <postal>
          <street>China</street>
        </postal>
        <email>linchangwang.04414@h3c.com</email>
      </address>
    </author>
	<author initials="J." surname="Li" fullname="Jiming Li">
      <organization>China Mobile</organization>
      <address>
        <postal>
          <street>China</street>
        </postal>
        <email>lijinming@chinamobile.com</email>
      </address>
    </author>
    <date year="2026"/>
    <workgroup>Network Working Group</workgroup>
    <abstract>
      <t>
      This document describes the use of Segment Routing over IPv6 (SRv6) path selection based on Queue Pair (QP)
      in Intelligent Computing Wide Area Network (WAN) for Data Center Interconnection (DCI), optimizing
      load balancing for predictable workloads.</t>
    </abstract>
  </front>
  <middle>
    <section anchor="sect-1" numbered="true" toc="default">
      <name>Introduction</name>
      <t>
      The proliferation of RDMA technology in Intelligent Computing Data Center (DC) fabrics has revolutionized
      high-performance computing, distributed storage, and machine learning workloads.</t>
      <t>
      These workloads generate large, predictable flows that demand ultra-low latency, high bandwidth,
      and precise congestion control to ensure optimal performance. Traditional networking methods,
      like hash-based Equal-Cost Multi-Path (ECMP) load balancing, struggle with insufficient entropy
      due to the low diversity of RDMA (specifically RDMA over Converged Ethernet v2, abbreviated as RoCEv2)
      <xref target="IBTA-SPEC" format="default"/>
      flow identifiers. This often results in fabric hotspots, network congestion, and performance degradation.</t>
      <t>
      RoCEv2 messages are transmitted across the intelligent computing Wide Area Network (WAN) used
      for Data Center Interconnection (DCI) in the same way as inside the DC. The WAN therefore also
      carries elephant flows, which lead to hotspots, network congestion, and performance degradation.</t>
      <t>
      Segment Routing over IPv6 (SRv6) <xref target="RFC8986" format="default"/> provides flexible
      traffic engineering by supporting policy-based programmability and explicit path steering.
      An SRv6 Policy enables deterministic path steering and fine-grained traffic control for RoCEv2 flows,
      ensuring predictable performance.</t>
      <t>
      This document details SRv6 path selection based on Queue Pair (QP) to optimize load balancing
      for predictable RoCEv2 flows in intelligent computing WAN by ensuring all packets within a QP follow the same path.</t>
    </section>
    <section anchor="sect-2" numbered="true" toc="default">
        <name>Terminology</name>
        <t>
        The following terms are used in this document:</t>
        <ul>
            <li>QP (Queue Pair): A communication endpoint in the RDMA architecture,
              identified by a 24-bit Queue Pair Number (QPN).</li>
            <li>BTH (Base Transport Header): The RDMA transport header containing QP information.</li>
            <li>SRv6 Policy: A policy instantiated on an ingress node that steers traffic to an endpoint over one or more Segment Lists (SLs), each of which represents a path through the SRv6 network.</li>
            <li>SL (Segment List): An ordered list of SIDs in a Segment Routing Header (SRH) <xref target="RFC8754" format="default"/>.</li>
            <li>ECMP (Equal-Cost Multi-Path): A routing technique for load-balancing traffic across multiple best-path routes.</li>
        </ul>
        <section anchor="sect-2.1" numbered="true" toc="default">
          <name>Requirements Language</name>
          <t>
          The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
          "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
          "OPTIONAL" in this document are to be interpreted as described in
          BCP 14 <xref target="RFC2119" format="default"/> <xref target="RFC8174" format="default"/> 
          when, and only when, they appear in all capitals, as shown here.</t>
        </section>
    </section>
    <section anchor="sect-3" numbered="true" toc="default">
      <name>Problem Statement</name>
      <t>
        Traditional ECMP load balancing faces several challenges with RoCEv2 flows:</t>
      <ul>
          <li>Insufficient Entropy: The relatively static 5-tuple of RoCEv2 flows provides limited entropy,
              which leads to poor load balancing of elephant flows.</li>
          <li>Persistent Hotspots: When multiple elephant flows hash to the same path, they create persistent
              hotspots that cannot be resolved without manual intervention or flow termination.</li>
          <li>Poor Failure Convergence: When a link fails, all flows previously using that link are rehashed
              to remaining paths. This sudden influx of elephant flows can overwhelm the remaining links,
              causing secondary congestion.</li>
          <li>Lack of Application Awareness: ECMP operates purely on packet header fields without understanding
              the application-level semantics of QPs.</li>
      </ul>
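      <t>
      As a rough illustration of the first two items, the following Python sketch hashes a few RoCEv2 flows
      onto four equal-cost paths. The CRC-32 hash, the addresses, and the source ports are assumptions chosen
      for illustration only; real devices use vendor-specific hash functions. Because the UDP destination port
      is fixed and only the source port varies, nothing prevents several elephant flows from sharing a path.</t>
      <sourcecode type="python"><![CDATA[
```python
import zlib
from collections import Counter

PATHS = ["Path1", "Path2", "Path3", "Path4"]
ROCEV2_UDP_PORT = 4791  # UDP destination port used by RoCEv2


def ecmp_path(src_ip, dst_ip, proto, src_port, dst_port, paths=PATHS):
    """Pick a path from a hash of the 5-tuple (generic ECMP stand-in)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return paths[zlib.crc32(key) % len(paths)]


# Hypothetical flows between one host pair; only the source port differs.
flows = [("2001:db8:a::1", "2001:db8:b::1", 17, sport, ROCEV2_UDP_PORT)
         for sport in (49152, 49153, 49154, 49155)]

load = Counter(ecmp_path(*flow) for flow in flows)
for path in PATHS:
    # An uneven spread means some elephant flows share a path.
    print(path, load[path])
```
]]></sourcecode>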
    </section>
    <section anchor="sect-4" numbered="true" toc="default">
      <name>QP-based SRv6 Path Selection</name>
      <t>
      By encoding an ordered list of segments in the packet header, an SRv6 Policy allows the ingress device
      to directly steer RoCEv2 workload traffic through the fabric.</t>
      <t>
      FlowSpec, as a traffic steering tool, can direct RoCEv2 flows to different SRv6 Policies based on
      their characteristics (such as the destination QP), so that they are forwarded along different paths.
      QP-based FlowSpec protocol extensions are beyond the scope of this document.</t>
      <t>
      QP-to-SRv6 Policy Mapping:</t>
      <ul>
        <li>Upon ingress, the RoCEv2 packet is parsed to extract the destination QP identifier
            from its Base Transport Header (BTH).</li>
        <li>When multiple SRv6 Policies exist, the destination QP is mapped to a corresponding SRv6 Policy
            via a pre-configured mapping table. The mapping table can be populated by local configuration
            or by the FlowSpec mechanism.</li>
      </ul>
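      <t>
      The two steps above can be sketched as follows. This is a minimal illustration, assuming that the
      UDP payload of the RoCEv2 packet starts with the 12-octet BTH and that the mapping table is a simple
      dictionary; the function names, QP numbers, and table contents are hypothetical.</t>
      <sourcecode type="python"><![CDATA[
```python
import struct

BTH_LEN = 12  # the BTH is 12 octets long

# Hypothetical mapping table: destination QP number -> SRv6 Policy name.
QP_TO_POLICY = {0x000010: "Policy 1", 0x000012: "Policy 2"}


def dest_qp_from_bth(udp_payload: bytes) -> int:
    """Return the 24-bit Destination QP from the BTH at payload start."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload too short to contain a BTH")
    # Octet 4 is reserved; octets 5..7 carry the Destination QP.
    (word,) = struct.unpack_from("!I", udp_payload, 4)
    return word & 0x00FFFFFF


def policy_for(udp_payload: bytes):
    """Look up the SRv6 Policy; None means no QP-specific entry."""
    return QP_TO_POLICY.get(dest_qp_from_bth(udp_payload))


# Sample BTH: opcode, flags, P_Key, reserved + DestQP, AckReq + PSN.
sample_bth = struct.pack("!BBHII", 0x0A, 0x00, 0xFFFF, 0x000012, 0x1)
print(hex(dest_qp_from_bth(sample_bth)), policy_for(sample_bth))
# prints: 0x12 Policy 2
```
]]></sourcecode>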
      <t>
      Enhanced Hash-Based Segment List (SL) Scheduling:</t>
      <ul>
        <li>For the selected SRv6 Policy (which may contain multiple SLs),
            an enhanced hash algorithm, using the QP as a key input, deterministically
            selects one specific SL for use.</li>
        <li>The chosen SRv6 SL is applied to the RoCEv2 packet,
            which is then forwarded accordingly.</li>
      </ul>
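      <t>
      The SL selection step can be sketched as below. The document does not specify the enhanced hash
      algorithm, so CRC-32 over the 24-bit QP number is used here purely as a stand-in; the function name
      and the SL labels are hypothetical.</t>
      <sourcecode type="python"><![CDATA[
```python
import zlib


def select_sl(dest_qp: int, segment_lists):
    """Deterministically pick one SL of a policy from the QP number."""
    if not segment_lists:
        raise ValueError("policy has no segment lists")
    key = (dest_qp & 0xFFFFFF).to_bytes(3, "big")
    return segment_lists[zlib.crc32(key) % len(segment_lists)]


# Hypothetical SLs of one SRv6 Policy.
policy_sls = ["SL1", "SL2"]

# Every packet of a given QP yields the same SL, so the QP stays on
# one path regardless of how many packets it sends.
choices = {select_sl(0x000010, policy_sls) for _ in range(1000)}
print(choices)
```
]]></sourcecode>
      <t>
      Because the QP number is the only hash input in this sketch, all packets of a QP follow the same SL.</t>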
    </section>
    <section anchor="sect-5" numbered="true" toc="default">
      <name>Deployment Illustration</name>
      <t>
      A typical WAN topology for DCI is shown in the figure below.</t>
      <figure anchor="figure1">
        <name>Reference Topology</name>
        <artwork name="" type="" align="center" alt=""><![CDATA[
               +-------------+
               |     DC1     |
               |             |
               |  QP1 ~ QP4  |
               +------+------+
+---------------------|---------------------+
|                +----+----+          WAN   |
|                |   PE1   |                |
|                +----+----+                |
|                     |                     |
|     +---------+-----+-----+---------+     |
|     |         |           |         |     |
|  +--+--+   +--+--+     +--+--+   +--+--+  |
|  |  P1 |   |  P2 |     |  P3 |   |  P4 |  |
|  +--+--+   +--+--+     +--+--+   +--+--+  |
|     |         |           |         |     |
|     +---------+-----+-----+---------+     |
|                     |                     |
|                +----+----+                |
|                |   PE2   |                |
|                +----+----+                |
+---------------------|---------------------+
               +------+------+
               |     DC2     |
               |             |
               |  QP1 ~ QP4  |
               +-------------+
]]></artwork>
      </figure>
      <t>
      The topology consists of two Provider Edge (PE) devices and four Provider (P) devices. Each PE is connected
      to all four P devices and to one DC.</t>
      <t>
      In this example, there are two DCs, between which four QPs (QP1 through QP4) are established to transmit RoCEv2 workloads.</t>
      <t>
      In the above topology, four paths traverse the WAN from DC1 to DC2:</t>
      <ul>
        <li>Path1: PE1 -> P1 -> PE2</li>
        <li>Path2: PE1 -> P2 -> PE2</li>
        <li>Path3: PE1 -> P3 -> PE2</li>
        <li>Path4: PE1 -> P4 -> PE2</li>
      </ul>
      <section anchor="sect-5-1" numbered="true" toc="default">
        <name>SRv6 Policy Provisioning</name>
        <t>
        During the Day-0 cluster fabric bring-up, the topology is provisioned with SRv6 SIDs on the PE and P devices.
        These SIDs are statically configured, making them independent of any dynamic routing protocol state.</t>
        <t>
        PE1 creates two SRv6 Policies with PE2 as the endpoint. Each SRv6 Policy contains two SLs,
        provisioned as follows:</t>
        <ul>
          <li>
            <t>SRv6 Policy 1 (low-latency):</t>
            <ul>
              <li>SL1: PE1 -> P1 -> PE2</li>
              <li>SL2: PE1 -> P2 -> PE2</li>
            </ul>
          </li>
          <li>
            <t>SRv6 Policy 2 (high-bandwidth):</t>
            <ul>
              <li>SL1: PE1 -> P3 -> PE2</li>
              <li>SL2: PE1 -> P4 -> PE2</li>
            </ul>
          </li>
        </ul>
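        <t>
        The provisioning above can be represented as a small data model. The sketch below is an illustration
        only: the SID values are hypothetical addresses from the IPv6 documentation prefix, and the class and
        method names are assumptions rather than any vendor's configuration model. Each SL is stored as the
        SIDs of the nodes after PE1, in forwarding order.</t>
        <sourcecode type="python"><![CDATA[
```python
import ipaddress
from dataclasses import dataclass, field

# Hypothetical SIDs from the documentation prefix 2001:db8::/32.
SID = {
    "P1": "2001:db8:1::1", "P2": "2001:db8:2::1",
    "P3": "2001:db8:3::1", "P4": "2001:db8:4::1",
    "PE2": "2001:db8:e2::1",
}


@dataclass
class SRv6Policy:
    name: str
    endpoint: str
    segment_lists: dict = field(default_factory=dict)

    def add_sl(self, sl_name, *nodes):
        """Store an SL as the SIDs of the nodes, in forwarding order."""
        sids = [ipaddress.IPv6Address(SID[n]) for n in nodes]
        self.segment_lists[sl_name] = sids


policy1 = SRv6Policy("Policy 1 (low-latency)", SID["PE2"])
policy1.add_sl("SL1", "P1", "PE2")
policy1.add_sl("SL2", "P2", "PE2")

policy2 = SRv6Policy("Policy 2 (high-bandwidth)", SID["PE2"])
policy2.add_sl("SL1", "P3", "PE2")
policy2.add_sl("SL2", "P4", "PE2")

for policy in (policy1, policy2):
    for sl_name, sids in policy.segment_lists.items():
        print(policy.name, sl_name, [str(s) for s in sids])
```
]]></sourcecode>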
      </section>
      <section anchor="sect-5-2" numbered="true" toc="default">
        <name>QP-based SRv6 Path Orchestration</name>
        <t>
        The fabric is now orchestrating four AI workloads. During this orchestration,
        the collective communication among DCs necessitates periodic data transmission from DC1 to DC2.</t>
        <t>
        Between DC1 and DC2, each AI workload is carried on a separate QP: QP1, QP2, QP3, and QP4.</t>
        <t>
        During AI job computation, RoCEv2 packets are first steered to different SRv6 Policies based on QP,
        which provides coarse-grained traffic classification and isolation. Within a single policy,
        the QP is then used as the hash key for SL selection, distributing the QP flows evenly across the candidate paths (SLs)
        of that policy to achieve fine-grained load balancing.</t>
        <section anchor="sect-5-2-1" numbered="true" toc="default">
          <name>QP-based SRv6 Policy Mapping</name>
          <t>
          Assume that the AI training task traffic carried by each QP has different requirements for link quality.
          The traffic of QP1 and QP2 requires a low-latency path (Policy 1), while the traffic of QP3 and QP4 requires
          a high-bandwidth path (Policy 2).
          On PE1, the QP-to-SRv6 Policy Mapping Table is created as shown below:</t>
          <table anchor="tbl-mapping" align="center" pn="table-1">
            <name slugifiedName="name-mapping">Mapping Table</name>
            <thead>
              <tr>
                <th align="left" colspan="1" rowspan="1">QP Range</th>
                <th align="left" colspan="1" rowspan="1">SRv6 Policy Name</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" colspan="1" rowspan="1">QP1, QP2</td>
                <td align="left" colspan="1" rowspan="1">Policy 1</td>
              </tr>
              <tr>
                <td align="left" colspan="1" rowspan="1">QP3, QP4</td>
                <td align="left" colspan="1" rowspan="1">Policy 2</td>
              </tr>
            </tbody>
          </table>
          <ul>
            <li>During AI job computation, DC1 generates RoCEv2 packets destined for DC2 for each AI job.</li>
            <li>Each RoCEv2 packet received by PE1 is mapped to its corresponding SRv6 Policy
                according to <xref target="tbl-mapping" format="default"/>, based on the destination QP it carries.</li>
          </ul>
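          <t>
          A lookup against the mapping table could look like the following sketch. The numeric QP values
          standing in for QP1 through QP4 are hypothetical, as is the choice of inclusive ranges and a
          linear search; a hardware implementation would more likely use TCAM or a similar structure.</t>
          <sourcecode type="python"><![CDATA[
```python
# Hypothetical QP numbers standing in for QP1..QP4 of the example.
QP1, QP2, QP3, QP4 = 0x000010, 0x000011, 0x000012, 0x000013

# Mapping table rows: (first QP, last QP, SRv6 Policy name), inclusive.
MAPPING_TABLE = [
    (QP1, QP2, "Policy 1"),
    (QP3, QP4, "Policy 2"),
]


def lookup_policy(dest_qp: int, table=MAPPING_TABLE):
    """Return the policy whose QP range contains dest_qp, else None."""
    for first, last, policy in table:
        if first <= dest_qp <= last:
            return policy
    return None  # no entry: the caller applies its default behavior


for name, qp in (("QP1", QP1), ("QP2", QP2), ("QP3", QP3), ("QP4", QP4)):
    print(name, "->", lookup_policy(qp))
# prints:
# QP1 -> Policy 1
# QP2 -> Policy 1
# QP3 -> Policy 2
# QP4 -> Policy 2
```
]]></sourcecode>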
        </section>
        <section anchor="sect-5-2-2" numbered="true" toc="default">
          <name>QP-based SL Orchestration</name>
          <t>
            Within the SRv6 Policy selected for each RoCEv2 packet, a QP-based hash algorithm is used to
            select one specific SL for forwarding the packet. Assuming the hash distributes the QPs evenly, the result is as follows:</t>
          <ul>
            <li>QP1 Packet: SRv6 Policy 1 -> SL1 (PE1->P1->PE2)</li>
            <li>QP2 Packet: SRv6 Policy 1 -> SL2 (PE1->P2->PE2)</li>
            <li>QP3 Packet: SRv6 Policy 2 -> SL1 (PE1->P3->PE2)</li>
            <li>QP4 Packet: SRv6 Policy 2 -> SL2 (PE1->P4->PE2)</li>
          </ul>
          <t>
            PE1 encapsulates each RoCEv2 packet with an outer IPv6 header and an SRH carrying the selected SL,
            and then forwards it on the corresponding link.</t>
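          <t>
          As an illustration of the encapsulation step, the sketch below encodes an SRH in the layout defined by
          <xref target="RFC8754" format="default"/> for one selected SL. It assumes that the inner RoCEv2 packet
          is IPv6, that no SRH TLVs are present, and that the full SL is carried (no reduced encapsulation);
          the SID values and the function name are hypothetical.</t>
          <sourcecode type="python"><![CDATA[
```python
import ipaddress
import struct

SRH_ROUTING_TYPE = 4   # Routing Type value of the SRH
NEXT_HEADER_IPV6 = 41  # assumption: the inner packet is IPv6


def build_srh(segment_list, next_header=NEXT_HEADER_IPV6) -> bytes:
    """Encode an SRH for an SL given in forwarding order."""
    if not segment_list:
        raise ValueError("segment list must not be empty")
    # Segments are stored in reverse: Segment List[0] is the last one.
    encoded = [ipaddress.IPv6Address(s).packed
               for s in reversed(segment_list)]
    n = len(encoded)
    hdr_ext_len = 2 * n    # 8-octet units beyond the first 8 octets
    segments_left = n - 1  # index of the first segment to visit
    last_entry = n - 1
    fixed = struct.pack("!BBBBBBH", next_header, hdr_ext_len,
                        SRH_ROUTING_TYPE, segments_left, last_entry,
                        0, 0)  # Flags and Tag left at zero
    return fixed + b"".join(encoded)


# Hypothetical SIDs for SL1 of Policy 1 (P1, then PE2).
sl1 = ["2001:db8:1::1", "2001:db8:e2::1"]
srh = build_srh(sl1)
outer_dst = sl1[0]  # the outer IPv6 destination is the first segment
print(len(srh), outer_dst)
# prints: 40 2001:db8:1::1
```
]]></sourcecode>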
          <t>
            The PE1->P1 link carries the traffic of QP1, the PE1->P2 link carries the traffic of QP2,
            the PE1->P3 link carries the traffic of QP3, and the PE1->P4 link carries the traffic of QP4.</t>
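          <t>
          The resulting distribution can be reproduced with a short worked example. The sketch below assumes
          hypothetical consecutive QP numbers for QP1 through QP4 and a simple modulo hash, which happens to
          yield the placement listed above; a different hash or different QP numbers could place two QPs of the
          same policy on one SL.</t>
          <sourcecode type="python"><![CDATA[
```python
from collections import Counter

# Hypothetical QP numbers for QP1..QP4 and the policy each maps to.
FLOWS = {
    "QP1": (0x000010, "Policy 1"), "QP2": (0x000011, "Policy 1"),
    "QP3": (0x000012, "Policy 2"), "QP4": (0x000013, "Policy 2"),
}

# Each SL is represented by the PE1 egress link it uses.
POLICY_SLS = {
    "Policy 1": ["PE1->P1", "PE1->P2"],
    "Policy 2": ["PE1->P3", "PE1->P4"],
}


def egress_link(dest_qp: int, policy: str) -> str:
    """Select an SL with a simple modulo hash on the QP number."""
    sls = POLICY_SLS[policy]
    return sls[dest_qp % len(sls)]


placement = {name: egress_link(qp, pol)
             for name, (qp, pol) in FLOWS.items()}
link_load = Counter(placement.values())
for name, link in placement.items():
    print(name, link)
# prints:
# QP1 PE1->P1
# QP2 PE1->P2
# QP3 PE1->P3
# QP4 PE1->P4
```
]]></sourcecode>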
        </section>
      </section>
    </section>
    <section anchor="sect-6" numbered="true" toc="default">
      <name>Operational Considerations</name>
      <t>
      On the ingress device, the control plane must support mapping QP ranges to SRv6 Policies, either by protocol extension
      or by local configuration. For non-RoCEv2 traffic, the system MUST revert to the standard five-tuple hash for SL
      selection.</t>
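      <t>
      The fallback rule can be sketched as a choice of hash key. The example below assumes that RoCEv2 traffic
      is recognized by UDP destination port 4791 and a payload long enough to hold the BTH; the function name
      and the tuple-based return value are hypothetical.</t>
      <sourcecode type="python"><![CDATA[
```python
ROCEV2_UDP_PORT = 4791
UDP_PROTO = 17
BTH_LEN = 12


def sl_hash_key(proto, src_ip, dst_ip, src_port, dst_port, payload=b""):
    """Return the key fed to the SL selection hash for one packet."""
    is_rocev2 = (proto == UDP_PROTO and dst_port == ROCEV2_UDP_PORT
                 and len(payload) >= BTH_LEN)
    if is_rocev2:
        # Octets 5..7 of the BTH carry the 24-bit Destination QP.
        return ("qp", int.from_bytes(payload[5:8], "big"))
    # Non-RoCEv2 traffic keeps the standard five-tuple hash input.
    return ("5tuple", (src_ip, dst_ip, proto, src_port, dst_port))


bth = bytes([0x0A, 0, 0xFF, 0xFF, 0, 0, 0, 0x12, 0, 0, 0, 1])
print(sl_hash_key(17, "2001:db8:a::1", "2001:db8:b::1",
                  49152, 4791, bth))
print(sl_hash_key(6, "2001:db8:a::1", "2001:db8:b::1", 49152, 443))
```
]]></sourcecode>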
      <t>
      The ingress devices require deep packet inspection capability to parse the BTH,
      programmable hash engines with configurable input fields, sufficient TCAM/SRAM for QP
      classification mapping tables, and support for multiple active SRv6 policies with multiple SLs.</t>
      <t>
      When network congestion or failure occurs, operators can adjust the QP-range-to-SRv6-Policy mapping
      on the ingress device to steer RoCEv2 flows onto an appropriate path.</t>
    </section>
    <section anchor="sect-7" numbered="true" toc="default">
      <name>Security Considerations</name>
      <t>
      Malicious actors could spoof QP values to bypass mapping policies,
      cause hash collisions, or exhaust specific network paths. Mitigations may include
      cryptographic validation of RoCEv2 packets, and QP whitelisting/blacklisting.</t>
      <t>
      QP values may reveal application-level information, so QP values SHOULD be anonymized or encrypted.</t>
      <t>
      The additional packet processing (such as parsing the BTH) could be exploited for
      Denial of Service (DoS) attacks; therefore, implementations MUST support graceful degradation
      mechanisms (such as rate limiting) under attack.</t>
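      <t>
      One possible form of graceful degradation is to rate-limit QP-aware classification and let excess packets
      fall back to the five-tuple hash. The sketch below assumes a token bucket with caller-supplied timestamps;
      the class name, the rate and burst values, and the fallback choice are hypothetical and are not
      requirements of this document.</t>
      <sourcecode type="python"><![CDATA[
```python
class TokenBucket:
    """Token bucket limiting how many packets per second get parsed."""

    def __init__(self, rate_pps: float, burst: int):
        self.rate = rate_pps
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill by elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def classify_mode(bucket: TokenBucket, now: float) -> str:
    """Use QP-aware parsing while tokens last, else degrade."""
    return "qp-aware" if bucket.allow(now) else "five-tuple"


bucket = TokenBucket(rate_pps=1.0, burst=2)
modes = [classify_mode(bucket, t) for t in (0.0, 0.0, 0.0, 1.0)]
print(modes)
# prints: ['qp-aware', 'qp-aware', 'five-tuple', 'qp-aware']
```
]]></sourcecode>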
    </section>
    <section anchor="sect-8" numbered="true" toc="default">
      <name>IANA Considerations</name>
      <t>
      This document has no IANA actions.</t>
    </section>
  </middle>
  <back>
    <references title="References">
    <references title="Normative References">
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8754.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8986.xml"/>
    </references>
    <references title="Informative References">
        <reference anchor="IBTA-SPEC" target="https://www.infinibandta.org/ibta-specification/">
          <front>
            <title>InfiniBand Architecture Specification</title>
            <author>
              <organization>InfiniBand Trade Association</organization>
            </author>
            <date year="2023" month="December"/>
          </front>
          <seriesInfo name="InfiniBand Architecture Specification" value="Volume 1-2, Release 1.6"/>
        </reference>
    </references>
  </references>
  </back>
</rfc>
